word sense disambiguation, wsd, thesaurus-based methods, dictionary-based methods, supervised methods, lesk algorithm, michael lesk, simplified lesk, corpus lesk, graph-based methods, word similarity, word relatedness, path-based similarity, information content, surprisal, resnik method, lin method, elesk, extended lesk, semcor, collocational features, bag-of-words features, the window, lexical semantics, computational semantics, semantic analysis in language technology.
1. Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
Word Sense Disambiguation
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2016
2. Previous Lecture: Word Senses
• Homonymy, polysemy, synonymy, metonymy, etc.
Practical activities:
1) Selectional restrictions
2) Manual disambiguation of examples using Senseval senses
Aims of practical activities:
• Students should get acquainted with real data
• Exploration of applications, resources and methods.
3. No preset solutions (this slide is to tell you that you are doing well ☺)
• Whatever your experience with data, it is a valuable experience:
• Disappointment
• Frustration
• Feeling lost
• Happiness
• Power
• Excitement
• …
• All the students so far (also in previous courses) have presented their own solutions… many different solutions, and that is OK…
4. J&M's own solutions: Selectional Restrictions (just for your records; it does not mean they are necessarily better than yours…)
5. Other possible solutions…
• Kiss → concrete sense: touching with lips/mouth
• animate kiss [using lips/mouth] animate/inanimate
• Ex: he kissed her;
• The dolphin kissed the kid
• Why does the pope kiss the ground after he disembarks... (pursed lips?)
• Kiss → figurative sense: touching
• animate kiss inanimate
• Ex: "Walk as if you are kissing the Earth with your feet."
6. NO solution or comments provided for Senseval
• All your impressions and feelings are plausible and acceptable ☺
7. Remember that in both activities…
• You have experienced cases of POLYSEMY!
• You have tried to disambiguate the senses manually, i.e. with your human skills…
9. Today: Word Sense Disambiguation (WSD)
• Given:
• A word in context;
• A fixed inventory of potential word senses;
• Create a system that automatically decides which sense of the word is correct in that context.
10. Word Sense Disambiguation: Definition
• Word Sense Disambiguation (WSD) is the TASK of determining the correct sense of a word in context.
• It is an automatic task: we create a system that automatically disambiguates the senses for us.
• Useful for many NLP tasks: information retrieval (apple the fruit or Apple the company?), question answering (does United serve Philadelphia?), machine translation (Eng. "bat" → It. pipistrello or mazza?)
11. Anecdote: the poison apple
• In 1954, Alan Turing died after biting into an apple laced with cyanide
• It was said that this half-bitten apple inspired the Apple logo… but apparently it is a legend ☺
• http://mentalfloss.com/article/64049/did-alan-turing-inspire-apple-logo
12. Be alert…
• Word sense ambiguity is pervasive!!!
13. Acknowledgements
Most slides borrowed or adapted from:
Dan Jurafsky and James H. Martin
Dan Jurafsky and Christopher Manning, Coursera
J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
16. The Simplified Lesk algorithm
• Let's disambiguate "bank" in this sentence:
The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.
• given the following two WordNet senses:

function SIMPLIFIED-LESK(word, sentence) returns best sense of word
  best-sense ← most frequent sense for word
  max-overlap ← 0
  context ← set of words in sentence
  for each sense in senses of word do
    signature ← set of words in the gloss and examples of sense
    overlap ← COMPUTEOVERLAP(signature, context)
    if overlap > max-overlap then
      max-overlap ← overlap
      best-sense ← sense
  end
  return(best-sense)

Figure 16.6 The Simplified Lesk algorithm. The COMPUTEOVERLAP function returns the number of words in common between two sets, ignoring function words or other words on a stop list. The original Lesk algorithm defines the context in a more complex way. The Corpus Lesk algorithm weights each overlapping word w by its −log P(w) and includes labeled training corpus data in the signature.

bank1 Gloss: a financial institution that accepts deposits and channels the money into lending activities
Examples: "he cashed a check at the bank", "that bank holds the mortgage on my home"
bank2 Gloss: sloping land (especially the slope beside a body of water)
Examples: "they pulled the canoe up on the bank", "he sat on the bank of the river and watched the currents"
17. The Simplified Lesk algorithm
The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.
Choose the sense with the most word overlap between gloss and context (not counting function words).
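The procedure in Figure 16.6 can be sketched in a few lines of Python (a minimal sketch: the stop list is invented for illustration, and the bank senses are the two glosses from the slide):

```python
# Simplified Lesk: pick the sense whose gloss + examples share the most
# content words with the context sentence.
STOP = {"a", "an", "the", "of", "in", "on", "at", "that", "it", "he",
        "they", "my", "and", "will", "can", "to", "into", "because"}

def tokens(text):
    """Lowercase word tokens, minus stop words."""
    return {w.strip('.,?!"()').lower() for w in text.split()} - STOP

def simplified_lesk(word, sentence, senses):
    """senses: dict sense_name -> list of gloss/example strings."""
    context = tokens(sentence)
    best_sense, max_overlap = None, -1
    for sense, texts in senses.items():
        signature = set().union(*(tokens(t) for t in texts))
        overlap = len(signature & context)
        if overlap > max_overlap:
            best_sense, max_overlap = sense, overlap
    return best_sense

senses = {
    "bank1": ["a financial institution that accepts deposits and channels "
              "the money into lending activities",
              "he cashed a check at the bank",
              "that bank holds the mortgage on my home"],
    "bank2": ["sloping land (especially the slope beside a body of water)",
              "they pulled the canoe up on the bank",
              "he sat on the bank of the river and watched the currents"],
}
sentence = ("The bank can guarantee deposits will eventually cover future "
            "tuition costs because it invests in adjustable-rate mortgage "
            "securities.")
print(simplified_lesk("bank", sentence, senses))  # bank1 (overlap: deposits, mortgage, bank)
```

Here bank1 wins because "deposits" and "mortgage" occur in both the context and its signature, while bank2's signature shares almost nothing with the sentence.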
18. Drawback
• Glosses and examples might be too short and may not provide enough chance to overlap with the context of the word to be disambiguated.
19. The Corpus(-based) Lesk algorithm
• Assumes we have some sense-labeled data (like SemCor)
• Take all the sentences with the relevant word sense:
These short, "streamlined" meetings usually are sponsored by local banks1, Chambers of Commerce, trade associations, or other civic organizations.
• Now add these to the gloss + examples for each sense, and call it the "signature" of a sense. Basically, it is an expansion of the dictionary entry.
• Choose the sense with the most word overlap between context and signature (i.e. the context words provided by the resources).
20. Corpus Lesk: IDF weighting
• Instead of just removing function words
• Weigh each word by its 'promiscuity' across documents
• Down-weights words that occur in every 'document' (gloss, example, etc.)
• These are generally function words, but it is a more fine-grained measure
• Weigh each overlapping word by its inverse document frequency (IDF).
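One way to realize this weighting (a sketch; the mini 'documents' below are invented token sets standing in for glosses and examples): score an overlap word by log(N/df) instead of 1, so a word occurring in every document contributes nothing.

```python
import math

def idf(word, documents):
    """Inverse document frequency over a collection of token sets."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / df) if df else 0.0

# Toy 'documents': each gloss or example tokenized into a set.
docs = [{"the", "bank", "holds", "the", "mortgage"},
        {"the", "canoe", "on", "the", "bank"},
        {"the", "river", "and", "the", "currents"}]

def weighted_overlap(signature, context, documents):
    """Corpus-Lesk-style overlap: sum of IDF weights of shared words."""
    return sum(idf(w, documents) for w in signature & context)

print(idf("the", docs))       # in every document -> 0.0
print(idf("mortgage", docs))  # rarer -> positive weight
```

Note that "the" gets weight exactly 0 without ever consulting a stop list, which is the fine-grained behaviour the slide describes.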
21. Graph-based methods
• First, WordNet can be viewed as a graph
• senses are nodes
• relations (hypernymy, meronymy) are edges
• Also add edges between a word and its unambiguous gloss words

[Figure: a fragment of the WordNet graph around drinking: sense nodes such as drink_v1, drink_n1, drinker_n1, drinking_n1, beverage_n1, milk_n1, liquid_n1, food_n1, toast_n4, potation_n1, sip_n1, sip_v1, sup_v1, helping_n1, consumption_n1, consumer_n1, consume_v1, connected by edges.]

An undirected graph is a set of nodes that are connected together by bidirectional edges (lines).
22. How to use the graph for WSD
"She drank some milk"
• choose the most central sense (several algorithms have been proposed recently)

[Figure: the subgraph linking the senses of "drink" (drink_v1 … drink_v5, drink_n1) and "milk" (milk_n1 … milk_n4) through nodes such as beverage_n1, boozing_n1, food_n1, nutriment_n1, drinker_n1.]
24. Word Similarity
• Synonymy: a binary relation
• Two words are either synonymous or not
• Similarity (or distance): a looser metric
• Two words are more similar if they share more features of meaning
• Similarity is properly a relation between senses
• We do not say "The word 'bank' is not similar to the word 'slope'", but we say:
• Bank1 is similar to fund3
• Bank2 is similar to slope5
• But we'll compute similarity over both words and senses
25. Why word similarity
• Information retrieval
• Question answering
• Machine translation
• Natural language generation
• Language modeling
• Automatic essay grading
• Plagiarism detection
• Document clustering
26. Word similarity and word relatedness
• We often distinguish word similarity from word relatedness
• Similar words: near-synonyms
• car, bicycle: similar
• Related words: can be related in any way
• car, gasoline: related, not similar
Cf. synonyms: car & automobile
27. Two classes of similarity algorithms
• Thesaurus-based algorithms
• Are words "nearby" in the hypernym hierarchy?
• Do words have similar glosses (definitions)?
• Distributional algorithms: next time!
• Do words have similar distributional contexts?
28. Path-based similarity
• Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
• = have a short path between them
• concepts have a path of length 1 to themselves
29. Refinements to path-based similarity
• pathlen(c1, c2) = 1 + number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2 (a distance metric)
• simpath(c1, c2) = 1 / pathlen(c1, c2)
• wordsim(w1, w2) = max sim(c1, c2) over c1 ∈ senses(w1), c2 ∈ senses(w2)
Sense similarity metric: 1 over the distance!
Word similarity metric: max similarity among pairs of senses. For all senses of w1 and all senses of w2, take the similarity between each pair of senses and then take the maximum similarity between those pairs.
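These definitions are easy to operationalize on a toy hypernym graph (a sketch: the mini-hierarchy below is invented, echoing the nickel/standard example from the next slide):

```python
from collections import deque

# Toy hypernym edges (child -> parent).
parent = {"nickel": "coin", "dime": "coin", "coin": "money",
          "money": "entity", "standard": "entity"}

# Undirected adjacency for shortest-path search.
adj = {}
for c, p in parent.items():
    adj.setdefault(c, set()).add(p)
    adj.setdefault(p, set()).add(c)

def pathlen(c1, c2):
    """1 + number of edges on the shortest path (so pathlen(c, c) = 1)."""
    seen, frontier = {c1}, deque([(c1, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == c2:
            return 1 + d
        for n in adj.get(node, ()):
            if n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))
    return float("inf")

def sim_path(c1, c2):
    return 1.0 / pathlen(c1, c2)

def wordsim(senses1, senses2):
    """Max sense-pair similarity between two words' sense lists."""
    return max(sim_path(a, b) for a in senses1 for b in senses2)

print(sim_path("nickel", "dime"))      # 1/3: two edges via 'coin'
print(sim_path("nickel", "standard"))  # 1/5: linked only through 'entity'
```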
31. Problem with basic path-based similarity
• Assumes each link represents a uniform distance
• But nickel to money seems to us to be closer than nickel to standard
• Nodes high in the hierarchy are very abstract
• We instead want a metric that:
• represents the cost of each edge independently
• makes words connected only through abstract nodes less similar
32. Information content similarity metrics
• In simple words:
• We define the probability of a concept C as the probability that a randomly selected word in a corpus is an instance of that concept.
• Basically, for each random word in a corpus we compute how probable it is that it belongs to a certain concept.
Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI
33. Formally: Information content similarity metrics
• Let's define P(c) as:
• The probability that a randomly selected word in a corpus is an instance of concept c
• Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
• for a given concept, each observed noun is either
• a member of that concept with probability P(c)
• not a member of that concept with probability 1 − P(c)
• All words are members of the root node (Entity)
• P(root) = 1
• The lower a node in the hierarchy, the lower its probability
Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI
34. Information content similarity
• For every concept (e.g. "natural elevation"), we count all the words in that concept, and then we normalize by the total number of words in the corpus:
P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N
• we get a probability value that tells us how probable it is that a random word is an instance of that concept

[Figure: a fragment of the hierarchy under entity: geological-formation, with children shore (→ coast), natural elevation (→ hill, ridge), cave (→ grotto), …]

In order to compute the probability of the term "natural elevation", we take ridge, hill + natural elevation itself.
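The count-and-normalize step can be sketched directly (a sketch: the hierarchy mirrors the figure, but the word counts are invented, not from a real corpus):

```python
# Toy hierarchy (child -> parent) with per-concept word counts.
parent = {"geological-formation": "entity",
          "shore": "geological-formation", "coast": "shore",
          "natural-elevation": "geological-formation",
          "hill": "natural-elevation", "ridge": "natural-elevation",
          "cave": "geological-formation", "grotto": "cave"}
counts = {"coast": 3, "shore": 2, "hill": 2, "ridge": 1,
          "natural-elevation": 1, "grotto": 1, "cave": 2}
N = sum(counts.values())  # total word tokens in the toy corpus

def p(c):
    """P(c): counts for concept c and everything below it, over N."""
    total, stack = 0, [c]
    while stack:
        node = stack.pop()
        total += counts.get(node, 0)
        stack.extend(ch for ch, par in parent.items() if par == node)
    return total / N

print(p("entity"))             # 1.0: every word is an instance of the root
print(p("natural-elevation"))  # (hill + ridge + itself) / N
```

Exactly as on the slide, P("natural-elevation") sums the counts for ridge, hill, and natural elevation itself; P(root) comes out as 1, and probabilities shrink as we descend.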
35. Information content similarity
• WordNet hierarchy augmented with probabilities P(c)
D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998
36. Information content: definitions
1. Information content: IC(c) = −log P(c)
2. Most informative subsumer (lowest common subsumer): LCS(c1, c2) = the most informative (lowest) node in the hierarchy subsuming both c1 and c2
37. IC aka…
• A lot of people prefer the term surprisal to information or to information content.
−log p(x) measures the amount of surprise generated by the event x. The smaller the probability of x, the bigger the surprisal. It's helpful to think about it this way, particularly for linguistics examples.
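A quick numeric check of that intuition (a minimal sketch, using log base 2 so surprisal is in bits):

```python
import math

def surprisal(p):
    """-log2 p: the rarer the event, the more surprising it is (in bits)."""
    return -math.log2(p)

print(surprisal(0.5))    # 1.0 bit: a fair coin flip
print(surprisal(0.001))  # ~10 bits: a rare event generates much more surprise
```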
38. Using information content for similarity: the Resnik method
• The similarity between two words is related to their common information
• The more two words have in common, the more similar they are
• Resnik: measure common information as:
• the information content of the most informative (lowest) subsumer (MIS/LCS) of the two nodes
• simresnik(c1, c2) = −log P(LCS(c1, c2))
Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
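With P(c) in hand, Resnik similarity is just the IC of the lowest common subsumer. A sketch (the hierarchy follows the earlier figure, but the probability values are made up for illustration):

```python
import math

# Toy hypernym chains (child -> parent) and made-up P(c) values.
parent = {"geological-formation": "entity",
          "natural-elevation": "geological-formation",
          "hill": "natural-elevation",
          "shore": "geological-formation", "coast": "shore"}
P = {"entity": 1.0, "geological-formation": 0.5,
     "natural-elevation": 0.1, "hill": 0.05,
     "shore": 0.3, "coast": 0.2}

def ancestors(c):
    """c plus its hypernym chain up to the root."""
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: first shared node walking up from c1."""
    up2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in up2)

def sim_resnik(c1, c2):
    return -math.log2(P[lcs(c1, c2)])

print(lcs("hill", "coast"))         # geological-formation
print(sim_resnik("hill", "coast"))  # -log2(0.5) = 1.0
```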
39. Dekang Lin method
• Intuition: Similarity between A and B is not just what they have in common
• The more differences between A and B, the less similar they are:
• Commonality: the more A and B have in common, the more similar they are
• Difference: the more differences between A and B, the less similar they are
• Commonality: IC(common(A, B))
• Difference: IC(description(A, B)) − IC(common(A, B))
Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML
40. Dekang Lin similarity theorem
• The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:
simLin(A, B) ∝ IC(common(A, B)) / IC(description(A, B))
• Lin (altering Resnik) defines IC(common(A, B)) as 2 × the information of the LCS:
simLin(c1, c2) = 2 log P(LCS(c1, c2)) / ( log P(c1) + log P(c2) )
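The formula can be checked numerically with a tiny helper (a sketch: the probabilities are invented, e.g. P(LCS) = 0.5 for a subsumer covering half the corpus; note the log base cancels in the ratio):

```python
import math

def sim_lin(p_lcs, p_c1, p_c2):
    """simLin(c1, c2) = 2 log P(LCS) / (log P(c1) + log P(c2))."""
    return 2 * math.log(p_lcs) / (math.log(p_c1) + math.log(p_c2))

# Made-up values: P(LCS) = 0.5, P(c1) = 0.05, P(c2) = 0.2.
print(round(sim_lin(0.5, 0.05, 0.2), 3))  # 0.301

# Sanity check: a concept compared with itself (LCS is the concept) gives 1.
print(sim_lin(0.5, 0.5, 0.5))  # 1.0
```

The ratio lands between 0 and 1: identical concepts score 1, and the score drops as the two concepts carry more information not shared through their LCS.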
42. The (extended) Lesk Algorithm
• A thesaurus-based measure that looks at glosses
• Two concepts are similar if their glosses contain similar words
• Drawing paper: paper that is specially prepared for use in drafting
• Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
• For each n-word phrase that's in both glosses
• Add a score of n²
• "paper" and "specially prepared": 1 + 2² = 5
• Compute overlap also for other relations
• glosses of hypernyms and hyponyms
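The n²-weighted phrase overlap can be sketched as a greedy matcher over the two glosses above (a simplification: whitespace tokenization, longest phrases claimed first, each position in the first gloss used at most once):

```python
def ngrams(words, n):
    """All contiguous n-word phrases, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def elesk_overlap(gloss1, gloss2):
    """Score n^2 for each maximal phrase shared by the two glosses."""
    w1, w2 = gloss1.split(), gloss2.split()
    used = [False] * len(w1)
    score = 0
    for n in range(len(w1), 0, -1):        # longest phrases first
        shared = ngrams(w2, n)
        for i in range(len(w1) - n + 1):
            if tuple(w1[i:i + n]) in shared and not any(used[i:i + n]):
                used[i:i + n] = [True] * n  # claim these positions
                score += n * n
    return score

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
print(elesk_overlap(drawing_paper, decal))  # "specially prepared" (2^2) + "paper" (1) = 5
```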
46. Basic idea
• If we have data that has been hand-labelled with correct word senses, we can use a supervised learning approach and learn from it!
• We need to extract features and train a classifier
• The output of training is an automatic system capable of assigning sense labels to unlabelled words in context.
47. Two variants of the WSD task
• Lexical Sample task
• (we need labelled corpora for individual senses)
• Small pre-selected set of target words (e.g. difficulty)
• And an inventory of senses for each word
• Supervised machine learning: train a classifier for each word
• All-words task
• (each word in each sentence is labelled with a sense)
• Every word in an entire text
• A lexicon with senses for each word
SENSEVAL 1-2-3
48. Supervised Machine Learning Approaches
• Summary of what we need:
• the tag set ("sense inventory")
• the training corpus
• a set of features extracted from the training corpus
• a classifier
49. Supervised WSD 1: WSD Tags
• What's a tag? A dictionary sense?
• For example, for WordNet an instance of "bass" in a text has 8 possible tags or labels (bass1 through bass8).
50. 8 senses of "bass" in WordNet
1. bass - (the lowest part of the musical range)
2. bass, bass part - (the lowest part in polyphonic music)
3. bass, basso - (an adult male singer with the lowest voice)
4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
5. freshwater bass, bass - (any of various North American lean-fleshed freshwater fishes especially of the genus Micropterus)
6. bass, bass voice, basso - (the lowest adult male singing voice)
7. bass - (the member with the lowest range of a family of musical instruments)
8. bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
51. SemCor
<wf pos=PRP>He</wf>
<wf pos=VB lemma=recognize wnsn=4 lexsn=2:31:00::>recognized</wf>
<wf pos=DT>the</wf>
<wf pos=NN lemma=gesture wnsn=1 lexsn=1:04:00::>gesture</wf>
<punc>.</punc>
SemCor: 234,000 words from the Brown Corpus, manually tagged with WordNet senses
52. Supervised WSD: Extract feature vectors
Intuition from Warren Weaver (1955):
"If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words… But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word… The practical question is: ``What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?''"
→ the window
54. Two kinds of features in the vectors
• Collocational features and bag-of-words features
• Collocational/Paradigmatic
• Features about words at specific positions near the target word
• Often limited to just word identity and POS
• Bag-of-words
• Features about words that occur anywhere in the window (regardless of position)
• Typically limited to frequency counts
Generally speaking, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. But here the meaning is not exactly this…
55. Examples
• Example text (WSJ): An electric guitar and bass player stand off to one side not really part of the scene
• Assume a window of +/- 2 from the target
57. Collocational features
• Position-specific information about the words and collocations in the window
• guitar and bass player stand
• word 1-, 2-, 3-grams in a window of ±3 is common

From J&M: collocational features encode local lexical and grammatical information that can often accurately isolate a given sense. For example, consider the ambiguous word bass in the following WSJ sentence:
(16.17) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
A collocational feature vector, extracted from a window of two words to the right and left of the target word, made up of the words themselves, their respective parts-of-speech, and pairs of words, that is,
[w_{i−2}, POS_{i−2}, w_{i−1}, POS_{i−1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}, w^{i−1}_{i−2}, w^{i+1}_{i}]
would yield the following vector:
[guitar, NN, and, CC, player, NN, stand, VB, and guitar, player stand]
High-performing systems generally use POS tags and word collocations of length 1, 2, and 3 from a window of 3 words to the left and 3 to the right (Zhong and Ng, 2010).
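The feature template can be sketched as follows (a sketch: the tokens are POS-tagged by hand here rather than by a tagger, and the word pairs are emitted in the order shown in the example vector above):

```python
def collocational_features(tagged, i):
    """Window of +/-2 around target index i: words, their POS tags,
    and the two word pairs, as in the example vector above."""
    (w_m2, t_m2), (w_m1, t_m1) = tagged[i - 2], tagged[i - 1]
    (w_p1, t_p1), (w_p2, t_p2) = tagged[i + 1], tagged[i + 2]
    return [w_m2, t_m2, w_m1, t_m1, w_p1, t_p1, w_p2, t_p2,
            f"{w_m1} {w_m2}", f"{w_p1} {w_p2}"]

tagged = [("An", "DT"), ("electric", "JJ"), ("guitar", "NN"), ("and", "CC"),
          ("bass", "NN"), ("player", "NN"), ("stand", "VB"), ("off", "RP")]

print(collocational_features(tagged, 4))  # target index 4 = "bass"
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB',
#  'and guitar', 'player stand']
```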
58. Bag-of-words features
• "an unordered set of words" – position ignored
• Choose a vocabulary: a useful subset of words in a training corpus
• Either: the count of how often each of those terms occurs in a given window OR just a binary "indicator" 1 or 0
59. Co-Occurrence Example
• Assume we've settled on a possible vocabulary of 12 words in "bass" sentences:
[fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
• The vector for: guitar and bass player stand
[0,0,0,1,0,0,0,0,0,0,1,0]
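The indicator vector can be sketched directly (the 12-word vocabulary is the one from the slide):

```python
VOCAB = ["fishing", "big", "sound", "player", "fly", "rod", "pound",
         "double", "runs", "playing", "guitar", "band"]

def bow_vector(window_words, vocab=VOCAB):
    """Binary bag-of-words indicators: 1 if the vocab word occurs
    anywhere in the window, else 0 (positions are ignored)."""
    window = set(window_words)
    return [1 if v in window else 0 for v in vocab]

print(bow_vector("guitar and bass player stand".split()))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```

Swapping the 1 for a count of occurrences in the window gives the frequency variant mentioned on slide 58.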
61. Classification
• Input:
• a word w and some features f
• a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Any kind of classifier:
• Naive Bayes
• Logistic regression
• Neural Networks
• Support-vector machines
• k-Nearest Neighbors
• etc.