Word Sense Disambiguation and Induction
Leon Derczynski
University of Sheffield
27 January 2011
Origin
Originally a course at ESSLLI 2010, Copenhagen
by Roberto Navigli and Simone Ponzetto
Outline
1 Introduction
2 WSD
3 WSI
4 Evaluation and Issues
5 Wikipedia
6 Summary
General Problem
Being able to disambiguate words in context is a crucial
problem
Can potentially help improve many other NLP applications
Polysemy is everywhere – our job is to model this
Ambiguity is rampant.
I saw a man who is 98 years old and can still walk and tell
jokes.
saw: 26 senses, man: 11, years: 4, old: 8, can: 5, still: 4, walk: 10, tell: 8, jokes: 3
43,929,600 possible sense combinations for this simple sentence.
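The count above is just the product of the per-word sense counts; a quick check:

```python
from math import prod

# Per-word sense counts from the example sentence.
sense_counts = {"saw": 26, "man": 11, "years": 4, "old": 8,
                "can": 5, "still": 4, "walk": 10, "tell": 8, "jokes": 3}

def possible_readings(counts):
    """Number of distinct sense assignments for the whole sentence."""
    return prod(counts.values())

print(possible_readings(sense_counts))  # 43929600
```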
Word Senses
Monosemous words – only one meaning; plant life, internet
Polysemous words – more than one meaning; bar, bass
A word sense is a commonly-accepted meaning of a word.
We are fond of fruit such as the kiwifruit and banana.
Enumerative Approach
Fixed sense inventory enumerates the range of possible
meanings of a word
Context is used to select a particular sense
chop vegetables with a knife, was stabbed with a knife
However, we may want to add senses.
WSD Tasks
Different representations of senses change the way we think
about WSD
Lexical sample – disambiguate a restricted set of words
All words – disambiguate all content words
Cross-lingual WSD – disambiguate a target word by labelling it
with the appropriate translation in other languages; e.g.
English coach → German Bus/Linienbus/Omnibus/Reisebus.
Representing the Context
Text is unstructured, and needs to be made machine-readable.
Flat representation (surface features) vs. Structured
representation (graphs, trees)
Local features: local context of a word usage, e.g. PoS tags
and surrounding word forms
Topical features: general topic of a sentence or discourse,
represented as a bag of words
Syntactic features: argument-head relations between target
and rest of sentence
Semantic features: previously established word senses
Knowledge Resources
Structured and Unstructured
Thesauri, machine-readable dictionaries, semantic networks
(WordNet)
BabelNet – Babel synsets, with semantic relations (is-a,
part-of)
Raw corpora
Collocation (Web1T)
Applications
Information extraction – acronym expansion, disambiguate
people names, domain-specific IE
Information retrieval
Machine Translation
Semantic web
Question answering
Approaches
Supervised WSD: classification task, hand-labelled data
KB WSD: uses knowledge resources, no training
Unsupervised: performs WSI
Word sense dominance: find predominant sense of a word
Domain-driven WSD: use domain information as vectors to
compare with senses of w
Outline
1 Introduction
2 WSD
3 WSI
4 Evaluation and Issues
5 Wikipedia
6 Summary
Supervised WSD
Given a set of manually sense-annotated examples (training
set), learn a classifier
Features for WSD: Bag of words, bigrams, collocations, VP
and NP heads, PoS
Using WordNet as a sense inventory, SemCor is a readily
available source of sense-labelled data
Current state-of-the-art (SotA) performance comes from SVMs
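The features listed above can be sketched as a simple extractor; this is an illustration, not the feature set of any particular system (the window size and feature names are invented):

```python
def wsd_features(tokens, i, window=2):
    """Feature dict for the target word tokens[i]: bag-of-words
    context plus position-anchored collocation features."""
    feats = {"bow=" + w: 1 for j, w in enumerate(tokens) if j != i}
    for off in range(-window, window + 1):
        if off != 0 and 0 <= i + off < len(tokens):
            # e.g. "w-1=the" means "the word immediately left of the target is 'the'"
            feats[f"w{off:+d}={tokens[i + off]}"] = 1
    return feats

toks = "he sat by the bank of the river".split()
f = wsd_features(toks, toks.index("bank"))
print(f["w+2=the"], f["w-1=the"])  # 1 1
```

A real system would add PoS tags and syntactic heads from a parser; this sketch uses only surface forms.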
Knowledge-based WSD
Exploit knowledge resources (dictionaries, thesauri,
collocations) to assign senses
Lower performance than supervised methods, but wider
coverage
No need to train or be tuned to a task/domain
Gloss Overlap
Knowledge-based method proposed by Lesk (1986)
Retrieve all sense definitions of target word
Compare each sense definition with the definitions of other
words in context
Choose the sense with the most overlap
To disambiguate pine cone:
pine: 1. a kind of evergreen tree; 2. to waste away through
sorrow.
cone: 1. a solid body which narrows to a point; 2. something
of this shape; 3. fruit of certain evergreen trees.
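Gloss overlap on the example above can be sketched as follows; the stopword list and whitespace tokenisation are simplifications (real implementations stem, so that tree/trees would also match):

```python
# Glosses from the slide, keyed by sense number.
pine = {1: "a kind of evergreen tree",
        2: "to waste away through sorrow"}
cone = {1: "a solid body which narrows to a point",
        2: "something of this shape",
        3: "fruit of certain evergreen trees"}

STOP = {"a", "of", "to", "the", "this", "which", "something"}

def tokens(gloss):
    return {w for w in gloss.lower().split() if w not in STOP}

def best_senses(glosses_a, glosses_b):
    """Pick the pair of senses whose definitions share the most words."""
    return max(((sa, sb) for sa in glosses_a for sb in glosses_b),
               key=lambda p: len(tokens(glosses_a[p[0]]) & tokens(glosses_b[p[1]])))

print(best_senses(pine, cone))  # (1, 3) -- both "evergreen" senses
```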
Lexical Chains
Knowledge-based method proposed by Hirst and St Onge
(1998)
A lexical chain is a sequence of semantically related words in a
text
Assign scores to senses based on the chain of related words it
is in
PageRank
Knowledge-based method proposed by Agirre and Soroa
(2009)
Build a graph including all synsets of words in the input text
Assign an initial low value to each node in the graph
Apply PageRank (Brin and Page) to the graph, and select
synsets with the highest PR
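A minimal power-iteration PageRank over a toy synset graph; the graph, node names, and damping factor here are invented for illustration (the actual method runs over the full WordNet graph):

```python
def pagerank(graph, damping=0.85, iters=50):
    """graph: node -> list of neighbour nodes (outgoing edges)."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}  # initial low uniform value
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, out in graph.items():
            for m in out:
                new[m] += damping * rank[n] / len(out)
        rank = new
    return rank

# Hypothetical mini-graph: two synsets of "bank" plus context synsets.
graph = {
    "bank#finance": ["money#1", "loan#1"],
    "bank#river":   ["water#1"],
    "money#1":      ["bank#finance"],
    "loan#1":       ["bank#finance", "money#1"],
    "water#1":      ["bank#river"],
}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # bank#finance
```

With more financial context synsets linking in, the finance sense accumulates the most rank and is selected.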
Knowledge Acquisition Bottleneck
WSD needs knowledge! Corpora, dictionaries, semantic
networks
More knowledge is required to improve the performance of
both:
Supervised systems – more training data
Knowledge based systems – richer networks
Minimally Supervised WSD
Human supervision is expensive, but required for training
examples or a knowledge base
Minimally supervised approaches aim to learn classifiers from
annotated data with minimal human supervision
Bootstrapping
Given a set of labelled examples L, a set of unlabelled examples
U, and a classifier c:
1. Train c on L
2. Choose N examples from U and label them with c, forming U ′
3. Move the K most confidently labelled instances from U ′ into L
Repeat until U is empty
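The loop above can be sketched as a self-training routine; the toy 1-D nearest-centroid classifier and margin-based confidence here are stand-ins for a real classifier:

```python
def bootstrap(labelled, unlabelled, n=4, k=2):
    """Self-training sketch: labelled = [(x, label)], unlabelled = [x]."""
    labelled, unlabelled = list(labelled), list(unlabelled)
    while unlabelled:
        # Train: toy nearest-centroid classifier over 1-D points.
        cents = {y: sum(x for x, yy in labelled if yy == y) /
                    sum(1 for _, yy in labelled if yy == y)
                 for y in {y for _, y in labelled}}
        batch, unlabelled = unlabelled[:n], unlabelled[n:]
        # Label the batch; confidence = margin between the two nearest centroids.
        scored = []
        for x in batch:
            dists = sorted((abs(x - c), y) for y, c in cents.items())
            conf = dists[1][0] - dists[0][0] if len(dists) > 1 else 1.0
            scored.append((conf, x, dists[0][1]))
        # Keep the k most confident labels; return the rest to the pool.
        scored.sort(reverse=True)
        labelled += [(x, y) for _, x, y in scored[:k]]
        unlabelled += [x for _, x, _ in scored[k:]]
    return labelled

seeds = [(0.0, "A"), (10.0, "B")]
result = bootstrap(seeds, [1, 2, 8, 9])
print(sorted(result))  # points near 0 end up "A", points near 10 end up "B"
```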
Outline
1 Introduction
2 WSD
3 WSI
4 Evaluation and Issues
5 Wikipedia
6 Summary
Word Sense Induction
Based on the idea that one sense of a word will have similar
neighbouring words
Follows the idea that the meaning of a word is given by its
usage
We induce word sense from input text by clustering word
occurrences
Clustering
Unsupervised machine learning for grouping similar objects
into clusters
No a priori input (sense labels)
Context clustering: each occurrence of a word is represented
as a context vector; cluster vectors into groups
Word clustering: cluster words which are semantically similar
and thus have a specific meaning
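Context clustering can be sketched as below; the greedy single-pass grouping, similarity threshold, stopword list, and example sentences are all invented for illustration (real systems use proper clustering algorithms such as k-means or agglomerative clustering):

```python
from collections import Counter

STOP = {"the", "a", "on", "at", "and", "was", "in", "near", "after", "of"}

def context_vector(sentence, target):
    """Bag-of-words context vector for one occurrence of `target`."""
    return Counter(w for w in sentence.lower().split()
                   if w != target and w not in STOP)

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda c: sum(x * x for x in c.values()) ** 0.5
    return dot / (norm(u) * norm(v) or 1.0)

def cluster(vectors, threshold=0.2):
    """Greedy clustering: join a vector to the first cluster whose
    seed vector is similar enough, else start a new cluster."""
    clusters = []
    for v in vectors:
        for c in clusters:
            if cosine(v, c[0]) >= threshold:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

# Two occurrences of "bank" per sense (invented contexts).
sents = ["the bank charged interest on the loan",
         "deposit money at the bank and the interest grows",
         "the river bank was muddy after rain",
         "fish swim near the river bank in the muddy water"]
groups = cluster([context_vector(s, "bank") for s in sents])
print(len(groups))  # 2
```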
Word Clustering
Aims to cluster words which are semantically similar
Lin (1998) proposes this method:
1. Extract dependency triples from a text corpus
John eats a yummy kiwi → (eat subj John), (kiwi obj-of eat),
(kiwi det a) ...
2. Define a measure of similarity between two words
3. Use similarity scores to create a similarity tree; start with a
root node, and recursively add children in descending
order of similarity.
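Step 2 can be illustrated over the triples above; Lin's actual measure weights shared features by their information content, so the Dice overlap used here is a deliberate simplification, and the triples are toy data:

```python
def features(triples, word):
    """Dependency features of `word`: the (relation, other-word) pairs
    it participates in, e.g. ("obj-of", "eat") for "kiwi"."""
    return {(rel, w2) for w1, rel, w2 in triples if w1 == word}

def dice_similarity(triples, a, b):
    """Dice overlap of feature sets -- a simplification of Lin's
    information-weighted similarity."""
    fa, fb = features(triples, a), features(triples, b)
    return 2 * len(fa & fb) / (len(fa) + len(fb) or 1)

triples = [("kiwi", "obj-of", "eat"), ("kiwi", "det", "a"),
           ("apple", "obj-of", "eat"), ("apple", "det", "an"),
           ("stone", "obj-of", "throw"), ("stone", "det", "the")]
print(dice_similarity(triples, "kiwi", "apple"))  # 0.5
print(dice_similarity(triples, "kiwi", "stone"))  # 0.0
```

Kiwi and apple share the "thing that gets eaten" feature, so they land near each other in the similarity tree; stone does not.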
Lin’s approach: example [figure: similarity tree]
WSI: pros and cons
+ Actually performs word sense discrimination
+ Aims to divide the occurrences of a word into a number of
classes
- Makes objective evaluation more difficult if not
domain-specific
Outline
1 Introduction
2 WSD
3 WSI
4 Evaluation and Issues
5 Wikipedia
6 Summary
Disambiguation Evaluation
Disambiguation is easy to evaluate – we have discrete sense
inventories
Evaluate with coverage (proportion of instances attempted),
precision and recall, and then F1
Accuracy – correct answers / total answers
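The standard figures work out as follows (the example counts are invented):

```python
def wsd_scores(total, attempted, correct):
    """Standard WSD evaluation figures, from counts of test instances."""
    coverage = attempted / total
    precision = correct / attempted   # correct among answers given
    recall = correct / total          # correct among all instances
    f1 = 2 * precision * recall / (precision + recall)
    return coverage, precision, recall, f1

# e.g. 100 instances; the system answers 80 and gets 60 right:
print(wsd_scores(100, 80, 60))  # coverage 0.8, P 0.75, R 0.6, F1 ≈ 0.667
```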
Disambiguation Baselines
MFS – Most Frequent Sense
Strong baseline - 50-60% accuracy on lexical sample task
Doesn’t take into account genre (e.g. star in astrophysics /
newswire)
Subject to idiosyncrasies of the corpus
Evaluation with gold-standard clustering
Given a standard clustering, compare the gold standard and
output clustering
Can evaluate with cluster entropy and purity
Also the Rand Index (similar to the Jaccard index) and F-Score.
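The Rand Index is the fraction of instance pairs on which the two clusterings agree; a minimal sketch with invented cluster labels:

```python
from itertools import combinations

def rand_index(gold, predicted):
    """Fraction of instance pairs that are same-cluster in both
    clusterings, or different-cluster in both."""
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum((gold[i] == gold[j]) == (predicted[i] == predicted[j])
                for i, j in pairs)
    return agree / len(pairs)

# Gold and system cluster labels for six occurrences of a word:
print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 10/15 ≈ 0.667
```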
Discrimination Baselines
All-in-one: group all words into one big cluster
Random: produce a random set of clusters
Pseudowords
Discrimination evaluation method
Generates new words with artificial ambiguity
Select two or more monosemous terms from gold standard
data
Given all their occurrences in a corpus, replace them with a
pseudoword formed by joining the terms
Compare automatic discrimination to gold standard
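The construction can be sketched as below; the corpus and term pair are invented, and the original term is kept as the gold label for each occurrence:

```python
import re

def pseudoword_corpus(corpus, terms):
    """Replace every occurrence of each monosemous term with the
    joined pseudoword; the replaced term is the gold label."""
    pseudo = "-".join(terms)
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b")
    gold, out = [], []
    for sentence in corpus:
        gold.extend(pattern.findall(sentence))
        out.append(pattern.sub(pseudo, sentence))
    return out, gold

corpus = ["she peeled a banana", "he locked the door"]
texts, gold = pseudoword_corpus(corpus, ["banana", "door"])
print(texts)  # ['she peeled a banana-door', 'he locked the banana-door']
print(gold)   # ['banana', 'door']
```

A discrimination system then clusters the pseudoword's occurrences, and the clusters are compared against the gold labels.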
SemEval-2007
Lexical sample and all-words coarse grained WSD
Preposition disambiguation
Evaluation of WSD on cross-language IR
WSI, lexical substitution
Top systems reach 88.7% accuracy (on lexical sample) and
82.5% (on all-words)
SemEval-2010
Fifth event of its kind
Includes specific cross-lingual tasks
Combined WSI/WSD task
Domain-specific all-words task
Issues
Representation of word senses: enumerative vs. generative
approach
Knowledge Acquisition Bottleneck: not enough data!
Benefits for AI/NLP applications
Alleviating the Knowledge Acquisition Bottleneck
Weakly-supervised algorithms, incorporating bootstrapping or
active learning
Continuing manual efforts – WordNet, Open Mind Word
Expert, OntoNotes
Automatic enrichment of knowledge resources – collocation
and relation triple extraction, BabelNet
Future Challenges
How can we mine even larger repositories of textual data
(e.g. the whole web) to create huge knowledge repositories?
How can we design high-performance, scalable algorithms
to use this data?
Need to decide which kinds of word senses are needed for which
application
We still need to develop a general representation of word senses
Outline
1 Introduction
2 WSD
3 WSI
4 Evaluation and Issues
5 Wikipedia
6 Summary
Wikipedia as sense inventory
Wikipedia articles provide an inventory of disambiguated word
senses and entity references
Task: Use their occurrences in texts, i.e. the internal
Wikipedia hyperlinks, as named entity and sense annotations
The articles’ texts provide a sense annotated corpus
Mihalcea (2007)
Mihalcea proposes a method for automatically generating
sense-tagged data using Wikipedia
Rhythm is the arrangement of sounds in time. Meter animates
time in regular pulse groupings, called measures or [[bar
(music)|bar]].
The nightlife is particularly active around the beachfront
promenades because of its many nightclubs and [[bar
(establishment)|bars]].
1. Extract all paragraphs in Wikipedia containing word w
2. Collect all possible labels l1 . . . ln for w
3. Map each label li to its WordNet sense si
4. Annotate each occurrence of w labelled li with its sense si
System trained on Wikipedia significantly outperforms MFS
and Lesk baselines
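Harvesting the annotations relies only on MediaWiki's [[target|surface]] link syntax; a minimal extraction sketch:

```python
import re

# Matches [[article title|surface text]] internal links.
LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")

def sense_annotations(paragraph):
    """Return (surface word, Wikipedia article title) pairs from the
    paragraph's internal links; the title acts as the sense label."""
    return [(surface, title) for title, surface in LINK.findall(paragraph)]

text = ("Meter animates time in regular pulse groupings, "
        "called measures or [[bar (music)|bars]].")
print(sense_annotations(text))  # [('bars', 'bar (music)')]
```

Mapping each article title to a WordNet sense then yields sense-tagged training data for free.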
Knowledge-rich WSD
General aim is to relieve knowledge acquisition bottleneck of
NLP systems, with WSD as a case study
Main ideas:
- Extend WordNet with millions of semantic relations (using
Wikipedia)
- Apply knowledge-based WSD to exploit extended WordNet
Results: integration of many, many semantic relations in
knowledge-based systems yields performance competitive with
SotA supervised approaches
Wikification
The task of generating hyperlinks to disambiguated Wikipedia
concepts
Two sub-tasks: automatic keyword extraction, WSD
Wikify! (Csomai and Mihalcea, 2008) can perform keyword
extraction by extracting candidates and then ranking them
The system does knowledge-based and data-driven WSD,
filtering out annotations that contain disagreements
Disambiguate links using relatedness, commonness (prior
probability of a sense), and context quality (context terms).
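Commonness is just the prior probability of each target article for a given anchor text, estimated from link counts; the counts below are hypothetical:

```python
from collections import Counter

def commonness(link_counts):
    """Prior probability of each target article for one anchor text,
    estimated from how often the anchor links to each article."""
    total = sum(link_counts.values())
    return {article: n / total for article, n in link_counts.items()}

# Hypothetical counts of where the anchor text "bar" links to:
priors = commonness(Counter({"bar (establishment)": 600,
                             "bar (music)": 300,
                             "bar (unit)": 100}))
print(priors["bar (establishment)"])  # 0.6
```

Absent strong contextual evidence, the link is resolved to the most common target.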
Outline
1 Introduction
2 WSD
3 WSI
4 Evaluation and Issues
5 Wikipedia
6 Summary
Questions
Thank you. Are there any questions?