An Improved Approach to Word Sense Disambiguation
1. A knowledge based approach
Word Sense Disambiguation
Submitted by:
Pradeep Sachdeva – 10104678
Surbhi Verma – 10104686
Supervisor:
Dr. Sandeep Kumar Singh
2. • Words in the English language often
correspond to different meanings in different
contexts. Such words are referred to as
polysemous words (words having more than
one sense).
• This project presents a knowledge based
algorithm for disambiguating polysemous
words in any given sentence using the
computational linguistics tool WordNet.
Problem Statement
3. Consider the following sentences:
The album includes a few instrumental pieces.
His efforts have been instrumental in solving the problem.
4. The solution to the problem of WSD impacts
other computer-related tasks such as:
• improving the relevance of search engines,
• anaphora resolution,
• coherence and inference.
WSD is an intermediate language engineering
technology which could improve applications
such as information retrieval (IR).
Relevance of WSD
5. • Supervised Methods
• Unsupervised Methods
• Dictionary or knowledge based methods
Different Approaches
6. • Supervised methods are based on the assumption that
the context can provide enough evidence on its own to
disambiguate words. However, they are subject to a
new knowledge acquisition bottleneck since they rely
on substantial amounts of manually sense-tagged
corpora for training, which are laborious and expensive
to create.
• They depend crucially on the existence of manually
annotated examples for every word sense, a requisite
that can so far be met only for a handful of words for
testing purposes.
Supervised Methods
7. • In this approach the underlying assumption is
that similar senses occur in similar contexts,
and thus senses can be induced from text by
clustering word occurrences using some
measure of similarity of context. New
occurrences of the word can be classified into
the closest induced clusters/senses.
• The performance of unsupervised methods is
generally lower than that of the other methods.
Unsupervised Methods
8. • Knowledge based methods rely primarily on
dictionaries, thesauri, and lexical knowledge
bases, without using any corpus evidence.
Therefore, these methods do not require any
kind of training corpus.
• The performance of these methods is high, and
they do not face the new knowledge acquisition
bottleneck, since no training data is
required.
Knowledge Based Methods
9. • WordNet is a lexical database for the English
language which groups English words into sets
of synonyms called synsets, provides short,
general definitions and records the various semantic
relations between these synonym sets.
About WordNet
10. • Every synset contains a group of synonymous words
or collocations; different senses of a word are in different synsets.
• The meaning of each synset is further clarified with a short
defining gloss (a definition and/or example sentences).
• Most synsets are connected to other synsets via a number of
semantic relations. A few of them include:
hypernyms: Y is a hypernym of X if every X is a (kind of) Y (bird is a
hypernym of parrot)
hyponyms: Y is a hyponym of X if every Y is a (kind of) X (parrot is a
hyponym of bird)
meronyms: Y is a meronym of X if Y is a part of X (window is a
meronym of building)
holonyms: Y is a holonym of X if X is a part of Y (building is a
holonym of window)
11. The synsets of the word sea are:
1. sea (synonyms): a division of an ocean or a large body of salt water
partially enclosed by land
– It has hypernyms - body of water, water
– It has hyponyms - south sea
– It has meronyms - bay, inlet, recess, embayment, gulf
– It has holonyms - hydrosphere
2. sea, ocean (synonyms) : anything apparently limitless in quantity or
volume
– It has hypernyms - large indefinite amount, large indefinite quantity
3. sea (synonyms): turbulent water with swells of considerable size
– It has hypernyms - turbulent flow
– It has hyponyms - head sea
An example
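The three senses of sea listed above can be held in a small data structure. This is only a minimal sketch: the Synset class, the SEA_SENSES list and the senses_with helper are hypothetical names introduced for illustration, and the entries are copied by hand from this slide rather than read from WordNet.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One sense of a word: its synonyms, gloss and related words."""
    words: list
    gloss: str
    hypernyms: list = field(default_factory=list)
    hyponyms: list = field(default_factory=list)
    meronyms: list = field(default_factory=list)
    holonyms: list = field(default_factory=list)


# Hand-copied from the slide; not loaded from the WordNet database.
SEA_SENSES = [
    Synset(
        words=["sea"],
        gloss="a division of an ocean or a large body of salt water "
              "partially enclosed by land",
        hypernyms=["body of water", "water"],
        hyponyms=["south sea"],
        meronyms=["bay", "inlet", "recess", "embayment", "gulf"],
        holonyms=["hydrosphere"],
    ),
    Synset(
        words=["sea", "ocean"],
        gloss="anything apparently limitless in quantity or volume",
        hypernyms=["large indefinite amount", "large indefinite quantity"],
    ),
    Synset(
        words=["sea"],
        gloss="turbulent water with swells of considerable size",
        hypernyms=["turbulent flow"],
        hyponyms=["head sea"],
    ),
]


def senses_with(word, senses):
    """Return the positions of the senses whose synonym list contains word."""
    return [i for i, sense in enumerate(senses) if word in sense.words]
```

For example, senses_with("ocean", SEA_SENSES) picks out only the second sense, since ocean is a synonym of sea in that synset alone.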
12. The algorithm computes an overall impact of
the following parameters on the similarity of
two words:
• Intersection
• Hierarchical Level
• Distance
Algorithm
13. [Figure: sets N, S1 and S2 at LEVEL 1]
Intersection is computed as the number of overlapping words
between the word families of the senses of the target word and the
nearby word at various levels of the hierarchy.
At LEVEL 1:
Let us assume there are two senses of the target word. Let the
word families of the two senses of the target word be S1 and S2.
Also let the word families of all the senses of a nearby word
be represented by a single set N.
Intersection at Level 1
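The level 1 intersection can be sketched with plain Python sets. The function name intersection_counts and the word families below are hypothetical toy data chosen for illustration (two senses of "bank" next to the nearby word "money"); they are not taken from WordNet or from the project's code.

```python
def intersection_counts(sense_families, nearby_family):
    """Count the words each sense's family shares with the nearby word's set N."""
    return {
        name: len(family & nearby_family)
        for name, family in sense_families.items()
    }


# Toy word families (assumed for illustration only).
S1 = {"bank", "depository", "financial", "institution", "money", "deposit"}
S2 = {"bank", "slope", "land", "river", "water"}
N = {"money", "currency", "deposit", "coin", "financial"}

counts = intersection_counts({"S1": S1, "S2": S2}, N)
```

With this toy data S1 shares three words with N and S2 shares none, so the intersection parameter favours the first sense.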
14. [Figure: sets N, S1 and S2 with their parent sets PN, PS1 and PS2]
Including the hypernyms at level 2:
PS1, PS2 and PN are the parents (hypernyms) of S1, S2 and N respectively.
Intersection at Level 2
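The slide does not spell out how the parent sets enter the count. One plausible reading, assumed here, is that each set is extended with its parents before the overlap is counted. The function name level2_intersection is hypothetical, and the sets are toy data loosely based on the sea example, with "lake" as an assumed nearby word.

```python
def level2_intersection(sense, sense_parents, nearby, nearby_parents):
    """Overlap after adding each set's hypernyms (assumed reading of level 2)."""
    return len((sense | sense_parents) & (nearby | nearby_parents))


# Toy data for illustration; not actual WordNet output.
S1, PS1 = {"sea"}, {"body of water", "water"}
S2, PS2 = {"sea", "ocean"}, {"large indefinite amount", "large indefinite quantity"}
N, PN = {"lake"}, {"body of water", "water"}

level1 = {"S1": len(S1 & N), "S2": len(S2 & N)}
level2 = {
    "S1": level2_intersection(S1, PS1, N, PN),
    "S2": level2_intersection(S2, PS2, N, PN),
}
```

Here neither sense overlaps with N at level 1, but at level 2 the first sense shares its hypernyms with the nearby word, which is the situation the second level is meant to catch.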
16. Score
We compute the overall impact of intersection, hierarchical level and distance on
the degree of similarity between target and nearby words.
We have devised a formula of score as follows:
\[ \text{Score} = \frac{(\text{Intersection})^{1/k_1}}{(\text{Level})^{k_2} \cdot (\text{Distance})^{1/k_3}} \]
The values of k1, k2 and k3 have been experimentally determined as:
k1 = 3, k2 = 3, k3 = 3
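A minimal sketch of the score follows, reading the formula as a fraction with the intersection term in the numerator and the level and distance terms in the denominator. That reading, the function name score and the requirement that level and distance be at least 1 are assumptions; the slides do not define the units of level or distance.

```python
def score(intersection, level, distance, k1=3, k2=3, k3=3):
    """Score = Intersection^(1/k1) / (Level^k2 * Distance^(1/k3)).

    Assumes level and distance are counted from 1, so the
    denominator can never be zero.
    """
    if level < 1 or distance < 1:
        raise ValueError("level and distance must be at least 1")
    return intersection ** (1 / k1) / (level ** k2 * distance ** (1 / k3))


# Same overlap found at a deeper level and farther away scores lower.
near = score(intersection=8, level=1, distance=1)
far = score(intersection=8, level=2, distance=8)
```

Under this reading a larger overlap raises the score, while the cubed level term penalises overlaps found higher up the hierarchy much more strongly than the cube root of the distance penalises words that are farther apart.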
17. Evaluation - SemCor
The algorithm has been evaluated on the SemCor dataset, the
largest publicly available sense-tagged corpus, created at
Princeton University.
It has been automatically mapped to various versions of
WordNet.
For every polysemous word in a sentence, SemCor provides the
sense it corresponds to in WordNet.
18. The algorithm has been evaluated in the following three ways:
Top 1 – This refers to the case when the correct sense, i.e. the
sense specified by SemCor, has been given the highest score
by the algorithm and is ranked first.
Top 2 – This refers to the case when the correct sense, i.e. the
sense specified by SemCor, is one of the top 2 scoring senses
given by the algorithm.
Top 3 – This refers to the case when the correct sense, i.e. the
sense specified by SemCor, is one of the top 3 scoring senses
given by the algorithm.
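The three measures can be sketched as a single top-k accuracy function. The names top_k_hit and top_k_accuracy and the instances below are hypothetical; the scores and sense labels are made-up toy values, not SemCor data or results of the algorithm.

```python
def top_k_hit(scores, gold_sense, k):
    """True if the gold sense is among the k highest-scoring senses."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return gold_sense in ranked[:k]


def top_k_accuracy(instances, k):
    """Fraction of (scores, gold_sense) pairs that count as a top-k hit."""
    hits = sum(top_k_hit(scores, gold, k) for scores, gold in instances)
    return hits / len(instances)


# Toy instances: per-sense scores and the sense tagged as correct.
instances = [
    ({"s1": 0.9, "s2": 0.4, "s3": 0.1}, "s1"),             # gold ranked 1st
    ({"s1": 0.2, "s2": 0.7, "s3": 0.5}, "s3"),             # gold ranked 2nd
    ({"s1": 0.6, "s2": 0.3, "s3": 0.1}, "s3"),             # gold ranked 3rd
    ({"s1": 0.5, "s2": 0.4, "s3": 0.3, "s4": 0.1}, "s4"),  # gold ranked 4th
]

top1 = top_k_accuracy(instances, 1)
top2 = top_k_accuracy(instances, 2)
top3 = top_k_accuracy(instances, 3)
```

By construction Top 1 can never exceed Top 2, and Top 2 can never exceed Top 3, since every hit at a smaller k is also a hit at a larger k.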
19. Comparison of results
Therefore the algorithm performs better than the existing approaches in this area.