1. Seminar on Artificial Intelligence
Information Retrieval
Using
Semantic Similarity
Harshita Meena (100050020)
Diksha Meghwal (100050039)
Saswat Padhi (100050061)
2. Overview ...
● “Semantics” & “Ontology” (Diksha)
  ● What is IR lacking?
  ● Semantics: “What”? And How?
  ● Ontologies and knowledge representation
● Semantic Similarity (Harshita)
  ● Semantic Similarity: What? and How?
  ● Path based semantic similarity measures
  ● Information content based similarity measures
● Information Retrieval (Saswat)
  ● VSM Revisited
  ● SSRM: IR with semantics
  ● Conclusion and further reading
3. “Semantics” & “Ontology”
What is IR (without semantics) lacking?
“MEANING”
Query: software
Pool: application, program, package, freeware, shareware
Result: No match!!
This mismatch is the motivation for looking at semantic rather than lexical similarity.
The problem in information retrieval today is not a lack of data, but the
lack of “structured” and “meaningful” organisation of data.
Ontologies are attempts to organise information and empower IR.
4. “Semantics” & “Ontology”
Semantics: What? And How?
“Semantics” captures the meaning of linguistic terms. Computers do not
understand “meaning”, so the semantics of a term is instead represented
through its links to other terms.
An “ontology” formally represents knowledge as a set of concepts
within a domain, and the relationships between pairs of concepts. It
can be used to model a domain and support reasoning about entities.
Formal definition by Tom Gruber:
An ontology is a formal, explicit specification of a shared conceptualization
● formal: it should be machine readable
● explicit: the types of concepts and the constraints on their use are explicitly defined
● shared: the ontology is agreed upon and accepted by a group
● conceptualization: an abstract model that consists of the relevant concepts and the relationships between them
5. “Semantics” & “Ontology”
Components of Ontologies
● Classes: abstract groups, or collections of objects. They may contain individuals, other classes, or a combination of both. Classes can be extensional or intensional, and can subsume or be subsumed.
● Attributes: used to store information that is specific to the object they are attached to, such as its features or characteristics.
● Relationships: a relation is an attribute whose value is another object in the ontology, e.g. subsumption relations (is-superclass-of, the converse of is-a, is-subtype-of or is-subclass-of) and meronymy relations (part-of).
● A domain ontology (or domain-specific ontology) models a specific domain, or part of the world.
● An upper ontology (or foundation ontology) is a model of the common objects that are applicable across a range of domain ontologies.
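As a toy illustration (not from the slides), the core of such a structure can be sketched in Python; the concept names and hierarchy below are hypothetical:

# A minimal toy ontology: each concept maps to its parent (IS-A) concept.
IS_A = {
    "freeware":    "software",
    "shareware":   "software",
    "application": "software",
    "software":    "artifact",
    "artifact":    "entity",   # "entity" is the root concept
}

def ancestors(concept):
    """Return the path of concepts from `concept` up to the root."""
    path = [concept]
    while concept in IS_A:
        concept = IS_A[concept]
        path.append(concept)
    return path

print(ancestors("freeware"))   # ['freeware', 'software', 'artifact', 'entity']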
6. “Semantics” & “Ontology”
Examples of Popular Ontologies
WordNet
WordNet is a lexical database for the English language, which superficially resembles a thesaurus. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.

Medical Subject Headings (MeSH)
MeSH is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it can also serve as a thesaurus that facilitates searching. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM's catalog of book holdings.
7. “Semantics” & “Ontology”
The Future: “Semantic Web”, OWL and RDF ...
The Semantic Web is a collaborative movement led by the international standards
body W3C. It aims at converting the current web, dominated by semi-structured
documents, into an organised “web of data”.
RDF (Resource Description Framework) is a part of the W3C family of specifications,
which can be used as a general method for conceptual description or modeling of
information.

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>
OWL is built on top of RDF; it is more expressive and supports greater machine
interpretability than RDF.

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <owl:Ontology rdf:about="http://www.linkeddatatools.com/plants">
    <dc:title>The LinkedDataTools.com Example Plant Ontology</dc:title>
    <dc:description>An example ontology</dc:description>
  </owl:Ontology>
  <owl:Class rdf:about="http://www.linkeddatatools.com/plants#planttype">
    <rdfs:label>The plant type</rdfs:label>
    <rdfs:comment>The class of plant types.</rdfs:comment>
  </owl:Class>
</rdf:RDF>
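Such snippets can also be loaded and queried programmatically; a minimal sketch using the third-party rdflib Python library (assuming it is installed, e.g. via pip install rdflib):

import rdflib

rdf_xml = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""

g = rdflib.Graph()
g.parse(data=rdf_xml, format="xml")   # parse the RDF/XML example above
for subj, pred, obj in g:             # iterate over the extracted triples
    print(subj, pred, obj)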
8. Semantic Similarity
An ontology by itself is just a “structure”, without any weights on the edges.
Semantic similarity measures exploit this structural information and
try to quantify concept similarities in a given ontology.
Ontology based semantic measures can be classified as follows:
● Path Based Similarity Measures
  Path based similarity measures utilize the information of the shortest path between two concepts, their generality or specificity, and their relationship with other concepts.
● Information Content Based Similarity Measures
  Information content based measures associate with each concept a quantity IC that takes into account the probabilities of concepts in the ontology.
● Feature Based Similarity Measures (which we won't be discussing)
9. Semantic Similarity (Path Based)
Wu & Palmer Measure:

    \mathrm{sim}_{W\&P}(C_1, C_2) = \frac{2H}{N_1 + N_2 + 2H}

The Wu & Palmer measure fits the intuition that concepts at greater depth
should be more similar (because of their specificity).
N_1 and N_2 are the numbers of IS-A links from C_1 and C_2 respectively to
their most specific common subsumer concept C. H is the number of
IS-A links from C to the root of the ontology.

Li Measure:

    \mathrm{sim}_{Li}(C_1, C_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}}

Li combines the shortest path and the depth of the ontology
in a non-linear function.
L stands for the shortest path between the two concepts, and α and β are
scaling factors. H is the same as in the Wu & Palmer measure.
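A minimal sketch of both path based measures over a toy IS-A hierarchy (the hierarchy and the α, β defaults below are hypothetical, chosen only for illustration):

import math

# Toy IS-A hierarchy: child -> parent; "entity" is the root.
IS_A = {"freeware": "software", "application": "software",
        "software": "artifact", "artifact": "entity"}

def path_to_root(c):
    path = [c]
    while c in IS_A:
        c = IS_A[c]
        path.append(c)
    return path

def lcs_info(c1, c2):
    """Return (N1, N2, H): IS-A links from c1 and c2 to their most
    specific common subsumer C, and from C to the root."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    for n1, c in enumerate(p1):
        if c in p2:
            return n1, p2.index(c), len(path_to_root(c)) - 1
    raise ValueError("no common subsumer")

def sim_wu_palmer(c1, c2):
    n1, n2, h = lcs_info(c1, c2)
    return 2 * h / (n1 + n2 + 2 * h)

def sim_li(c1, c2, alpha=0.2, beta=0.6):
    n1, n2, h = lcs_info(c1, c2)
    L = n1 + n2                          # shortest IS-A path via the subsumer
    return math.exp(-alpha * L) * math.tanh(beta * h)   # tanh equals the fraction above

print(sim_wu_palmer("freeware", "application"))   # 2*2 / (1+1+2*2) ≈ 0.67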
10. Semantic Similarity (Path Based)
Leacock & Chodorow Measure:

    \mathrm{sim}_{L\&C}(C_1, C_2) = -\log \frac{L}{2H}

This is almost the same as the Wu & Palmer method, except for the
logarithmic smoothing and the removal of the depth factor from the denominator.
As in the Li measure, L is the shortest path between concepts C_1 and C_2.
H here is the overall depth of the ontology (the maximum number of IS-A
links from a concept to the root).
Mao Measure:

    \mathrm{sim}_{Mao}(C_1, C_2) = \frac{\delta}{L \cdot \log_2 (1 + d(C_1) + d(C_2))}

The Mao measure considers the generality of the concepts by taking into
account the number of their descendants.
L stands for the shortest path between the two concepts, and d(C) stands for
the number of descendants of C. δ is a constant (usually chosen as 0.9).
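Both measures are direct transcriptions of their formulas; a small sketch (the example inputs are hypothetical):

import math

def sim_leacock_chodorow(L, H):
    """L: shortest path between the two concepts; H: depth of the ontology."""
    return -math.log(L / (2 * H))

def sim_mao(L, d1, d2, delta=0.9):
    """d1, d2: numbers of descendants of the two concepts."""
    return delta / (L * math.log2(1 + d1 + d2))

print(sim_leacock_chodorow(L=2, H=10))   # -log(2/20) ≈ 2.30
print(sim_mao(L=2, d1=3, d2=0))          # 0.9 / (2 * log2(4)) = 0.225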
11. Semantic Similarity (IC Based)
The intuition behind information content is that more frequent terms
are more general and hence provide less “information”:

    \mathrm{IC}(C) = -\log p(C) = -\log \frac{\mathrm{freq}(C)}{\mathrm{freq}(root)}

freq(C) is the frequency of concept C, and freq(root) is the frequency of the
root concept of the ontology. Frequencies include the frequencies of
subsumed concepts in an IS-A hierarchy.
We call concept C the most informative subsumer of two concepts C_1 and C_2,
with information content IC_mis(C_1, C_2), if C has the least probability
among all shared subsumers of the two concepts (and is thus the most
informative).

Resnik Measure:

    \mathrm{sim}_{Resnik}(C_1, C_2) = \mathrm{IC}_{mis}(C_1, C_2)

The more information two terms share, the more similar they are.
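A minimal sketch of IC and the Resnik measure over hypothetical corpus frequencies (all counts and concept names are made up for illustration):

import math

# Hypothetical frequencies; each count already includes all subsumed concepts.
FREQ = {"entity": 1000, "artifact": 400, "software": 120, "freeware": 30}
IS_A = {"freeware": "software", "software": "artifact", "artifact": "entity"}

def ic(c):
    return -math.log(FREQ[c] / FREQ["entity"])

def subsumers(c):
    out = {c}
    while c in IS_A:
        c = IS_A[c]
        out.add(c)
    return out

def sim_resnik(c1, c2):
    shared = subsumers(c1) & subsumers(c2)
    return max(ic(c) for c in shared)   # IC of the most informative subsumer

print(round(sim_resnik("freeware", "software"), 2))   # IC("software") ≈ 2.12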
12. Semantic Similarity (IC Based)
Jiang Measure:

    \mathrm{dist}_{Jiang}(C_1, C_2) = \mathrm{IC}(C_1) + \mathrm{IC}(C_2) - 2\,\mathrm{IC}_{mis}(C_1, C_2)

The Jiang measure considers the information content of each term, apart
from the shared information content. It is an inverted (distance) measurement:
the distance between two concepts is the amount of information
needed to fully describe both concepts, excluding the amount of
information that is common to both of them.

Lin Measure:

    \mathrm{sim}_{Lin}(C_1, C_2) = \frac{2\,\mathrm{IC}_{mis}(C_1, C_2)}{\mathrm{IC}(C_1) + \mathrm{IC}(C_2)}

The Lin measure also uses the information contents of each term, but uses
them differently from Jiang: it takes a ratio instead of a difference.
Since IC_mis(C_1, C_2) ≤ IC(C_1) and IC(C_2), the similarity value is
normalized between 0 and 1 (1 for identical concepts).
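Both measures are simple arithmetic over IC values; a sketch with hypothetical inputs:

def dist_jiang(ic1, ic2, ic_mis):
    """Jiang distance from the ICs of two concepts and of their
    most informative subsumer."""
    return ic1 + ic2 - 2 * ic_mis

def sim_lin(ic1, ic2, ic_mis):
    """Lin similarity: ratio of shared to total information content."""
    return 2 * ic_mis / (ic1 + ic2)

# Hypothetical IC values, e.g. IC(freeware)=3.5 and IC(software)=IC_mis=2.1:
print(dist_jiang(3.5, 2.1, 2.1))   # 1.4
print(sim_lin(3.5, 2.1, 2.1))      # 0.75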
13. Semantic Similarity
Correlation with human judgements
WordNet Ontology                    MeSH Ontology
Method        Type  Correlation     Method        Type  Correlation
Wu & Palmer   Path  0.74            Wu & Palmer   Path  0.67
Li            Path  0.82            Li            Path  0.70
Leacock       Path  0.82            Leacock       Path  0.74
Resnik        IC    0.79            Resnik        IC    0.71
Lin           IC    0.82            Lin           IC    0.72
Jiang         IC    0.83            Jiang         IC    0.71
14. Information Retrieval
SSRM: IR with semantics ... (0/3)
VSM Revisited:
● Similarity in VSM is the cosine inner product:

    \mathrm{sim}(q, d) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \cdot \sqrt{\sum_i d_i^2}}

● Each dimension corresponds to a separate term; q and d are n-dimensional vectors with weights for each term.
● q_i and d_i are the weights of the query and document terms.
● The document term weight is d_i = tf_i · idf_i.
● Specifically, I will talk about the SSRM (Semantic Similarity Retrieval Model) algorithm, where we modify the query term weights to take semantic similarity into account.
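A minimal sketch of the baseline cosine similarity over sparse term-weight vectors (the example vectors are hypothetical):

import math

def cosine_sim(q, d):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"software": 1.0, "free": 0.5}       # query term weights
d = {"software": 0.8, "package": 0.3}    # document tf-idf weights
print(round(cosine_sim(q, d), 3))        # ≈ 0.838; note "package" never matches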
15. Information Retrieval
SSRM: IR with semantics ... (1/3)
Query Re-weighting:
● A query can contain related (semantically similar) terms:
  Query: free scientific computing software
● We need to re-weight the query terms to stress the particular concept we are searching for:

    q_i' = q_i + \sum_{j \ne i,\; \mathrm{sim}(i,j) \ge t} q_j \cdot \mathrm{sim}(i, j)

● q_i and q_i' are the old and new weights respectively.
● i and j refer to different terms in the query.
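A direct transcription of the re-weighting step (the similarity table and the threshold value are hypothetical):

def reweight(query, sim, t=0.8):
    """query: {term: weight}; sim(i, j): term similarity; t: threshold."""
    return {i: qi + sum(qj * sim(i, j)
                        for j, qj in query.items()
                        if j != i and sim(i, j) >= t)
            for i, qi in query.items()}

PAIRS = {frozenset(("software", "computing")): 0.9}   # toy similarities
sim = lambda i, j: 1.0 if i == j else PAIRS.get(frozenset((i, j)), 0.0)

print(reweight({"software": 1.0, "computing": 1.0, "free": 1.0}, sim))
# {'software': 1.9, 'computing': 1.9, 'free': 1.0}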
16. Information Retrieval
SSRM: IR with semantics ... (2/3)
Query Expansion:
● New terms might be semantically similar to the query terms. We “expand” the query by adding new terms from the neighbourhood of each query term in the ontology.
● Adding such terms also affects the weights of the existing terms:

    q_i' = \begin{cases} \sum_{j \ne i,\; \mathrm{sim}(i,j) \ge T} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ is a new term} \\ q_i + \sum_{j \ne i,\; \mathrm{sim}(i,j) \ge T} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ had weight } q_i \end{cases}

● n is the number of hyponyms of each expanded term j.
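A simplified sketch of the expansion step; the neighbour lists, hyponym counts, and similarity values are hypothetical, and neighbourhood lookup in a real ontology (e.g. WordNet) is abstracted away:

def expand(query, neighbours, n_hyponyms, sim, T=0.9):
    """neighbours[j]: candidate expansion terms near query term j;
    n_hyponyms[j]: number of hyponyms of expanded term j."""
    expanded = dict(query)                  # existing terms keep their weight q_i
    for j, qj in query.items():
        for i in neighbours.get(j, []):
            if i != j and sim(i, j) >= T:   # new term i gets a share of q_j
                expanded[i] = expanded.get(i, 0.0) + (qj / n_hyponyms[j]) * sim(i, j)
    return expanded

q = {"software": 1.0}
nbrs = {"software": ["freeware"]}
sim = lambda i, j: 0.95 if {i, j} == {"software", "freeware"} else float(i == j)
print(expand(q, nbrs, {"software": 5}, sim))
# {'software': 1.0, 'freeware': 0.19}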
17. Information Retrieval
SSRM: IR with semantics ... (3/3)
Document Similarity:
● Once we have the expanded and re-weighted query vector and the document vector (built with tf-idf), we calculate the similarity between query q and document d as:

    \mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{sim}(i, j)}{\sum_i \sum_j q_i \cdot d_j}

● Properties:
  ● Symmetric.
  ● Normalized in [0, 1].
  ● Consistent behaviour.
  ● Can be easily tweaked for document-document similarity.
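A direct transcription of the SSRM query-document similarity (inputs are hypothetical, as before):

def ssrm_sim(q, d, sim):
    """q, d: sparse term-weight dicts; sim(i, j): term similarity."""
    num = sum(qi * dj * sim(i, j) for i, qi in q.items() for j, dj in d.items())
    den = sum(qi * dj for i, qi in q.items() for j, dj in d.items())
    return num / den if den else 0.0

sim = lambda i, j: 1.0 if i == j else (0.9 if {i, j} == {"software", "freeware"} else 0.0)
print(ssrm_sim({"software": 1.0}, {"freeware": 0.8}, sim))   # 0.9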
19. Information Retrieval
SSRM Implementation Notes:
● Quadratic time complexity, as opposed to (linear) VSM.
● The similarity between every pair of terms can be cached.
● It is expensive to expand and re-weight the document vectors as well, so we only re-weight and expand the queries. Expanding just one of the two vectors should incorporate enough semantic information.
● The thresholds (t, T) need to be adjusted for optimal behaviour.
● Although the behaviour of SSRM is consistent, it does not guarantee sim(d, d) = 1, i.e. even an exact search won't give a similarity value of 1.
● I had proposed the following formula last summer, and the results on MeSH were quite satisfactory:

    \mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{maxsim}_i}{\sum_i \sum_j q_i \cdot d_j}, \quad \text{where } \mathrm{maxsim}_i = \max_j \mathrm{sim}(i, j)
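A sketch of this variant (same hypothetical inputs as before); note that with it, sim(d, d) = 1 whenever sim(i, i) = 1 for every term:

def ssrm_maxsim(q, d, sim):
    """Each query term i contributes its best match among document terms."""
    maxsim = {i: max((sim(i, j) for j in d), default=0.0) for i in q}
    num = sum(qi * dj * maxsim[i] for i, qi in q.items() for j, dj in d.items())
    den = sum(qi * dj for i, qi in q.items() for j, dj in d.items())
    return num / den if den else 0.0

sim = lambda i, j: 1.0 if i == j else 0.0
doc = {"software": 0.8, "free": 0.4}
print(ssrm_maxsim(doc, doc, sim))   # 1.0 (an exact search now scores 1)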
21. Future ...
Possible Issues
● Negation
  Query: I like pizza → Match: I don't like pizza
● Antonymy
  Query: Slow runner → Match: Fast runner
● Role Reversal
  Query: Dog bites man → Match: Man bites dog

Further reading
● Groupwise Semantic Similarity
  ● Jaccard Index
  ● simLP, simUI, simGIC
● Statistical Semantic Similarity
  ● LSA: Latent Semantic Analysis
  ● NGD: Normalized Google Distance
  ● PMI: Pointwise Mutual Information
22. References
● A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering [Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, Xiaohua Zhou] [2007]
● Information Retrieval by Semantic Similarity [A. Hliaoutakis, G. Varelas, E. Voutsakis] [2006]
● Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy [Jay J. Jiang, David W. Conrath] [1997]
Thank you!