1. Seminar on Artificial Intelligence
Information Retrieval
Using
Semantic Similarity
Harshita Meena (100050020)
Diksha Meghwal (100050039)
Saswat Padhi (100050061)
2. Overview ...
● “Semantics” & “Ontology” (Diksha)
  ● What is IR lacking?
  ● Semantics: “What”? And How?
  ● Ontologies and knowledge representation
● Semantic Similarity (Harshita)
  ● Semantic Similarity: What? and How?
  ● Path based semantic similarity measures
  ● Information content based similarity measures
● Information Retrieval (Saswat)
  ● VSM Revisited
  ● SSRM: IR with semantics
  ● Conclusion and further reading
3. “Semantics” & “Ontology”
What is IR (without semantics) lacking?
“MEANING”
Query: software
Pool: application, program, package, freeware, shareware
Result: No match!!
This mismatch is the motivation for looking at semantic rather than lexical similarity.
The problem in information retrieval today is not a lack of data, but the
lack of “structured” and “meaningful” organisation of data.
Ontologies are attempts to organise information and empower IR.
4. “Semantics” & “Ontology”
Semantics: What? And How?
“Semantics” captures the meaning of linguistic terms. Computers do not
understand “meaning”, so the semantics of a term is instead represented
through its links to other terms.
An “ontology” formally represents knowledge as a set of concepts
within a domain, and the relationships between pairs of concepts. It
can be used to model a domain and support reasoning about entities.
Formal definition by Tom Gruber:
An ontology is a formal, explicit specification of a shared conceptualization
● formal: it should be machine readable
● explicit: the types of concepts and the constraints on their use are explicitly defined
● shared: the ontology is agreed upon and accepted by a group
● conceptualization: an abstract model that consists of the relevant concepts and the relationships between them
5. “Semantics” & “Ontology”
Components of Ontologies
● Classes: abstract groups, or collections of objects. They may contain individuals, other classes, or a combination of both. Classes can be extensional or intensional, and can subsume or be subsumed.
● Attributes: used to store information that is specific to the object they are attached to, such as its features or characteristics.
● Relationships: a relation is an attribute whose value is another object in the ontology, e.g. subsumption relations (is-superclass-of, the converse of is-a, is-subtype-of or is-subclass-of) and meronymy relations (part-of).
● A domain ontology (or domain-specific ontology) models a specific domain, or part of the world.
● An upper ontology (or foundation ontology) is a model of the common objects that are applicable across a range of domain ontologies.
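As a toy illustration (not from the slides), the core of such a structure can be sketched in Python; the concept names and hierarchy below are hypothetical:

# A minimal toy ontology: each concept maps to its parent (IS-A) concept.
IS_A = {
    "freeware":    "software",
    "shareware":   "software",
    "application": "software",
    "software":    "artifact",
    "artifact":    "entity",   # "entity" is the root concept
}

def ancestors(concept):
    """Return the path of concepts from `concept` up to the root."""
    path = [concept]
    while concept in IS_A:
        concept = IS_A[concept]
        path.append(concept)
    return path

print(ancestors("freeware"))   # ['freeware', 'software', 'artifact', 'entity']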
6. “Semantics” & “Ontology”
Examples of Popular Ontologies
WordNet
WordNet is a lexical database for the English language, which superficially resembles a thesaurus. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.

Medical Subject Headings (MeSH)
MeSH is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it can also serve as a thesaurus that facilitates searching. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM's catalog of book holdings.
7. “Semantics” & “Ontology”
The Future: “Semantic Web”, OWL and RDF ...
The Semantic Web is a collaborative movement led by the international standards
body W3C. It aims at converting the current web, dominated by semi-structured
documents, into an organised “web of data”.
RDF (Resource Description Framework) is a part of the W3C family of specifications,
which can be used as a general method for conceptual description or modeling of
information.

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>
OWL is built on top of RDF; it is more expressive and supports greater machine
interpretability than RDF.

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <owl:Ontology rdf:about="http://www.linkeddatatools.com/plants">
    <dc:title>The LinkedDataTools.com Example Plant Ontology</dc:title>
    <dc:description>An example ontology</dc:description>
  </owl:Ontology>
  <owl:Class rdf:about="http://www.linkeddatatools.com/plants#planttype">
    <rdfs:label>The plant type</rdfs:label>
    <rdfs:comment>The class of plant types.</rdfs:comment>
  </owl:Class>
</rdf:RDF>
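Such snippets can also be loaded and queried programmatically; a minimal sketch using the third-party rdflib Python library (assuming it is installed, e.g. via pip install rdflib):

import rdflib

rdf_xml = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""

g = rdflib.Graph()
g.parse(data=rdf_xml, format="xml")   # parse the RDF/XML example above
for subj, pred, obj in g:             # iterate over the extracted triples
    print(subj, pred, obj)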
8. Semantic Similarity
An ontology by itself is just a “structure”, without any weights on the edges.
Semantic similarity measures exploit this structural information and
try to quantify concept similarities in a given ontology.
Ontology based semantic measures can be classified as follows:
● Path Based Similarity Measures
  Path based similarity measures utilize the information of the shortest path between two concepts, their generality or specificity, and their relationship with other concepts.
● Information Content Based Similarity Measures
  Information content based measures associate with each concept a quantity IC that takes into account the probabilities of concepts in the ontology.
● Feature Based Similarity Measures (which we won't be discussing)
9. Semantic Similarity (Path Based)
Wu & Palmer Measure:

    \mathrm{sim}_{W\&P}(C_1, C_2) = \frac{2H}{N_1 + N_2 + 2H}

The Wu & Palmer measure fits the intuition that concepts at greater depth
should be more similar (because of their specificity).
N_1 and N_2 are the numbers of IS-A links from C_1 and C_2 respectively to
their most specific common subsumer concept C. H is the number of
IS-A links from C to the root of the ontology.

Li Measure:

    \mathrm{sim}_{Li}(C_1, C_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}}

Li combines the shortest path and the depth of the ontology
in a non-linear function.
L stands for the shortest path between the two concepts, and α and β are
scaling factors. H is the same as in the Wu & Palmer measure.
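A minimal sketch of both path based measures over a toy IS-A hierarchy (the hierarchy and the α, β defaults below are hypothetical, chosen only for illustration):

import math

# Toy IS-A hierarchy: child -> parent; "entity" is the root.
IS_A = {"freeware": "software", "application": "software",
        "software": "artifact", "artifact": "entity"}

def path_to_root(c):
    path = [c]
    while c in IS_A:
        c = IS_A[c]
        path.append(c)
    return path

def lcs_info(c1, c2):
    """Return (N1, N2, H): IS-A links from c1 and c2 to their most
    specific common subsumer C, and from C to the root."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    for n1, c in enumerate(p1):
        if c in p2:
            return n1, p2.index(c), len(path_to_root(c)) - 1
    raise ValueError("no common subsumer")

def sim_wu_palmer(c1, c2):
    n1, n2, h = lcs_info(c1, c2)
    return 2 * h / (n1 + n2 + 2 * h)

def sim_li(c1, c2, alpha=0.2, beta=0.6):
    n1, n2, h = lcs_info(c1, c2)
    L = n1 + n2                          # shortest IS-A path via the subsumer
    return math.exp(-alpha * L) * math.tanh(beta * h)   # tanh equals the fraction above

print(sim_wu_palmer("freeware", "application"))   # 2*2 / (1+1+2*2) ≈ 0.67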
10. Semantic Similarity (Path Based)
Leacock & Chodorow Measure:

    \mathrm{sim}_{L\&C}(C_1, C_2) = -\log \frac{L}{2H}

This is almost the same as the Wu & Palmer method, except for the
logarithmic smoothing and the removal of the depth factor from the denominator.
As in the Li measure, L is the shortest path between concepts C_1 and C_2.
H here is the overall depth of the ontology (the maximum number of IS-A
links from a concept to the root).
Mao Measure:

    \mathrm{sim}_{Mao}(C_1, C_2) = \frac{\delta}{L \cdot \log_2 (1 + d(C_1) + d(C_2))}

The Mao measure considers the generality of the concepts by taking into
account the number of their descendants.
L stands for the shortest path between the two concepts, and d(C) stands for
the number of descendants of C. δ is a constant (usually chosen as 0.9).
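Both measures are direct transcriptions of their formulas; a small sketch (the example inputs are hypothetical):

import math

def sim_leacock_chodorow(L, H):
    """L: shortest path between the two concepts; H: depth of the ontology."""
    return -math.log(L / (2 * H))

def sim_mao(L, d1, d2, delta=0.9):
    """d1, d2: numbers of descendants of the two concepts."""
    return delta / (L * math.log2(1 + d1 + d2))

print(sim_leacock_chodorow(L=2, H=10))   # -log(2/20) ≈ 2.30
print(sim_mao(L=2, d1=3, d2=0))          # 0.9 / (2 * log2(4)) = 0.225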
11. Semantic Similarity (IC Based)
The intuition behind information content is that more frequent terms
are more general and hence provide less “information”:

    \mathrm{IC}(C) = -\log p(C) = -\log \frac{\mathrm{freq}(C)}{\mathrm{freq}(root)}

freq(C) is the frequency of concept C, and freq(root) is the frequency of the
root concept of the ontology. Frequencies include the frequencies of
subsumed concepts in an IS-A hierarchy.
We call concept C the most informative subsumer of two concepts C_1 and C_2,
with information content IC_mis(C_1, C_2), if C has the least probability
among all shared subsumers of the two concepts (and is thus the most
informative).

Resnik Measure:

    \mathrm{sim}_{Resnik}(C_1, C_2) = \mathrm{IC}_{mis}(C_1, C_2)

The more information two terms share, the more similar they are.
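A minimal sketch of IC and the Resnik measure over hypothetical corpus frequencies (all counts and concept names are made up for illustration):

import math

# Hypothetical frequencies; each count already includes all subsumed concepts.
FREQ = {"entity": 1000, "artifact": 400, "software": 120, "freeware": 30}
IS_A = {"freeware": "software", "software": "artifact", "artifact": "entity"}

def ic(c):
    return -math.log(FREQ[c] / FREQ["entity"])

def subsumers(c):
    out = {c}
    while c in IS_A:
        c = IS_A[c]
        out.add(c)
    return out

def sim_resnik(c1, c2):
    shared = subsumers(c1) & subsumers(c2)
    return max(ic(c) for c in shared)   # IC of the most informative subsumer

print(round(sim_resnik("freeware", "software"), 2))   # IC("software") ≈ 2.12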
12. Semantic Similarity (IC Based)
Jiang Measure:

    \mathrm{dist}_{Jiang}(C_1, C_2) = \mathrm{IC}(C_1) + \mathrm{IC}(C_2) - 2\,\mathrm{IC}_{mis}(C_1, C_2)

The Jiang measure considers the information content of each term, apart
from the shared information content. It is an inverted (distance) measurement:
the distance between two concepts is the amount of information
needed to fully describe both concepts, excluding the amount of
information that is common to both of them.

Lin Measure:

    \mathrm{sim}_{Lin}(C_1, C_2) = \frac{2\,\mathrm{IC}_{mis}(C_1, C_2)}{\mathrm{IC}(C_1) + \mathrm{IC}(C_2)}

The Lin measure also uses the information contents of each term, but uses
them differently from Jiang: it takes a ratio instead of a difference.
Since IC_mis(C_1, C_2) ≤ IC(C_1) and IC(C_2), the similarity value is
normalized between 0 and 1 (1 for identical concepts).
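Both measures are simple arithmetic over IC values; a sketch with hypothetical inputs:

def dist_jiang(ic1, ic2, ic_mis):
    """Jiang distance from the ICs of two concepts and of their
    most informative subsumer."""
    return ic1 + ic2 - 2 * ic_mis

def sim_lin(ic1, ic2, ic_mis):
    """Lin similarity: ratio of shared to total information content."""
    return 2 * ic_mis / (ic1 + ic2)

# Hypothetical IC values, e.g. IC(freeware)=3.5 and IC(software)=IC_mis=2.1:
print(dist_jiang(3.5, 2.1, 2.1))   # 1.4
print(sim_lin(3.5, 2.1, 2.1))      # 0.75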
13. Semantic Similarity
Correlation with human judgements
WordNet Ontology                    MeSH Ontology
Method        Type  Correlation     Method        Type  Correlation
Wu & Palmer   Path  0.74            Wu & Palmer   Path  0.67
Li            Path  0.82            Li            Path  0.70
Leacock       Path  0.82            Leacock       Path  0.74
Resnik        IC    0.79            Resnik        IC    0.71
Lin           IC    0.82            Lin           IC    0.72
Jiang         IC    0.83            Jiang         IC    0.71
14. Information Retrieval
SSRM: IR with semantics ... (0/3)
VSM Revisited:
● Similarity in VSM is the cosine inner product:

    \mathrm{sim}(q, d) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \cdot \sqrt{\sum_i d_i^2}}

● Each dimension corresponds to a separate term; q and d are n-dimensional vectors with weights for each term.
● q_i and d_i are the weights of the query and document terms.
● The document term weight is d_i = tf_i · idf_i.
● Specifically, I will talk about the SSRM (Semantic Similarity Retrieval Model) algorithm, where we modify the query term weights to take semantic similarity into account.
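A minimal sketch of the baseline cosine similarity over sparse term-weight vectors (the example vectors are hypothetical):

import math

def cosine_sim(q, d):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"software": 1.0, "free": 0.5}       # query term weights
d = {"software": 0.8, "package": 0.3}    # document tf-idf weights
print(round(cosine_sim(q, d), 3))        # ≈ 0.838; note "package" never matches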
15. Information Retrieval
SSRM: IR with semantics ... (1/3)
Query Re-weighting:
● A query can contain related (semantically similar) terms:
  Query: free scientific computing software
● We need to re-weight the query terms to stress the particular concept we are searching for:

    q_i' = q_i + \sum_{j \ne i,\; \mathrm{sim}(i,j) \ge t} q_j \cdot \mathrm{sim}(i, j)

● q_i and q_i' are the old and new weights respectively.
● i and j refer to different terms in the query.
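A direct transcription of the re-weighting step (the similarity table and the threshold value are hypothetical):

def reweight(query, sim, t=0.8):
    """query: {term: weight}; sim(i, j): term similarity; t: threshold."""
    return {i: qi + sum(qj * sim(i, j)
                        for j, qj in query.items()
                        if j != i and sim(i, j) >= t)
            for i, qi in query.items()}

PAIRS = {frozenset(("software", "computing")): 0.9}   # toy similarities
sim = lambda i, j: 1.0 if i == j else PAIRS.get(frozenset((i, j)), 0.0)

print(reweight({"software": 1.0, "computing": 1.0, "free": 1.0}, sim))
# {'software': 1.9, 'computing': 1.9, 'free': 1.0}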
16. Information Retrieval
SSRM: IR with semantics ... (2/3)
Query Expansion:
● New terms might be semantically similar to the query terms. We “expand” the query by adding new terms from the neighbourhood of each query term in the ontology.
● Adding such terms also affects the weights of the existing terms:

    q_i' = \begin{cases} \sum_{j \ne i,\; \mathrm{sim}(i,j) \ge T} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ is a new term} \\ q_i + \sum_{j \ne i,\; \mathrm{sim}(i,j) \ge T} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ had weight } q_i \end{cases}

● n is the number of hyponyms of each expanded term j.
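A simplified sketch of the expansion step; the neighbour lists, hyponym counts, and similarity values are hypothetical, and neighbourhood lookup in a real ontology (e.g. WordNet) is abstracted away:

def expand(query, neighbours, n_hyponyms, sim, T=0.9):
    """neighbours[j]: candidate expansion terms near query term j;
    n_hyponyms[j]: number of hyponyms of expanded term j."""
    expanded = dict(query)                  # existing terms keep their weight q_i
    for j, qj in query.items():
        for i in neighbours.get(j, []):
            if i != j and sim(i, j) >= T:   # new term i gets a share of q_j
                expanded[i] = expanded.get(i, 0.0) + (qj / n_hyponyms[j]) * sim(i, j)
    return expanded

q = {"software": 1.0}
nbrs = {"software": ["freeware"]}
sim = lambda i, j: 0.95 if {i, j} == {"software", "freeware"} else float(i == j)
print(expand(q, nbrs, {"software": 5}, sim))
# {'software': 1.0, 'freeware': 0.19}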
17. Information Retrieval
SSRM: IR with semantics ... (3/3)
Document Similarity:
● Once we have the expanded and re-weighted query vector and the document vector (built with tf-idf), we calculate the similarity between query q and document d as:

    \mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{sim}(i, j)}{\sum_i \sum_j q_i \cdot d_j}

● Properties:
  ● Symmetric.
  ● Normalized in [0, 1].
  ● Consistent behaviour.
  ● Can be easily tweaked for document-document similarity.
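A direct transcription of the SSRM query-document similarity (inputs are hypothetical, as before):

def ssrm_sim(q, d, sim):
    """q, d: sparse term-weight dicts; sim(i, j): term similarity."""
    num = sum(qi * dj * sim(i, j) for i, qi in q.items() for j, dj in d.items())
    den = sum(qi * dj for i, qi in q.items() for j, dj in d.items())
    return num / den if den else 0.0

sim = lambda i, j: 1.0 if i == j else (0.9 if {i, j} == {"software", "freeware"} else 0.0)
print(ssrm_sim({"software": 1.0}, {"freeware": 0.8}, sim))   # 0.9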
19. Information Retrieval
SSRM Implementation Notes:
● Quadratic time complexity, as opposed to (linear) VSM.
● The similarity between every pair of terms can be cached.
● It is expensive to expand and re-weight the document vectors as well, so we only re-weight and expand the queries. Expanding just one of the two vectors should incorporate enough semantic information.
● The thresholds (t, T) need to be adjusted for optimal behaviour.
● Although the behaviour of SSRM is consistent, it does not guarantee sim(d, d) = 1, i.e. even an exact search won't give a similarity value of 1.
● I had proposed the following formula last summer, and the results on MeSH were quite satisfactory:

    \mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{maxsim}_i}{\sum_i \sum_j q_i \cdot d_j}, \quad \text{where } \mathrm{maxsim}_i = \max_j \mathrm{sim}(i, j)
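A sketch of this variant (same hypothetical inputs as before); note that with it, sim(d, d) = 1 whenever sim(i, i) = 1 for every term:

def ssrm_maxsim(q, d, sim):
    """Each query term i contributes its best match among document terms."""
    maxsim = {i: max((sim(i, j) for j in d), default=0.0) for i in q}
    num = sum(qi * dj * maxsim[i] for i, qi in q.items() for j, dj in d.items())
    den = sum(qi * dj for i, qi in q.items() for j, dj in d.items())
    return num / den if den else 0.0

sim = lambda i, j: 1.0 if i == j else 0.0
doc = {"software": 0.8, "free": 0.4}
print(ssrm_maxsim(doc, doc, sim))   # 1.0 (an exact search now scores 1)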
21. Future ...
Possible Issues
● Negation
  Query: I like pizza → Match: I don't like pizza
● Antonymy
  Query: Slow runner → Match: Fast runner
● Role Reversal
  Query: Dog bites man → Match: Man bites dog

Further reading
● Groupwise Semantic Similarity
  ● Jaccard Index
  ● simLP, simUI, simGIC
● Statistical Semantic Similarity
  ● LSA: Latent Semantic Analysis
  ● NGD: Normalized Google Distance
  ● PMI: Pointwise Mutual Information
22. References
● A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering [Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, Xiaohua Zhou] [2007]
● Information Retrieval by Semantic Similarity [A. Hliaoutakis, G. Varelas, E. Voutsakis] [2006]
● Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy [Jay J. Jiang, David W. Conrath] [1997]
Thank you!