Seminar on Artificial Intelligence

Information Retrieval
Using
Semantic Similarity

Harshita Meena (100050020)
Diksha Meghwal (100050039)
Saswat Padhi (100050061)
Overview ...
● “Semantics” & “Ontology” (Diksha)
    ● What is IR lacking?
    ● Semantics: What? And How?
    ● Ontologies and knowledge representation
● Semantic Similarity (Harshita)
    ● Semantic Similarity: What? And How?
    ● Path based semantic similarity measures
    ● Information content based similarity measures
● Information Retrieval (Saswat)
    ● VSM Revisited
    ● SSRM: IR with semantics
    ● Conclusion and further reading
“Semantics” & “Ontology”
What is IR (without semantics) lacking?
“MEANING”
Query: software
Pool: application, program, package, freeware, shareware
Result: No match!!
This is the motivation for looking at semantic rather than lexical similarity.
The problem today in information retrieval is not a lack of data, but the
lack of “structured” and “meaningful organisation” of data.
Ontologies are attempts to organise information and empower IR.
“Semantics” & “Ontology”
Semantics: What? And How?
“Semantics” captures the meaning of linguistic terms. Computers
do not understand “meaning”, so the meaning of a term is instead
represented through its links to other terms.
An “ontology” formally represents knowledge as a set of concepts
within a domain, and the relationships between pairs of concepts. It
can be used to model a domain and support reasoning about entities.
Formal definition by Tom Gruber:
An ontology is a formal, explicit specification of a shared conceptualization
● formal: it should be machine readable
● explicit: the types of concepts and the constraints on them are explicitly defined
● shared: the ontology is agreed upon and accepted by a group
● conceptualization: an abstract model that consists of the relevant concepts
and the relationships between them
“Semantics” & “Ontology”
Components of Ontologies
● Classes: Classes are abstract groups, or collections of objects. They
may contain individuals, other classes, or a combination of both.
Classes can be extensional or intensional, subsume or be subsumed.
● Attributes: Used to store information that is specific to the object
they are attached to, such as its features or characteristics.
● Relationships: A relation is an attribute whose value is another
object in the ontology. E.g. subsumption relations (is-superclass-of,
the converse of is-a, is-subtype-of or is-subclass-of) and meronymy
relations (part-of).
● Domain ontology (or domain-specific ontology) models a specific
domain, or part of the world.
● Upper ontology (or foundation ontology) is a model of the common
objects that are applicable across a range of domain ontologies.
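As a minimal sketch of how classes and subsumption (is-a) relations might be held in memory, assume a toy taxonomy invented purely for illustration; the concept names and the `PARENT` mapping below are hypothetical and are not taken from WordNet, MeSH or any other real ontology.

```python
# Toy is-a hierarchy: each concept maps to its direct superclass
# (None marks the root). All names are hypothetical.
PARENT = {
    "entity": None,
    "software": "entity",
    "application": "software",
    "freeware": "software",
    "editor": "application",
}

def ancestors(concept):
    """Return the concept followed by every class that subsumes it, up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def subsumes(general, specific):
    """True if `general` is `specific` itself or one of its superclasses."""
    return general in ancestors(specific)
```

Here `subsumes("software", "editor")` holds because "editor" is-a "application" is-a "software", while `subsumes("freeware", "editor")` does not.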
“Semantics” & “Ontology”
Examples of Popular Ontologies
WordNet
WordNet is a lexical database for the
English language, which superficially
resembles a thesaurus. It groups
English words into sets of synonyms
called synsets, provides short, general
definitions, and records the various
semantic relations between these
synonym sets.

Medical Subject Headings (MeSH)
MeSH is a comprehensive controlled
vocabulary for the purpose of indexing
journal articles and books in the life
sciences; it can also serve as a thesaurus
that facilitates searching. Created and
updated by the United States National
Library of Medicine (NLM), it is used by the
MEDLINE/PubMed article database and by
NLM's catalog of book holdings.
“Semantics” & “Ontology”
The Future: “Semantic Web”, OWL and RDF ...
The Semantic Web is a collaborative movement led by the international standards
body W3C. It aims at converting the current web, dominated by
semi-structured documents, into an organised "web of data".
RDF (Resource Description Framework) is part of the W3C family of specifications
and can be used as a general method for conceptual description or modeling of
information.
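<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
<dc:title>Tony Benn</dc:title>
<dc:publisher>Wikipedia</dc:publisher>
</rdf:Description>
</rdf:RDF>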

OWL (Web Ontology Language) is built on top of RDF; it is more expressive and
supports greater machine interpretability than RDF.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<owl:Ontology rdf:about="http://www.linkeddatatools.com/plants">
<dc:title>The LinkedDataTools.com Example Plant Ontology</dc:title>
<dc:description>An example ontology</dc:description>
</owl:Ontology>
<owl:Class rdf:about="http://www.linkeddatatools.com/plants#planttype">
<rdfs:label>The plant type</rdfs:label>
<rdfs:comment>The class of plant types.</rdfs:comment>
</owl:Class>
</rdf:RDF>
Semantic Similarity
An ontology is just a “structure”, without any weights on the edges.
Semantic similarity measures exploit this structural information and
try to quantify the similarity of concepts in a given ontology.
Ontology based semantic measures can be classified as follows:
● Path Based Similarity Measures
Path based similarity measures utilize the information of the
shortest path between two concepts, their generality or specificity
and their relationship with other concepts.
● Information Content Based Similarity Measures
Information content based measures associate with each concept a
quantity IC, which takes into account the probabilities of concepts
in the ontology.
● Feature Based Similarity Measures (not discussed here)
Semantic Similarity (Path Based)
Wu & Palmer Measure:

\[ \mathrm{sim}_{W\&P}(C_1, C_2) = \frac{2H}{N_1 + N_2 + 2H} \]

The Wu and Palmer measure fits the intuition that concepts with greater
depth would be more similar (because of specificity).
N1 and N2 are the number of IS-A links from C1 and C2 respectively to
the most specific common subsumer concept C. H is the number of
IS-A links from C to the root of the ontology.

Li Measure:

\[ \mathrm{sim}_{Li}(C_1, C_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}} \]

Li combines the shortest path and the depth of ontology information
in a non-linear function.
L stands for the shortest path between two concepts, α and β are
scaling factors. H is the same as in the Wu & Palmer measure.
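The two path based formulas above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the `PARENT` taxonomy is a hypothetical toy tree, the shortest path L is taken to be N1 + N2 (true in a tree, where the only route between two concepts passes through their common subsumer), and the defaults `alpha=0.2` and `beta=0.6` are assumed values chosen for the example.

```python
import math

# Hypothetical toy taxonomy: concept -> direct IS-A parent (None marks the root).
PARENT = {
    "entity": None,
    "software": "entity",
    "application": "software",
    "freeware": "software",
    "editor": "application",
}

def path_to_root(c):
    """Concepts from c up to the root, inclusive."""
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path

def links(c1, c2):
    """Return (N1, N2, H) for the most specific common subsumer of c1 and c2."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    common = next(c for c in p1 if c in p2)  # first shared ancestor is the most specific
    n1, n2 = p1.index(common), p2.index(common)
    h = len(path_to_root(common)) - 1        # IS-A links from the subsumer to the root
    return n1, n2, h

def sim_wu_palmer(c1, c2):
    # Undefined (0/0) when both concepts are the root itself.
    n1, n2, h = links(c1, c2)
    return 2 * h / (n1 + n2 + 2 * h)

def sim_li(c1, c2, alpha=0.2, beta=0.6):
    n1, n2, h = links(c1, c2)
    shortest = n1 + n2                       # L, assuming a tree-shaped taxonomy
    # (e^x - e^-x) / (e^x + e^-x) is tanh(x), here with x = beta * H
    return math.exp(-alpha * shortest) * math.tanh(beta * h)
```

For "editor" and "freeware" the common subsumer is "software", so N1 = 2, N2 = 1, H = 1 and the Wu & Palmer value is 2 / (2 + 1 + 2) = 0.4.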
Semantic Similarity (Path Based)
Leacock & Chodorow Measure:

\[ \mathrm{sim}_{L\&C}(C_1, C_2) = -\log \frac{L}{2H} \]

This is almost the same as the Wu & Palmer method, except for logarithmic
smoothing and the removal of the depth factor from the denominator.
As in the Li measure, L is the shortest path between concepts C1 and
C2. H is the number of IS-A links from C to the root of the ontology.
(In Leacock & Chodorow's original formulation the denominator is 2D,
where D is the maximum depth of the taxonomy.)

Mao Measure:

\[ \mathrm{sim}_{Mao}(C_1, C_2) = \frac{\delta}{L \cdot \log_2 (1 + d(C_1) + d(C_2))} \]

The Mao measure considers the generality of the concepts by taking into
account the number of descendants.
L stands for the shortest path between two concepts, d(C) stands for the
number of descendants of C. δ is a constant (usually chosen as 0.9).
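A small sketch of both formulas as plain functions of the quantities named on this slide. The function and parameter names are hypothetical; the caller is assumed to have already computed the shortest path, the depth and the descendant counts from the ontology. Both formulas are undefined for some inputs (a zero-length path, or two leaf concepts in the Mao measure), so the sketch raises an error there rather than guessing a value.

```python
import math

def sim_leacock_chodorow(shortest_path, depth):
    """-log(L / (2 * depth)); requires L > 0 and depth > 0."""
    if shortest_path <= 0 or depth <= 0:
        raise ValueError("shortest_path and depth must be positive")
    return -math.log(shortest_path / (2 * depth))

def sim_mao(shortest_path, descendants_1, descendants_2, delta=0.9):
    """delta / (L * log2(1 + d(C1) + d(C2)))."""
    denominator = shortest_path * math.log2(1 + descendants_1 + descendants_2)
    if denominator == 0:
        raise ValueError("undefined for L = 0 or for two concepts without descendants")
    return delta / denominator
```

For example, with L = 2, d(C1) = 1 and d(C2) = 2 the Mao value is 0.9 / (2 · log2 4) = 0.225.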
Semantic Similarity (IC Based)
The intuition behind information content is that more frequent terms
are more general and hence provide less “information”:

\[ IC(C) = -\log p(C) = -\log \frac{\mathrm{freq}(C)}{\mathrm{freq}(\mathrm{root})} \]

freq(C) is the frequency of concept C, and freq(root) is the frequency of
the root concept of the ontology. Frequency includes the frequencies of
subsumed concepts in an IS-A hierarchy.
We call concept C the most informative subsumer of two concepts C1
and C2 if C has the least probability among all subsumers shared by
the two concepts (and is thus the most informative). Its information
content is written ICmis(C1,C2).

Resnik Measure:

\[ \mathrm{sim}_{Resnik}(C_1, C_2) = IC_{mis}(C_1, C_2) \]

The more information two terms share, the more similar they are.
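To make the frequency propagation concrete, here is a sketch on a hypothetical toy taxonomy with invented corpus counts (`PARENT` and `RAW_COUNT` are assumptions for illustration only). The frequency of a concept sums its own count and the counts of everything it subsumes, so the root always has probability 1 and information content 0.

```python
import math

# Hypothetical taxonomy and raw corpus counts, invented for illustration.
PARENT = {"entity": None, "software": "entity", "application": "software",
          "freeware": "software", "editor": "application"}
RAW_COUNT = {"entity": 2, "software": 4, "application": 2, "freeware": 1, "editor": 1}

def subsumers(c):
    """c together with all of its ancestors."""
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out

def freq(c):
    """Frequency of c, including every concept that c subsumes."""
    return sum(n for concept, n in RAW_COUNT.items() if c in subsumers(concept))

def ic(c):
    """Information content: -log of the concept's probability."""
    return -math.log(freq(c) / freq("entity"))

def sim_resnik(c1, c2):
    """IC of the most informative subsumer shared by c1 and c2."""
    shared = set(subsumers(c1)) & set(subsumers(c2))
    return max(ic(c) for c in shared)
```

With these counts freq("entity") = 10 and freq("application") = 3, so the Resnik similarity of "editor" and "application" is -log(3/10), while "editor" and "freeware" only share "software" and the root, giving -log(8/10).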
Semantic Similarity (IC Based)
Jiang Measure:

\[ \mathrm{dist}_{Jiang}(C_1, C_2) = IC(C_1) + IC(C_2) - 2\,IC_{mis}(C_1, C_2) \]

The Jiang measure considers the information content of each term in
addition to the shared information content. It is an inverted measurement,
i.e. a distance rather than a similarity.
The distance between two concepts is the amount of information
needed to fully describe both the concepts, excluding the amount of
information that is common to both of them.

Lin Measure:

\[ \mathrm{sim}_{Lin}(C_1, C_2) = \frac{2\,IC_{mis}(C_1, C_2)}{IC(C_1) + IC(C_2)} \]

The Lin measure also uses the information content of each term, but
differently from Jiang: it takes a ratio instead of a difference.
Since ICmis(C1,C2) ≤ IC(C1) and ICmis(C1,C2) ≤ IC(C2), the similarity value
is normalized between 0 and 1 (most similar concepts).
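Both measures are simple arithmetic on three information content values, which the sketch below takes as plain numbers. The function names are hypothetical, and the inputs are assumed to have been computed beforehand with IC(C) = -log p(C).

```python
def dist_jiang(ic1, ic2, ic_mis):
    """Jiang distance: information needed to describe both concepts,
    minus what they share (counted once per concept)."""
    return ic1 + ic2 - 2 * ic_mis

def sim_lin(ic1, ic2, ic_mis):
    """Lin similarity: shared information as a fraction of the total."""
    total = ic1 + ic2
    if total == 0:
        raise ValueError("undefined when both concepts have zero information content")
    return 2 * ic_mis / total
```

For IC(C1) = 2, IC(C2) = 3 and ICmis = 1 the Jiang distance is 3 and the Lin similarity is 2/5 = 0.4. When both concepts coincide with their subsumer (all three values equal), the distance is 0 and the similarity is 1.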
Semantic Similarity
Correlation with human judgements

Method         Type   Correlation          Correlation
                      (WordNet Ontology)   (MeSH Ontology)
Wu & Palmer    Path   0.74                 0.67
Li             Path   0.82                 0.70
Leacock        Path   0.82                 0.74
Resnik         IC     0.79                 0.71
Lin            IC     0.82                 0.72
Jiang          IC     0.83                 0.71
Information Retrieval
SSRM: IR with semantics ... (0/3)
VSM Revisited:
● Similarity in VSM is the normalized inner product (cosine) of the
query and document vectors:

\[ \mathrm{sim}(q, d) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \cdot \sqrt{\sum_i d_i^2}} \]

● Each dimension corresponds to a separate term. q and d are
n-dimensional vectors with weights for each term.
● qi and di are the weights of term i in the query and in the document
● The document term weight is di = tfi • idfi
● Specifically, I will talk about the SSRM algorithm (Semantic
Similarity Retrieval Model), where we modify the query term
weights to consider semantic similarity.
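A minimal sketch of the VSM formula, assuming sparse vectors stored as term-to-weight dictionaries and the common idf = log(N / df) variant (the slide does not say which idf variant is meant, so that choice is an assumption, as are all function names).

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Weight each known term by tf * idf, with idf = log(N / df)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf if t in doc_freq}

def cosine(q, d):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0
```

Because only identical terms contribute to the dot product, `cosine({"software": 1.0}, {"application": 1.0})` is 0.0 even though the two terms are closely related in meaning. That is the gap the semantic re-weighting is meant to close.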
Information Retrieval
SSRM: IR with semantics ... (1/3)
Query Re-weighting:
● A query can contain related (semantically similar) terms
Query: free scientific computing software
● We need to re-weight the query terms to stress the particular concept
we are searching for.

\[ q_i' = q_i + \sum_{\substack{j \neq i \\ \mathrm{sim}(i, j) \geq t}} q_j \cdot \mathrm{sim}(i, j) \]

● qi and qi' are the old and new weights respectively
● i and j refer to different terms in the query; t is a similarity threshold.
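The re-weighting rule can be sketched as below. The similarity function is passed in by the caller; `toy_sim`, its values and the default threshold `t=0.8` are hypothetical and only serve the example.

```python
def reweight(query, sim, t=0.8):
    """q_i' = q_i + sum of q_j * sim(i, j) over all j != i with sim(i, j) >= t."""
    new_weights = {}
    for i, qi in query.items():
        boost = sum(qj * sim(i, j) for j, qj in query.items()
                    if j != i and sim(i, j) >= t)
        new_weights[i] = qi + boost
    return new_weights

# Hypothetical similarity values, invented for illustration.
TOY_SIM = {frozenset({"computing", "software"}): 0.9}

def toy_sim(a, b):
    return 1.0 if a == b else TOY_SIM.get(frozenset({a, b}), 0.0)

query = {"free": 1.0, "computing": 1.0, "software": 2.0}
reweighted = reweight(query, toy_sim)
```

In this toy run "computing" gains 2.0 · 0.9 from "software" and "software" gains 1.0 · 0.9 from "computing", while "free" has no similar neighbour in the query and keeps its weight.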
Information Retrieval
SSRM: IR with semantics ... (2/3)
Query Expansion:
● New terms might be semantically similar to the query terms. We
“expand” the query by adding new terms from the neighbourhood of
each query term in the ontology.
● Adding such terms also affects the weights of existing terms.

\[ q_i' = \begin{cases}
\sum\limits_{\substack{j \neq i \\ \mathrm{sim}(i, j) \geq T}} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ is a new term} \\
q_i + \sum\limits_{\substack{j \neq i \\ \mathrm{sim}(i, j) \geq T}} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ had weight } q_i
\end{cases} \]

● n is the number of hyponyms for each expanded term j.
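One way to read the expansion rule in code, under loud assumptions: the neighbourhood of a query term is taken to be its list of hyponyms (the slide only says "neighbourhood"), n is the length of that list, and the `hyponyms` mapping, `toy_sim` values and default `T=0.9` are all hypothetical.

```python
def expand(query, hyponyms, sim, T=0.9):
    """Add terms near each query term j; each contributes (q_j / n) * sim(i, j)."""
    new_weights = dict(query)            # existing terms start from their old weight q_i
    for j, qj in query.items():
        candidates = hyponyms.get(j, [])
        n = len(candidates)              # number of hyponyms of the expanded term j
        for i in candidates:
            s = sim(i, j)
            if i != j and s >= T:
                # new terms start from 0, existing terms are incremented
                new_weights[i] = new_weights.get(i, 0.0) + (qj / n) * s
    return new_weights

# Hypothetical similarity values and hyponym lists, invented for illustration.
TOY_SIM = {("application", "software"): 0.95, ("freeware", "software"): 0.5}

def toy_sim(a, b):
    return 1.0 if a == b else TOY_SIM.get(tuple(sorted((a, b))), 0.0)

TOY_HYPONYMS = {"software": ["application", "freeware"]}
expanded = expand({"software": 1.0}, TOY_HYPONYMS, toy_sim)
expanded_existing = expand({"software": 1.0, "application": 2.0}, TOY_HYPONYMS, toy_sim)
```

"application" passes the threshold and enters the query with weight (1.0 / 2) · 0.95 = 0.475; "freeware" (similarity 0.5) is left out. When "application" is already in the query with weight 2.0, the same amount is added on top.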
Information Retrieval
SSRM: IR with semantics ... (3/3)
Document Similarity:
● After we have the expanded and re-weighted query vectors and the
document vectors using tf-idf, we calculate the similarity between
query q and document d as:

\[ \mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{sim}(i, j)}{\sum_i \sum_j q_i \cdot d_j} \]

● Properties:
    ● Symmetric.
    ● Normalized in [0,1].
    ● Consistent behaviour.
    ● Can be easily tweaked for document-document similarity.
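The double sum translates directly into code. This is a sketch assuming sparse term-weight dictionaries and a caller-supplied term similarity; `toy_sim` and its single similarity value are hypothetical.

```python
def ssrm_similarity(q, d, sim):
    """Weighted average of sim(i, j) over all query-term / document-term pairs."""
    numerator = sum(qi * dj * sim(i, j)
                    for i, qi in q.items() for j, dj in d.items())
    denominator = sum(qi * dj for qi in q.values() for dj in d.values())
    return numerator / denominator if denominator else 0.0

# Hypothetical similarity values, invented for illustration.
TOY_SIM = {frozenset({"software", "application"}): 0.9}

def toy_sim(a, b):
    return 1.0 if a == b else TOY_SIM.get(frozenset({a, b}), 0.0)

related = ssrm_similarity({"software": 1.0}, {"application": 1.0}, toy_sim)
unrelated_pair = {"software": 1.0, "pizza": 1.0}
self_similarity = ssrm_similarity(unrelated_pair, unrelated_pair, toy_sim)
```

Unlike the cosine measure, the related pair scores 0.9 despite sharing no term. Every cross pair is averaged in, so a vector of two unrelated terms compared with itself scores 0.5 rather than 1.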
Information Retrieval
SSRM: At a glance
Information Retrieval
SSRM Implementation Notes:
● Quadratic time complexity, as opposed to linear for VSM.
● The similarity between every pair of terms can be precomputed and hashed.
● It is expensive to expand and re-weight the document vectors as well,
so only queries are re-weighted and expanded. Expanding one of the
vectors should incorporate enough semantic information.
● Thresholds (t, T) need to be adjusted for optimal behaviour.
● Although the behaviour of SSRM is consistent, SSRM won't result in
sim(d,d) = 1, i.e. even an exact match won't give a similarity value of 1.
● I had proposed the following formula last summer and the results
on MeSH were quite satisfactory:

\[ \mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{maxsim}_i}{\sum_i \sum_j q_i \cdot d_j}, \quad \text{where } \mathrm{maxsim}_i = \max_j \mathrm{sim}(i, j) \]
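A sketch of the maxsim variant, written straight from the formula on this slide and not from the presenter's actual implementation; the dictionaries, `toy_sim` and its values are hypothetical. Each query term is scored by its best match among the document terms, so a term that occurs in the document always gets maxsim = 1.

```python
def maxsim_similarity(q, d, sim):
    """Like the SSRM double sum, but sim(i, j) is replaced by max over j of sim(i, j)."""
    if not q or not d:
        return 0.0
    d_total = sum(d.values())
    numerator = 0.0
    for i, qi in q.items():
        maxsim_i = max(sim(i, j) for j in d)   # best match for query term i
        numerator += qi * maxsim_i * d_total   # sum over j of q_i * d_j * maxsim_i
    denominator = sum(q.values()) * d_total
    return numerator / denominator if denominator else 0.0

# Hypothetical similarity values, invented for illustration.
TOY_SIM = {frozenset({"software", "application"}): 0.9}

def toy_sim(a, b):
    return 1.0 if a == b else TOY_SIM.get(frozenset({a, b}), 0.0)

pair = {"software": 1.0, "pizza": 1.0}
self_similarity = maxsim_similarity(pair, pair, toy_sim)
related = maxsim_similarity({"software": 1.0}, {"application": 1.0, "pizza": 1.0}, toy_sim)
```

Here a vector compared with itself scores 1, and an unrelated extra document term ("pizza") no longer drags down the score of a good match.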
Experimental Results

IR on OHSUMED using MeSH

IR on web using WordNet
Future ...
Possible Issues
● Negation
    ● Query: I like pizza        Match: I don't like pizza
● Antonymy
    ● Query: Slow runner         Match: Fast runner
● Role Reversal
    ● Query: Dog bites man       Match: Man bites dog
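The role reversal case is easy to reproduce: any model that reduces text to a bag of term weights cannot see word order. A tiny sketch, with a deliberately naive whitespace tokenizer assumed for illustration:

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case, split on whitespace, and count terms (word order is discarded)."""
    return Counter(text.lower().split())

query = bag_of_words("Dog bites man")
match = bag_of_words("Man bites dog")
indistinguishable = query == match
```

Both sentences produce the same term counts, so every similarity measure defined on these vectors treats them as identical.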

Further reading
● Groupwise Semantic Similarity
    ● Jaccard Index
    ● simLP, simUI, simGIC
● Statistical Semantic Similarity
    ● LSA: Latent Semantic Analysis
    ● NGD: Normalized Google Distance
    ● PMI: Pointwise Mutual Information
References
● A comparative study of ontology based term similarity measures
on PubMed Document Clustering [Xiaodan Zhang, Liping Jing, Xiaohua Hu,
Michael Ng, Xiaohua Zhou] [2007].
● Information retrieval by semantic similarity [A. Hliaoutakis, G. Varelas,
E. Voutsakis] [2006].
● Semantic Similarity Based on Corpus Statistics and Lexical
Taxonomy [Jay J. Jiang, David W. Conrath] [1997].

Thank you!

Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 

Kürzlich hochgeladen (20)

fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 

Information Retrieval using Semantic Similarity

  • 1. Seminar on Artificial Intelligence: Information Retrieval Using Semantic Similarity. Harshita Meena (100050020), Diksha Meghwal (100050039), Saswat Padhi (100050061)
  • 2. Overview
      ● “Semantics” & “Ontology” (Diksha)
          ● What is IR lacking?
          ● Semantics: What? And how?
          ● Ontologies and knowledge representation
      ● Semantic Similarity (Harshita)
          ● Semantic similarity: What? And how?
          ● Path based semantic similarity measures
          ● Information content based similarity measures
      ● Information Retrieval (Saswat)
          ● VSM revisited
          ● SSRM: IR with semantics
          ● Conclusion and further reading
  • 3. “Semantics” & “Ontology”: What is IR (without semantics) lacking? “MEANING”
      Query: software
      Pool: application, program, package, freeware, shareware
      Result: No match!!
      This is the motivation for looking at semantic rather than lexical similarity. The problem in information retrieval today is not a lack of data, but a lack of “structured” and “meaningful organisation” of data. Ontologies are attempts to organise information and empower IR.
  • 4. “Semantics” & “Ontology”: Semantics: What? And how?
      “Semantics” captures the meaning of linguistic terms. Computers do not understand “meaning”, so the meaning of a term is instead represented through its links to other terms.
      An “ontology” formally represents knowledge as a set of concepts within a domain, together with the relationships between pairs of concepts. It can be used to model a domain and to support reasoning about entities.
      Formal definition by Tom Gruber: an ontology is a formal, explicit specification of a shared conceptualization.
      ● formal: it should be machine readable
      ● explicit: the types of concepts and the constraints on them are explicitly defined
      ● shared: the ontology is agreed upon and accepted by a group
      ● conceptualization: an abstract model that consists of the relevant concepts and the relationships between them
  • 5. “Semantics” & “Ontology”: Components of Ontologies
      ● Classes: Classes are abstract groups, or collections of objects. They may contain individuals, other classes, or a combination of both. Classes can be extensional or intensional, and can subsume or be subsumed.
      ● Attributes: Used to store information that is specific to the object they are attached to, such as its features or characteristics.
      ● Relationships: A relation is an attribute whose value is another object in the ontology. E.g. subsumption relations (is-superclass-of, the converse of is-a, is-subtype-of or is-subclass-of) and meronym relations (part-of).
      ● Domain ontology (or domain-specific ontology): models a specific domain, or part of the world.
      ● Upper ontology (or foundation ontology): a model of the common objects that are applicable across a range of domain ontologies.
  • 6. “Semantics” & “Ontology”: Examples of Popular Ontologies
      ● WordNet: a lexical database for the English language, which superficially resembles a thesaurus. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.
      ● Medical Subject Headings (MeSH): a comprehensive controlled vocabulary for indexing journal articles and books in the life sciences; it can also serve as a thesaurus that facilitates searching. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM's catalog of book holdings.
  • 7. “Semantics” & “Ontology”: The Future: “Semantic Web”, OWL and RDF ...
      The Semantic Web is a collaborative movement led by the international standards body W3C. It aims to convert the current web, dominated by semi-structured documents, into an organised “web of data”.
      RDF (Resource Description Framework) is part of the W3C family of specifications and can be used as a general method for the conceptual description or modeling of information.
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
        <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
          <dc:title>Tony Benn</dc:title>
          <dc:publisher>Wikipedia</dc:publisher>
        </rdf:Description>
      </rdf:RDF>
      OWL is built on top of RDF; it is more expressive and supports greater machine interpretability than RDF.
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
               xmlns:owl="http://www.w3.org/2002/07/owl#"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
        <owl:Ontology rdf:about="http://www.linkeddatatools.com/plants">
          <dc:title>The LinkedDataTools.com Example Plant Ontology</dc:title>
          <dc:description>An example ontology</dc:description>
        </owl:Ontology>
        <owl:Class rdf:about="http://www.linkeddatatools.com/plants#planttype">
          <rdfs:label>The plant type</rdfs:label>
          <rdfs:comment>The class of plant types.</rdfs:comment>
        </owl:Class>
      </rdf:RDF>
  • 8. Semantic Similarity
      An ontology is just a “structure”, without any weights on the edges. Semantic similarity measures exploit this structural information and try to quantify the similarity of concepts in a given ontology. Ontology based semantic measures can be classified as follows:
      ● Path Based Similarity Measures: utilize the shortest path between two concepts, their generality or specificity, and their relationships with other concepts.
      ● Information Content Based Similarity Measures: associate with each concept a quantity IC, which takes into account the probabilities of concepts in the ontology.
      ● Feature Based Similarity Measures (we won't be discussing these).
  • 9. Semantic Similarity (Path Based)
      Wu & Palmer Measure:
      $$\mathrm{sim}_{W\&P}(C_1, C_2) = \frac{2H}{N_1 + N_2 + 2H}$$
      The Wu and Palmer measure fits the intuition that concepts at greater depth are more similar (because they are more specific). $N_1$ and $N_2$ are the numbers of IS-A links from $C_1$ and $C_2$ respectively to their most specific common subsumer concept $C$. $H$ is the number of IS-A links from $C$ to the root of the ontology.
      Li Measure:
      $$\mathrm{sim}_{Li}(C_1, C_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}}$$
      Li combines the shortest path and the depth information of the ontology in a non-linear function. $L$ is the length of the shortest path between the two concepts, and $\alpha$ and $\beta$ are scaling factors. $H$ is the same as in the Wu & Palmer measure.
  • 10. Semantic Similarity (Path Based)
      Leacock & Chodorow Measure:
      $$\mathrm{sim}_{L\&C}(C_1, C_2) = -\log \frac{L}{2H}$$
      This is almost the same as the Wu & Palmer method, except for the logarithmic smoothing and the removal of the depth factor from the denominator. As in the Li measure, $L$ is the length of the shortest path between concepts $C_1$ and $C_2$. $H$ is the number of IS-A links from $C$ to the root of the ontology.
      Mao Measure:
      $$\mathrm{sim}_{Mao}(C_1, C_2) = \frac{\delta}{L \cdot \log_2\left(1 + d(C_1) + d(C_2)\right)}$$
      The Mao measure considers the generality of the concepts by taking into account their number of descendants. $L$ is the length of the shortest path between the two concepts, and $d(C)$ is the number of descendants of $C$. $\delta$ is a constant (usually chosen as 0.9).
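The Wu & Palmer and Li measures from the path based slides can be sketched on a small tree-shaped taxonomy. This is only an illustration: the `PARENT` taxonomy is a made-up toy (not WordNet or MeSH), and the default values `alpha=0.2` and `beta=0.6` are assumed scaling factors rather than values given in the slides.

```python
import math

# Hypothetical toy IS-A taxonomy (child -> parent); not taken from WordNet or MeSH.
PARENT = {
    "entity": None,
    "software": "entity",
    "application": "software",
    "freeware": "software",
    "shareware": "software",
    "hardware": "entity",
}

def ancestors(c):
    """Return the chain [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def common_subsumer(c1, c2):
    """Most specific concept that subsumes both c1 and c2."""
    seen = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in seen)

def links(c, ancestor):
    """Number of IS-A links from c up to one of its ancestors."""
    return ancestors(c).index(ancestor)

def wu_palmer(c1, c2):
    c = common_subsumer(c1, c2)
    n1, n2 = links(c1, c), links(c2, c)
    h = links(c, ancestors(c)[-1])  # IS-A links from the subsumer to the root
    return 2 * h / (n1 + n2 + 2 * h)

def li(c1, c2, alpha=0.2, beta=0.6):
    c = common_subsumer(c1, c2)
    path = links(c1, c) + links(c2, c)  # shortest path L (valid in a tree)
    h = links(c, ancestors(c)[-1])
    # tanh(beta*h) equals (e^{bh} - e^{-bh}) / (e^{bh} + e^{-bh})
    return math.exp(-alpha * path) * math.tanh(beta * h)

print(wu_palmer("freeware", "shareware"))
print(li("freeware", "shareware"))
```

Concepts whose only common subsumer is the root (here `freeware` and `hardware`) get a similarity of 0 under both measures, because H = 0.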
  • 11. Semantic Similarity (IC Based)
      The intuition behind information content is that more frequent terms are more general and hence provide less “information”:
      $$IC(C) = -\log p(C) = -\log \frac{\mathrm{freq}(C)}{\mathrm{freq}(\mathrm{root})}$$
      $\mathrm{freq}(C)$ is the frequency of concept $C$, and $\mathrm{freq}(\mathrm{root})$ is the frequency of the root concept of the ontology. The frequency of a concept includes the frequencies of the concepts it subsumes in the IS-A hierarchy.
      We call concept $C$ the most informative subsumer of two concepts $C_1$ and $C_2$, and write its information content as $IC_{mis}(C_1, C_2)$, if $C$ has the lowest probability among all shared subsumers of the two concepts (and is thus the most informative).
      Resnik Measure:
      $$\mathrm{sim}_{Resnik}(C_1, C_2) = IC_{mis}(C_1, C_2)$$
      The more information two terms share, the more similar they are.
  • 12. Semantic Similarity (IC Based)
      Jiang Measure:
      $$\mathrm{dist}_{Jiang}(C_1, C_2) = IC(C_1) + IC(C_2) - 2\,IC_{mis}(C_1, C_2)$$
      The Jiang measure considers the information content of each term in addition to the shared information content. It is an inverted measure (a distance): the distance between two concepts is the amount of information needed to fully describe both concepts, excluding the information that is common to both of them.
      Lin Measure:
      $$\mathrm{sim}_{Lin}(C_1, C_2) = \frac{2\,IC_{mis}(C_1, C_2)}{IC(C_1) + IC(C_2)}$$
      The Lin measure also considers the information content of each term, but uses it differently from Jiang: it takes a ratio instead of a difference. Since $IC_{mis}(C_1, C_2) \le IC(C_1)$ and $IC_{mis}(C_1, C_2) \le IC(C_2)$, the similarity value is normalized between 1 (most similar concepts) and 0.
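The information content based measures (Resnik, Jiang and Lin) can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the `PARENT` taxonomy and the `COUNT` corpus frequencies are invented, and base-2 logarithms are used because the slides do not fix a base.

```python
import math

# Hypothetical toy IS-A taxonomy and raw term counts (illustrative only).
PARENT = {"entity": None, "software": "entity", "freeware": "software",
          "shareware": "software", "hardware": "entity"}
COUNT = {"entity": 0, "software": 4, "freeware": 2, "shareware": 2, "hardware": 8}

def ancestors(c):
    """Return the chain [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def freq(c):
    """Frequency of c, including every concept that c subsumes."""
    return sum(n for x, n in COUNT.items() if c in ancestors(x))

def ic(c):
    """Information content: -log2 of the concept's probability."""
    return -math.log2(freq(c) / freq("entity"))

def ic_mis(c1, c2):
    """Information content of the most informative shared subsumer."""
    shared = set(ancestors(c1)) & set(ancestors(c2))
    return max(ic(c) for c in shared)

def resnik(c1, c2):
    return ic_mis(c1, c2)

def jiang_distance(c1, c2):
    return ic(c1) + ic(c2) - 2 * ic_mis(c1, c2)

def lin(c1, c2):
    return 2 * ic_mis(c1, c2) / (ic(c1) + ic(c2))

print(resnik("freeware", "shareware"),
      jiang_distance("freeware", "shareware"),
      lin("freeware", "shareware"))
```

With these toy counts, `software` has probability 1/2 and `freeware` has probability 1/8, so the shared information of the two leaves is 1 bit out of 3 bits each.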
  • 13. Semantic Similarity: Correlation with human judgements
      Method        Type   Correlation (WordNet ontology)   Correlation (MeSH ontology)
      Wu & Palmer   Path   0.74                             0.67
      Li            Path   0.82                             0.70
      Leacock       Path   0.82                             0.74
      Resnik        IC     0.79                             0.71
      Lin           IC     0.82                             0.72
      Jiang         IC     0.83                             0.71
  • 14. Information Retrieval: SSRM: IR with semantics ... (0/3)
      VSM Revisited:
      ● Similarity in VSM is the cosine of the angle between the query and document vectors (the normalized inner product):
      $$\mathrm{sim}(q, d) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \cdot \sqrt{\sum_i d_i^2}}$$
      ● Each dimension corresponds to a separate term. $q$ and $d$ are n-dimensional vectors with a weight for each term.
      ● $q_i$ and $d_i$ are the weights of the query and document terms.
      ● The document term weight is $d_i = tf_i \cdot idf_i$.
      Specifically, I will talk about the SSRM algorithm (Semantic Similarity Retrieval Model), where we modify the query term weights to take semantic similarity into account.
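A minimal sketch of the VSM cosine similarity with tf-idf document weights follows. The two tiny documents and the idf definition `log(N / df)` are assumptions chosen for illustration; the slides only state that the document weight is tf times idf.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Map each tokenised document to a {term: tf * idf} dictionary."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(q, d):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0

# Hypothetical documents: the first is about software but never uses the word.
docs = [["application", "program"], ["software", "package"]]
vectors = tf_idf_vectors(docs)
query = {"software": 1.0}

for vec in vectors:
    print(cosine(query, vec))
```

The first document scores 0 even though it is about software, which is the lexical mismatch that motivates SSRM.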
  • 15. Information Retrieval: SSRM: IR with semantics ... (1/3)
      Query Re-weighting:
      ● A query can contain related (semantically similar) terms, e.g. Query: free scientific computing software
      ● We need to re-weight the query terms to stress the particular concept we are searching for:
      $$q_i' = q_i + \sum_{\substack{j \ne i \\ \mathrm{sim}(i, j) \ge t}} q_j \cdot \mathrm{sim}(i, j)$$
      ● $q_i$ and $q_i'$ are the old and new weights respectively.
      ● $i$ and $j$ refer to different terms in the query.
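The re-weighting step can be sketched as follows. The similarity value 0.9 between "computing" and "software" and the threshold `t=0.8` are hypothetical; in practice `sim` would be one of the ontology based measures described earlier.

```python
# Hypothetical pairwise similarities; a real system would compute these
# from an ontology (e.g. with the Wu & Palmer or Lin measure).
PAIR_SIM = {frozenset(("computing", "software")): 0.9}

def sim(i, j):
    """Symmetric term similarity in [0, 1]; unknown pairs default to 0."""
    return 1.0 if i == j else PAIR_SIM.get(frozenset((i, j)), 0.0)

def reweight(query, sim, t=0.8):
    """Add to each term's weight the weights of sufficiently similar query terms."""
    return {
        i: q_i + sum(q_j * sim(i, j)
                     for j, q_j in query.items()
                     if j != i and sim(i, j) >= t)
        for i, q_i in query.items()
    }

query = {"free": 1.0, "scientific": 1.0, "computing": 1.0, "software": 1.0}
reweighted = reweight(query, sim)
print(reweighted)
```

Only "computing" and "software" reinforce each other here, so their weights grow while "free" and "scientific" keep their original weights.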
  • 16. Information Retrieval: SSRM: IR with semantics ... (2/3)
      Query Expansion:
      ● New terms might be semantically similar to the query terms.
      ● We “expand” the query by adding new terms from the neighbourhood of each query term in the ontology. Adding such terms also affects the weights of existing terms:
      $$q_i' = \begin{cases} \displaystyle\sum_{\substack{j \ne i \\ \mathrm{sim}(i, j) \ge T}} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ is a new term} \\[2ex] \displaystyle q_i + \sum_{\substack{j \ne i \\ \mathrm{sim}(i, j) \ge T}} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ had weight } q_i \end{cases}$$
      ● $n$ is the number of hyponyms for each expanded term $j$.
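A rough sketch of the expansion step is shown below. It assumes that the "neighbourhood" of a query term is simply a list of its hyponyms (the hypothetical `HYPONYMS` lookup), that `n` is the length of that list, and that the similarity values and threshold `T=0.8` are made up for the example.

```python
# Hypothetical ontology neighbourhood and similarities (illustrative only).
HYPONYMS = {"software": ["freeware", "shareware", "application", "program"]}
PAIR_SIM = {
    frozenset(("software", "freeware")): 0.9,
    frozenset(("software", "shareware")): 0.9,
    frozenset(("software", "application")): 0.85,
    frozenset(("software", "program")): 0.5,
}

def sim(i, j):
    """Symmetric term similarity in [0, 1]; unknown pairs default to 0."""
    return 1.0 if i == j else PAIR_SIM.get(frozenset((i, j)), 0.0)

def expand(query, hyponyms, sim, T=0.8):
    """Add ontology neighbours of each query term j, weighted by (q_j / n) * sim(i, j)."""
    expanded = dict(query)  # existing terms start from their old weight q_i
    for j, q_j in query.items():
        candidates = hyponyms.get(j, [])
        n = len(candidates)  # assumed meaning of n: number of hyponyms of term j
        for i in candidates:
            s = sim(i, j)
            if i != j and s >= T:
                # new terms start from 0, existing terms from q_i
                expanded[i] = expanded.get(i, 0.0) + q_j / n * s
    return expanded

expanded = expand({"software": 1.0}, HYPONYMS, sim)
print(expanded)
```

The term "program" falls below the threshold and is left out, while the other three hyponyms enter the query with small weights.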
  • 17. Information Retrieval: SSRM: IR with semantics ... (3/3)
      Document Similarity:
      ● Once we have the expanded and re-weighted query vector and the tf-idf document vectors, we calculate the similarity between query $q$ and document $d$ as:
      $$\mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{sim}(i, j)}{\sum_i \sum_j q_i \cdot d_j}$$
      ● Properties:
          ● Symmetric.
          ● Normalized in [0, 1].
          ● Consistent behaviour.
          ● Can be easily tweaked for document-document similarity.
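The SSRM query-document similarity can be sketched directly from the formula. The term weights and the similarity value 0.9 are hypothetical; `sim` stands in for any term similarity measure with values in [0, 1].

```python
# Hypothetical term similarities (illustrative only).
PAIR_SIM = {frozenset(("software", "freeware")): 0.9}

def sim(i, j):
    """Symmetric term similarity in [0, 1]; unknown pairs default to 0."""
    return 1.0 if i == j else PAIR_SIM.get(frozenset((i, j)), 0.0)

def ssrm_similarity(q, d, sim):
    """Weighted average of sim(i, j) over all query-term / document-term pairs."""
    num = sum(qi * dj * sim(i, j) for i, qi in q.items() for j, dj in d.items())
    den = sum(qi * dj for qi in q.values() for dj in d.values())
    return num / den if den else 0.0

query = {"software": 1.0}
document = {"freeware": 2.0, "license": 1.0}
print(ssrm_similarity(query, document, sim))
```

Unlike the cosine in VSM, the document gets a non-zero score although it shares no term with the query, because "freeware" is semantically close to "software".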
  • 19. Information Retrieval: SSRM Implementation Notes
      ● Quadratic time complexity in the number of terms, as opposed to linear for VSM.
      ● The similarity between every pair of terms can be cached (hashed).
      ● It is expensive to expand and re-weight the document vectors as well, so only the queries are re-weighted and expanded. Expanding one of the two vectors should incorporate enough semantic information.
      ● The thresholds (t, T) need to be adjusted for optimal behaviour.
      ● Although the behaviour of SSRM is consistent, SSRM won't result in sim(d, d) = 1, i.e. even an exact search won't give a similarity value of 1.
      ● I proposed the following formula last summer, and the results on MeSH were quite satisfactory:
      $$\mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{maxsim}_i}{\sum_i \sum_j q_i \cdot d_j}, \quad \text{where } \mathrm{maxsim}_i = \max_j \mathrm{sim}(i, j)$$
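The difference between plain SSRM and the maxsim variant on an exact match can be checked with a small sketch. The toy similarity function (1 for identical terms, 0.2 otherwise) is an assumption, and `functools.lru_cache` is used here as one possible way to cache pairwise similarities.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # cache each (i, j) pair so it is computed only once
def sim(i, j):
    """Toy similarity (assumed values): 1 for identical terms, 0.2 otherwise."""
    return 1.0 if i == j else 0.2

def ssrm_similarity(q, d):
    """Plain SSRM: weighted average of sim(i, j) over all term pairs."""
    num = sum(qi * dj * sim(i, j) for i, qi in q.items() for j, dj in d.items())
    den = sum(qi * dj for qi in q.values() for dj in d.values())
    return num / den if den else 0.0

def maxsim_similarity(q, d):
    """Variant: replace sim(i, j) by maxsim_i = max over document terms j of sim(i, j)."""
    if not q or not d:
        return 0.0
    maxsim = {i: max(sim(i, j) for j in d) for i in q}
    num = sum(qi * dj * maxsim[i] for i, qi in q.items() for dj in d.values())
    den = sum(qi * dj for qi in q.values() for dj in d.values())
    return num / den if den else 0.0

document = {"software": 1.0, "license": 1.0}
print(ssrm_similarity(document, document))
print(maxsim_similarity(document, document))
```

Comparing a document with itself, plain SSRM is pulled below 1 by the dissimilar cross pairs, whereas the maxsim variant returns 1 because every term finds itself in the document.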
  • 20. Experimental Results [plots: IR on OHSUMED using MeSH; IR on the web using WordNet]
  • 21. Future ...
      Possible Issues
      ● Negation: Query: I like pizza. Match: I don't like pizza.
      ● Antonymy: Query: Slow runner. Match: Fast runner.
      ● Role Reversal: Query: Dog bites man. Match: Man bites dog.
      Further reading
      ● Groupwise Semantic Similarity
          ● Jaccard Index
          ● simLP, simUI, simGIC
      ● Statistical Semantic Similarity
          ● LSA: Latent Semantic Analysis
          ● NGD: Normalized Google Distance
          ● PMI: Pointwise Mutual Information
  • 22. References
      ● A comparative study of ontology based term similarity measures on PubMed document clustering [Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, Xiaohua Zhou] [2007].
      ● Information retrieval by semantic similarity [A. Hliaoutakis, G. Varelas, E. Voutsakis] [2006].
      ● Semantic similarity based on corpus statistics and lexical taxonomy [Jay J. Jiang, David W. Conrath] [1997].
      Thank you!