Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval
Debasis Ganguly*, Dwaipayan Roy+, Mandar Mitra+, Gareth Jones*
*ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland
+CVPR Unit, Indian Statistical Institute, Kolkata, India

Introduction
Obtaining an efficient embedded representation of a composed unit of text (such as a document or a query) for retrieval is a difficult problem.
We introduce a set-based embedded representation to exploit word embeddings for information retrieval.
Set-distance-based measures are applied to obtain the similarity between query and document.

Beyond Bag of Words Model
A word embedding technique based on a Recurrent Neural Network (RNN) represents every word of a collection as a vector in an abstract space of N dimensions.
A document is a set of points (word vectors) in the abstract space.
Like the ‘Bag of Words’ (BoW) representation, the embeddings of each of the words create a ‘Bag of Vectors’ (BoV) representation of the document.
$|d|$ = number of unique terms in document $d$
$w_i$ = i-th unique word of document $d$
$v_i$ = embedded vector of the i-th unique word of $d$
$$BOW_d = \{w_i\}_{i=1}^{|d|}, \qquad BOV_d = \{v_i\}_{i=1}^{|d|}$$

Document as Mixture Distributions
Each concept of the document generates a set of points that are similar in the abstract space.
Thus a document is a mixture of probability density functions (e.g. Gaussians of dimension p) that generates the observed query terms.
Let each term $w$ of the vocabulary be associated with a latent variable $z_w$, which denotes the concept of $w$.
$z_w$ is an integer between 1 and $K$, the number of concepts, i.e. the number of Gaussians in the mixture distribution.
The $z_w$ can be estimated using clustering algorithms such as K-Means on the set of all $v_i$ of the vocabulary.

Query Likelihood in Abstract Space
$\mu_{i,d}$: centroid of the points of $d$ generated by the same Gaussian $z_i$.
$C_d = \{\mu_{i,d}\}_{i=1}^{K}$: set of centroids of the points of document $d$, one per Gaussian $i$.
Posterior likelihood of $Q$, sampled from a mixture model of $K$ Gaussians centered around the $\mu_{i,d}$:
$$P_{WVEC}(d|Q) = sim(Q, d) = \frac{1}{K|Q|} \sum_{q \in Q} \sum_{i=1}^{K} q \cdot \mu_{i,d}$$

Combination of Text and Vector Likelihood
$$P(d|Q) = \prod_{q \in Q} \left[ \alpha \cdot P_{LM}(d|q) + (1-\alpha) \cdot P_{WVEC}(d|q) \right]$$
$\alpha \in [0,1]$: weight that balances the individual beliefs of text-based similarity and embedding-based similarity.

Evaluation
λ and α are both empirically set to 0.4.
K, the number of clusters for K-Means, is empirically set to 100.
Word vectors are embedded in a 200-dimensional space with negative sampling of 5 words on the continuous bag-of-words model.

Topic Set   Method               MAP      gMAP     Recall
TREC-6      LM                   0.2363   0.0914   0.5100
TREC-6      LM+wvsim(one-clus)   0.2355   0.0918   0.5058
TREC-6      LM+wvsim(no-clus)    0.2259   0.0827   0.5000
TREC-6      LM+wvsim(kmeans)     0.2345   0.0906   0.5027
TREC-7      LM                   0.1787   0.0831   0.4882
TREC-7      LM+wvsim(one-clus)   0.1773   0.0851   0.4897
TREC-7      LM+wvsim(no-clus)    0.1664   0.0803   0.4863
TREC-7      LM+wvsim(kmeans)     0.1756   0.0874   0.4916
TREC-8      LM                   0.2462   0.1384   0.5932
TREC-8      LM+wvsim(one-clus)   0.2541   0.1465   0.6017
TREC-8      LM+wvsim(no-clus)    0.2473   0.1396   0.5994
TREC-8      LM+wvsim(kmeans)     0.2558   0.1468   0.6017
Robust      LM                   0.2698   0.1724   0.7935
Robust      LM+wvsim(one-clus)   0.2690   0.1701   0.7905
Robust      LM+wvsim(no-clus)    0.2642   0.1646   0.7900
Robust      LM+wvsim(kmeans)     0.2804   0.1819   0.8010
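
For readers unfamiliar with the MAP and gMAP columns, the following is a minimal sketch of how such figures are typically derived from per-topic average precision. It is not the evaluation code used for the poster. The function names, the toy rankings, and the `floor` value of 1e-5 (which stops one zero-AP topic from collapsing the geometric mean) are assumptions made for illustration.

```python
import math

def average_precision(ranked_doc_ids, relevant_ids):
    """AP for one topic: sum of precision at each relevant rank, divided by
    the total number of relevant documents for the topic."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)

def mean_average_precision(ap_values):
    """MAP: arithmetic mean of the per-topic AP values."""
    return sum(ap_values) / len(ap_values)

def geometric_map(ap_values, floor=1e-5):
    """gMAP: geometric mean of per-topic AP; `floor` is an assumed lower bound."""
    logs = [math.log(max(ap, floor)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))

# Toy rankings and relevance judgements (made up for illustration).
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6", "d7", "d8"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d8"}}

aps = [average_precision(runs[t], qrels[t]) for t in ("t1", "t2")]
print(mean_average_precision(aps), geometric_map(aps))
```

Because the geometric mean is pulled down by low-scoring topics, gMAP rewards methods that help the hardest queries, which is why it is reported alongside MAP.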
For the TREC-8 and Robust query sets, a significant improvement is achieved over simple text-based similarity.
The result with K=1 is quite close in performance to that with K=100.
Performance on TREC-6 and TREC-7 degrades compared to text-based similarity.

The research is supported by Science Foundation Ireland (SFI) as a part of the ADAPT Centre at DCU (Grant No: 13/RC/2106) and by a grant under the SFI ISCA India consortium.
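
The scoring pipeline described on the poster can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It clusters the vocabulary's word vectors with a plain K-Means to obtain a concept id per word, and averages a document's word vectors within each concept to get its centroids. It then scores a query term by its mean dot product with those centroids and mixes that score with a language-model probability using α. The helper names, the 2-D toy vectors, and the language-model probabilities are hypothetical. Multiplying the mixed per-term scores across query terms, and dividing by K even when a document covers fewer than K concepts, are also assumptions of the sketch.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's K-Means; returns one cluster id (0..k-1) per input vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        for idx, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            labels[idx] = dists.index(min(dists))
        # Update step: each non-empty cluster's centre becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = mean_vector(members)
    return labels

def document_centroids(doc_words, word_vec, z):
    """One centroid per concept id that occurs among the unique words of the document."""
    groups = {}
    for w in set(doc_words):
        if w in word_vec:
            groups.setdefault(z[w], []).append(word_vec[w])
    return [mean_vector(vs) for vs in groups.values()]

def p_wvec(q_vec, centroids, k):
    """Per-term embedding score: sum of dot products with the centroids, divided by K."""
    return sum(dot(q_vec, mu) for mu in centroids) / k

def combined_score(query, p_lm, word_vec, centroids, k, alpha=0.4):
    """Product over query terms of alpha * P_LM + (1 - alpha) * P_WVEC."""
    score = 1.0
    for q in query:
        score *= alpha * p_lm[q] + (1 - alpha) * p_wvec(word_vec[q], centroids, k)
    return score

# Toy 2-D "embeddings" (made up for illustration).
word_vec = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "bus": [0.1, 0.9]}
K = 2
vocab = sorted(word_vec)
z = dict(zip(vocab, kmeans([word_vec[w] for w in vocab], K)))

c_animals = document_centroids(["cat", "dog", "cat"], word_vec, z)
c_vehicles = document_centroids(["car", "bus"], word_vec, z)

# Hypothetical smoothed language-model probabilities for the query term "cat".
query = ["cat"]
score_animals = combined_score(query, {"cat": 0.30}, word_vec, c_animals, K)
score_vehicles = combined_score(query, {"cat": 0.01}, word_vec, c_vehicles, K)
print(score_animals > score_vehicles)
```

In practice the vectors would come from a trained embedding model and the language-model probabilities from a smoothed retrieval model. A library K-Means would normally replace the hand-written one, which is kept here only so the example has no dependencies.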
