Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval

Debasis Ganguly*, Dwaipayan Roy+, Mandar Mitra+, Gareth Jones*
+CVPR Unit, Indian Statistical Institute, Kolkata, India
*ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland
Introduction

Obtaining an efficient embedded representation of a composed unit of text (such as a document or a query) for retrieval is a difficult problem. We introduce a set-based embedded representation that exploits word embeddings for information retrieval. Set-distance based measures are applied to obtain the similarity between query and document.

Beyond Bag of Words Model

A word embedding technique based on a Recurrent Neural Network (RNN) represents every word of a collection as a vector in an abstract space of N dimensions. A document is then a set of points (word vectors) in this abstract space. Analogous to the 'Bag of Words' (BoW) representation, the embeddings of the words of a document create a 'Bag of Vectors' (BoV) representation of the document:

BoW_d = {w_i}_{i=1}^{|d|},    BoV_d = {v_i}_{i=1}^{|d|}

where |d| = number of unique terms in document d, w_i = the i-th unique word of d, and v_i = the embedded vector of the i-th unique word of d.

Query Likelihood in Abstract Space

Let c_{d,i} be the centroid of the points of d generated by the same Gaussian z_i, and let C_d = {c_{d,i}}_{i=1}^{K} be the set of centroids of the points of document d, one per Gaussian. The posterior likelihood of Q, sampled from the K-component mixture of Gaussians centred around the c_{d,i}, is

P_WVEC(d|Q) = (1 / (|Q| K)) Σ_{q ∈ Q} Σ_{i=1}^{K} sim(q, c_{d,i})

Combination of Text and Vector Likelihood

The text-based language-model likelihood and the vector-based likelihood are combined linearly:

P(d|Q) = Σ_{q ∈ Q} [ λ · P_LM(q|d) + (1 − λ) · P_WVEC(q|d) ]

where P_WVEC(q|d) = (1/K) Σ_{i=1}^{K} sim(q, c_{d,i}) is the per-term factor of the likelihood above, and λ ∈ [0, 1] weights the individual belief in the text-based similarity against the embedding-based similarity.
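The combined text-plus-vector scoring above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the vocabulary, the Jelinek-Mercer smoothing for P_LM, and the use of cosine similarity for sim(q, c_{d,i}) are assumptions made for the example.

```python
# Toy sketch of score(d, Q) = sum over q of
#   lambda * P_LM(q|d) + (1 - lambda) * P_WVEC(q|d).
# The smoothing scheme and cosine similarity are illustrative assumptions.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def p_lm(q, doc_terms, collection_terms, mu=0.6):
    """Jelinek-Mercer smoothed unigram language model (assumed smoothing)."""
    tf = doc_terms.count(q) / max(len(doc_terms), 1)
    cf = collection_terms.count(q) / max(len(collection_terms), 1)
    return mu * tf + (1 - mu) * cf

def p_wvec(q_vec, centroids):
    """Average similarity of a query-term vector to the K document centroids."""
    sims = [cosine(q_vec, c) for c in centroids]
    return sum(sims) / len(sims)

def score(query, doc_terms, collection_terms, embeddings, centroids, lam=0.4):
    """Per-term linear combination of text and vector likelihoods."""
    total = 0.0
    for q in query:
        total += lam * p_lm(q, doc_terms, collection_terms)
        total += (1 - lam) * p_wvec(embeddings[q], centroids)
    return total
```

With lam = 0.4, as in the poster's evaluation settings, a document whose terms and centroids both match the query scores higher than one matching on neither.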
Document as Mixture Distributions

Each concept of the document generates a set of points that are similar (close) in the abstract space. Thus a document is a mixture of probability density functions (e.g., Gaussians of dimension p) that generates the observed query terms. Let each term w of the vocabulary be associated with a latent variable z_w, which denotes the concept of w. z_w is an integer between 1 and K, the number of concepts, i.e., the number of Gaussians in the mixture distribution. The z_w's can be estimated by applying a clustering algorithm such as K-means to the set of all v_i of the vocabulary.

Evaluation

λ and α were both empirically set to 0.4.
K, the number of clusters for K-means, was empirically set to 100.
Word vectors were embedded in a 200-dimensional space, trained with the continuous bag-of-words model and negative sampling of 5 words.

Metrics

Topic Set | Method              | MAP    | gMAP   | Recall
TREC-6    | LM                  | 0.2363 | 0.0914 | 0.5100
TREC-6    | LM+wvsim(one-clus)  | 0.2355 | 0.0918 | 0.5058
TREC-6    | LM+wvsim(no-clus)   | 0.2259 | 0.0827 | 0.5000
TREC-6    | LM+wvsim(kmeans)    | 0.2345 | 0.0906 | 0.5027
TREC-7    | LM                  | 0.1787 | 0.0831 | 0.4882
TREC-7    | LM+wvsim(one-clus)  | 0.1773 | 0.0851 | 0.4897
TREC-7    | LM+wvsim(no-clus)   | 0.1664 | 0.0803 | 0.4863
TREC-7    | LM+wvsim(kmeans)    | 0.1756 | 0.0874 | 0.4916
TREC-8    | LM                  | 0.2462 | 0.1384 | 0.5932
TREC-8    | LM+wvsim(one-clus)  | 0.2541 | 0.1465 | 0.6017
TREC-8    | LM+wvsim(no-clus)   | 0.2473 | 0.1396 | 0.5994
TREC-8    | LM+wvsim(kmeans)    | 0.2558 | 0.1468 | 0.6017
Robust    | LM                  | 0.2698 | 0.1724 | 0.7935
Robust    | LM+wvsim(one-clus)  | 0.2690 | 0.1701 | 0.7905
Robust    | LM+wvsim(no-clus)   | 0.2642 | 0.1646 | 0.7900
Robust    | LM+wvsim(kmeans)    | 0.2804 | 0.1819 | 0.8010
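The wvsim(kmeans) runs in the table above rely on estimating the concept z_w of each vocabulary term by K-means and then forming per-document centroids c_{d,i}. A minimal sketch of that step follows; the toy 2-D "word vectors", K = 2, and the helper names are illustrative assumptions.

```python
# Sketch: cluster vocabulary word vectors to get concept labels z_w,
# then compute the centroid of a document's vectors within each concept.
import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Plain K-means on the vocabulary vectors; returns a label z_w per row."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest center.
        labels = np.argmin(
            np.linalg.norm(vectors[:, None] - centers[None], axis=2), axis=1)
        # Recompute centers from the current assignment.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = vectors[labels == j].mean(axis=0)
    return labels

def doc_centroids(doc_words, embeddings, labels, vocab):
    """Centroid c_{d,i} of the document's word vectors for each concept i."""
    groups = {}
    for w in doc_words:
        idx = vocab.index(w)
        groups.setdefault(labels[idx], []).append(embeddings[idx])
    return {i: np.mean(vs, axis=0) for i, vs in groups.items()}
```

For example, with four toy vectors forming two obvious clusters, the animal terms and the vehicle terms receive different concept labels, and a document mentioning one word from each concept yields two centroids.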
For the TREC-8 and Robust query sets, a significant improvement is achieved over simple text-based similarity. The result with K = 1 is quite close in performance to K = 100. Performance for TREC-6 and TREC-7 degrades compared with text-based similarity.

The research is supported by Science Foundation Ireland (SFI) as a part of the ADAPT centre at DCU (Grant No: 13/RC/2106) and by a grant under the SFI ISCA India consortium.