This document outlines Shatabdi Kundu's project on context based search using probabilistic topic models and WordNet. The project uses Latent Dirichlet Allocation to discover topics in a collection of documents and WordNet to identify the semantic meaning of the topics based on lexical relations between words. The results showed 7 hidden topics identified in a test collection along with their labels discovered through WordNet. Future work will apply this modeling to search by topic and context to improve geo-intent based information retrieval.
1. Context Based Search
By
Shatabdi Kundu (2010EET2553)
Computer Technology, M.Tech
IIT Delhi
Email ID: shatabdikundu@live.com
Project Guide:
Prof. Santanu Chaudhury
Electrical Engineering Department
IIT Delhi
Email ID: santanuc@ee.iitd.ac.in
Shatabdi Kundu :: 2010EET2553 Prof.Santanu Chaudhury 09 MAY 2011 1of 16
2. Outline
Introduction to Topic Models: Probabilistic Modelling
Latent Dirichlet Allocation
Topic Discovery using WordNet
Work Done
Results
Conclusion and Future Work
References
3. Probabilistic Modelling
Treat data as observations that arise from a generative
probabilistic process that includes hidden variables
For documents, the hidden variables reflect the thematic
structure of the collection
Infer the hidden structure using posterior inference
What are the topics that describe this collection?
Situate new data into the estimated model
How does this query or new document fit into the estimated
topic structure?
5. Generative Process
Cast these intuitions into a generative probabilistic process
Each document is a random mixture of corpus-wide topics
Each word is drawn from one of those topics
6. Graphical Models
Nodes are random variables
Edges denote possible dependence
Observed variables are shaded
Plates denote replicated structure
7. Graphical Models
Structure of the graph defines the pattern of conditional
dependence between the ensemble of random variables.
E.g., this graph corresponds to
\[
p(y, x_1, \ldots, x_N) = p(y) \prod_{n=1}^{N} p(x_n \mid y) \qquad (1)
\]
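As a minimal numerical sketch of the factorization in equation (1), assume a binary variable y and N binary observations that are conditionally independent given y. The probability tables are made-up illustrative values, not taken from the slides.

```python
from itertools import product
from math import prod

# Hypothetical probability tables (illustrative values only).
p_y = {0: 0.6, 1: 0.4}                 # p(y)
p_x_given_y = {
    0: {0: 0.9, 1: 0.1},               # p(x_n | y = 0)
    1: {0: 0.3, 1: 0.7},               # p(x_n | y = 1)
}

def joint(y, xs):
    """p(y, x_1..x_N) = p(y) * product over n of p(x_n | y)."""
    return p_y[y] * prod(p_x_given_y[y][x] for x in xs)

# A valid joint distribution sums to 1 over all configurations (here N = 3).
N = 3
total = sum(joint(y, xs) for y in p_y for xs in product([0, 1], repeat=N))
print(joint(1, (1, 1, 0)), total)
```

The single shared parent y is what lets the joint be written as one prior term times N identical conditional terms.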
8. Latent Dirichlet Allocation
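1 Draw each topic $\beta_k \sim \mathrm{Dir}(\eta)$, for $k \in \{1, \ldots, K\}$
2 For each document $d$:
   1 Draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$
   2 For each word $n$:
      1 Draw $Z_{d,n} \sim \mathrm{Mult}(\theta_d)$
      2 Draw $W_{d,n} \sim \mathrm{Mult}(\beta_{Z_{d,n}})$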
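The generative steps above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the project's code: the vocabulary, the symmetric hyperparameter values and the function names are hypothetical, and every document is given the same length N for simplicity.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw one index from a discrete distribution (a single Mult trial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(vocab, K, D, N, eta, alpha, seed=0):
    rng = random.Random(seed)
    V = len(vocab)
    # Step 1: draw each topic beta_k ~ Dir(eta).
    beta = [sample_dirichlet([eta] * V, rng) for _ in range(K)]
    docs = []
    for _ in range(D):
        # Step 2.1: draw topic proportions theta_d ~ Dir(alpha).
        theta = sample_dirichlet([alpha] * K, rng)
        words = []
        for _ in range(N):
            z = sample_categorical(theta, rng)    # Z_{d,n} ~ Mult(theta_d)
            w = sample_categorical(beta[z], rng)  # W_{d,n} ~ Mult(beta_{Z_{d,n}})
            words.append(vocab[w])
        docs.append(words)
    return beta, docs

# Hypothetical six-word vocabulary, for illustration only.
vocab = ["tree", "maple", "river", "bank", "loan", "money"]
beta, docs = generate_corpus(vocab, K=2, D=3, N=8, eta=0.5, alpha=0.5)
for doc in docs:
    print(" ".join(doc))
```

Only the words are observed in practice; the topics, the proportions and the assignments are the hidden variables that inference has to recover.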
9. Latent Dirichlet Allocation
From a collection of documents, infer
Per-word topic assignments $Z_{d,n}$
Per-document topic proportions $\theta_d$
Per-corpus topic distributions $\beta_k$
Use posterior expectations to perform the task at hand, e.g.
information retrieval, document similarity, etc.
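The slides do not say which inference algorithm was used. As one common choice, the sketch below uses collapsed Gibbs sampling to infer the per-word assignments and then derives point estimates of the topic proportions and topic distributions from the final counts. The function name, the toy corpus and the hyperparameter values are assumptions made for illustration.

```python
import random

def gibbs_lda(docs, V, K, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids in [0, V)."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total word count per topic
    z = []                              # topic assignment for every word
    for d, doc in enumerate(docs):      # random initialisation
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]             # remove the current assignment
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional is proportional to
                # (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta).
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + eta) / (n_k[j] + V * eta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights, k=1)[0]
                z[d][n] = k             # record the new assignment
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Point estimates of theta_d and beta_k given the final assignments.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    beta = [[(n_kw[k][w] + eta) / (n_k[k] + V * eta) for w in range(V)]
            for k in range(K)]
    return z, theta, beta

# Hypothetical toy corpus: word ids 0-2 and 3-5 never share a document.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
z, theta, beta = gibbs_lda(docs, V=6, K=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Variational inference, as in the original LDA paper, is an equally valid alternative; Gibbs sampling is used here only because it fits in a short, dependency-free example.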
10. Topic Discovery using WordNet
Lexical relations are used to find the latent topics
synsets (synonym sets) as basic units
hyponymy
an "is a kind of" semantic relation between word meanings
E.g. {maple} is a hyponym of {tree}
hypernymy
the inverse of hyponymy
E.g. {tree} is a hypernym of {maple}
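The two relations can be illustrated with a small hand-written taxonomy. This is a deliberately simplified stand-in: the dictionary below is hypothetical and is not real WordNet data, which contains many more intermediate synsets.

```python
# Hand-written toy taxonomy: child -> parent (hypernym). Not real WordNet data.
HYPERNYM = {
    "maple": "tree",
    "oak": "tree",
    "tree": "plant",
    "rose": "flower",
    "flower": "plant",
    "plant": "organism",
}

def hypernym_chain(word):
    """Return the word followed by its successive hypernyms up to the root."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def hyponyms(word):
    """Return the direct hyponyms (children) of a word, sorted by name."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == word)

print(hypernym_chain("maple"))  # ['maple', 'tree', 'plant', 'organism']
print(hyponyms("tree"))         # ['maple', 'oak']
```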
11. Work Done
I took a collection of 10 documents containing around 28K words
in total.
I removed stop words, rare words, punctuation marks and
numbers.
I then fitted a 7-topic LDA model to this corpus.
This gave 7 topics, each represented by its 5 most probable
words.
I then used the lexical relations of WordNet to identify the
hidden topics, taking the common parents (hypernyms) of all the
words in each topic.
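The preprocessing and top-word steps described above might look roughly like the sketch below. The stop-word list, the rare-word threshold `min_count` and both function names are assumptions for illustration; the slides do not specify them.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def preprocess(raw_docs, min_count=2):
    """Lowercase, keep alphabetic tokens only, drop stop words and rare words."""
    # [a-z]+ keeps runs of letters, which discards punctuation and numbers.
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in raw_docs]
    tokenised = [[t for t in doc if t not in STOP_WORDS] for doc in tokenised]
    counts = Counter(t for doc in tokenised for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenised]

def top_words(topic_probs, vocab, k=5):
    """Return the k most probable words of one topic distribution."""
    order = sorted(range(len(vocab)), key=lambda i: topic_probs[i], reverse=True)
    return [vocab[i] for i in order[:k]]

raw_docs = ["The maple is a tree, planted in 2010.", "A tree and an oak tree!"]
print(preprocess(raw_docs))                                # [['tree'], ['tree', 'tree']]
print(top_words([0.1, 0.5, 0.4], ["a", "b", "c"], k=2))    # ['b', 'c']
```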
12. Results after training LDA model
The LDA model only groups related words into topics; it
does not name the topics.
Topic names are discovered using WordNet.
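One way to turn a topic's top words into a name is to take their lowest common hypernym. The sketch below assumes a hand-written toy taxonomy in place of WordNet, so the dictionary contents and the function names are hypothetical and the labels are only illustrative.

```python
# Hand-written toy taxonomy: child -> parent (hypernym). Not real WordNet data.
HYPERNYM = {
    "maple": "tree", "oak": "tree", "tree": "plant",
    "rose": "flower", "flower": "plant", "plant": "organism",
    "dog": "animal", "animal": "organism",
}

def ancestors(word):
    """Return the word itself and all of its hypernyms, nearest first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def label_topic(words):
    """Return the lowest node shared by every word's chain, or None."""
    common = set(ancestors(words[0]))
    for w in words[1:]:
        common &= set(ancestors(w))
    # Walk up from the first word; the first shared node is the lowest one.
    for candidate in ancestors(words[0]):
        if candidate in common:
            return candidate
    return None

print(label_topic(["maple", "oak"]))          # tree
print(label_topic(["maple", "rose", "oak"]))  # plant
print(label_topic(["maple", "dog"]))          # organism
```

Because each word counts as a member of its own chain, a topic such as ["tree", "maple"] is labelled "tree"; words missing from the taxonomy yield None.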
13. Results after applying WordNet
The above result gives the names of the hidden topics underlying
the words in the documents.
This kind of model can be used to identify the topic when
given only a word.
14. Conclusion and Future Work
Next, we will work on topic-based (context-based) search
using this model.
In particular, we will determine the geo-intent of queries and
decide which topic each query belongs to, for better retrieval of
information.
15. References
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
Jun Fu Cai, Wee Sun Lee, and Yee Whye Teh. NUS-ML: Improving Word Sense Disambiguation Using Topic Features. SemEval, 2007.
David M. Blei and Jon D. McAuliffe. Supervised Topic Models. NIPS, 2007.
WordNet. http://www.shiffman.net/teaching/a2z/wordnet