The document introduces mathematical concepts commonly used in information retrieval and summarizes several models: vector space models, inductive models such as probabilistic models and neural networks, and hybrid models. It gives examples of how vector space and probabilistic models can be used for applications like document clustering, targeted advertising, and part-of-speech tagging. It also notes some uncertainties and pitfalls with these models.
Mathematical Concepts for IR Models
1. Mathematical Roadmap for IR
Introduction to the mathematical concepts
commonly used in IR/ST
- Abhay Shete, 42
2. Why bother about the maths ?
● Lots of open source libraries out there
● General approach for external software
– Include it
– Build it
– Forget it !!
3. IR is fuzzy
● Consider the following sentences
– India beat Pakistan
– Sachin has scored over 10,000 runs in Test cricket
– Sachin Tendulkar was unable to run to the non-
striker's end
● How do we humans interpret this information ?
● How do machines interpret it ?
4. What excites me about IR ?
● It's all about indulging and flirting with sexy models !!!!
5. Models of interest
● Vector space models
● Inductive models
– Probabilistic
– Neural networks, decision trees
● Hybrid – vector space and inductive
● Others...
6. Vector Space Model
● Recall junior college mathematics
● Dot-product, cross-product, projections, etc.
7. Vector Space Model
● A1, A2 are articles about animals
● P1 contains politics
● To find similarity scores between
– a) A1, A2 ==> assume score is S1
– b) A1, P1 ==> assume score is S2
● Intuitively S1 > S2
● How do you create a model around this ?
8. Vector Space Model
● A1 = {dog(1), cat(2), lion(1), rhino(1), tiger(2)}
● A2 = {eagle(1), cat(1), tiger(1), fox(1), hyena(1)}
● N Dimensional Space !!!
● d1 = 1(dog) + 2(cat) + 1(lion) + 1(rhino) + 2(tiger) + 0(eagle) + 0(fox) + 0(hyena)
● d2 = 0(dog) + 1(cat) + 0(lion) + 0(rhino) + 1(tiger) + 1(eagle) + 1(fox) + 1(hyena)
● Similarity is measured by the angle between the two vectors !!
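A minimal sketch of how the two articles above could be turned into vectors over one shared vocabulary. The names `VOCABULARY` and `to_vector` are made up for this illustration, and the counts of 1 for fox and hyena are read off the d2 line:

```python
# Term counts for the two animal articles, as given on the slide.
A1 = {"dog": 1, "cat": 2, "lion": 1, "rhino": 1, "tiger": 2}
A2 = {"eagle": 1, "cat": 1, "tiger": 1, "fox": 1, "hyena": 1}

# One axis per distinct term; the order only has to be the same for every document.
VOCABULARY = ["dog", "cat", "lion", "rhino", "tiger", "eagle", "fox", "hyena"]


def to_vector(counts, vocabulary):
    """Map a {term: count} dictionary onto the axes of the shared vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]


d1 = to_vector(A1, VOCABULARY)
d2 = to_vector(A2, VOCABULARY)
```

Every term missing from an article simply gets a 0 on its axis, which is why both vectors have 8 components even though each article uses only 5 terms.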
9. Similarity measured by the angle between two vectors !!
If X = <xi> and Y = <yi> are two n-dimensional vectors, the angle θ between them satisfies:
X · Y = |X| |Y| cos θ
cos θ = (X · Y) / (|X| |Y|)
where X · Y = x1 y1 + x2 y2 + ... + xn yn
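The formula can be tried on the two article vectors d1 and d2 from the previous slide. This is a plain-Python sketch; the function name `cosine_similarity` is chosen here for illustration:

```python
import math


def cosine_similarity(x, y):
    """Return cos(theta) for two equal-length vectors: dot product over product of norms."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Axes: dog, cat, lion, rhino, tiger, eagle, fox, hyena
d1 = [1, 2, 1, 1, 2, 0, 0, 0]
d2 = [0, 1, 0, 0, 1, 1, 1, 1]

similarity = cosine_similarity(d1, d2)  # 4 / sqrt(55), roughly 0.54
```

Only cat and tiger appear in both articles, so the dot product is 2·1 + 2·1 = 4. A score of 1 would mean the vectors point the same way, and 0 would mean the articles share no terms.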
11. Linear Algebra
● Vectors and matrices are closely related: a set of document vectors stacks into a matrix
● Important Linear Algebra concepts:
– Singular Value Decomposition
– Eigenvectors
● Latent Semantic Indexing (LSI)
● Face recognition
12. Latent Semantic Indexing
● Exact keyword match not required
● No predefined semantic knowledge base
● Demo
– http://lsi.research.telcordia.com/lsi-bin/lsiQuery
● How does it work ?
– Fish Tank Analogy
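A sketch of the usual algebra behind LSI, assuming the term-document matrix A is built from count vectors like d1 and d2. The symbols A, U, Σ, V, k and q are notation chosen here, not taken from the slides:

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T} \quad (k \ll \operatorname{rank}(A)), \qquad
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Keeping only the k largest singular values maps terms and documents into a k-dimensional space. A query vector q is folded into that space as q̂ and compared with the documents (the rows of V_k) by the cosine of the angle between them, which is why an exact keyword match is not required.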
13. Uncertainties, Pitfalls
● Not suitable for fine grained search
● Web Scale corpus !!
● Does anybody find Google suggestions really
helpful ?
14. Face recognition application
● Eigenvalues, covariance and SVD
● Intuitive Understanding
– Create a vector space from the pixels modelling the
face
– Find “average” face vector
– Find the specific features of every face by computing
its deviation from the average vector
– Take the top N specific features of this vector space
and create the “classifier” vector
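The steps above can be sketched on toy data. The four 4-pixel "faces" below are invented for illustration. Power iteration is swapped in for a full eigen-decomposition and finds only the single strongest feature (N = 1), and all function names are hypothetical:

```python
import math


def mean_vector(vectors):
    """Step 2: the 'average' face, computed pixel by pixel."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def deviation(vector, mean):
    """Step 3: how one face differs from the average face."""
    return [a - b for a, b in zip(vector, mean)]


def covariance(deviations):
    """Pixel-by-pixel covariance matrix of the mean-centred faces."""
    n, dim = len(deviations), len(deviations[0])
    return [[sum(d[i] * d[j] for d in deviations) / n for j in range(dim)]
            for i in range(dim)]


def top_eigenvector(matrix, iterations=100):
    """Step 4 with N = 1: power iteration towards the dominant eigenvector."""
    dim = len(matrix)
    # Non-uniform start so it is unlikely to be orthogonal to the answer.
    v = [float(i + 1) for i in range(dim)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # the start vector lies in the null space; give up
        v = [x / norm for x in w]
    return v


# Step 1: hypothetical faces, each flattened to a vector of 4 pixel values.
faces = [[8, 2, 5, 5], [2, 8, 5, 5], [9, 1, 5, 5], [1, 9, 5, 5]]

average = mean_vector(faces)
deviations = [deviation(face, average) for face in faces]
component = top_eigenvector(covariance(deviations))

# Each face is summarised by one number: its weight along the top feature.
weights = [sum(d * c for d, c in zip(dev, component)) for dev in deviations]
```

In this toy set the first two pixels vary in opposite directions while the other two never change, so the top feature puts all its weight on pixels 0 and 1. A real system would keep several such features and compare faces by their weight vectors.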
16. Inductive Models
● Require prior information (Training set) to
extrapolate to future unseen cases.
● Create a “function” from the input training data
which is applied on test data to produce output.
17. [Figure: three bags of balls, labelled Bag 1, Bag 2, Bag 3]
A ball is drawn at random from a bag and found to be white. What are the chances that it was drawn from Bag 3 ?
18. [Figure: three bags of balls, labelled Bag 1, Bag 2, Bag 3]
A ball is drawn at random from a bag and found to be green. What are the chances that it is from Bag 2 ?
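Both questions are applications of Bayes' rule. The bag contents from the original picture are not available, so the counts below are purely hypothetical, and every bag is assumed equally likely to be picked. The function name `posterior` is chosen here for illustration:

```python
from fractions import Fraction

# HYPOTHETICAL contents: the slide's picture of the bags is not available.
BAGS = {
    "Bag 1": {"white": 3, "green": 1},
    "Bag 2": {"white": 1, "green": 3},
    "Bag 3": {"white": 2, "green": 2},
}


def posterior(bags, colour):
    """Return P(bag | colour) for every bag, assuming each bag is picked with equal chance."""
    prior = Fraction(1, len(bags))
    joint = {}
    for name, contents in bags.items():
        # Likelihood P(colour | bag) is the share of that colour inside the bag.
        likelihood = Fraction(contents.get(colour, 0), sum(contents.values()))
        joint[name] = prior * likelihood
    evidence = sum(joint.values())  # P(colour) over all bags
    if evidence == 0:
        raise ValueError("no bag contains a ball of this colour")
    return {name: p / evidence for name, p in joint.items()}


white = posterior(BAGS, "white")
green = posterior(BAGS, "green")
```

With these made-up counts, `white["Bag 3"]` is 1/3 and `green["Bag 2"]` is 1/2. Different bag contents or unequal priors would change both answers.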
20. Bags of words
● Cricket Bag: Sachin, cricket, Tendulkar, run, wicket, centre, crowd, cheered
● Politics Bag: Sachin, Pilot, political, career, drew, crowds, elected, year, government, elections
● Music Bag: Rock, band, took, crowd, storm, grammy
Test sentence: Sachin's 10,000th run was highly appreciated by the crowd
What are the chances that this is from the “Cricket” bag ?
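One common way to answer this is a naive Bayes classifier, sketched below. It is not necessarily the method used in the talk. The word lists are read from the slide's three columns, so which bag each word belongs to is a best guess. Equal priors, add-one smoothing and skipping unknown words are all assumptions of this sketch:

```python
import re
from fractions import Fraction

# Word bags as read from the slide; the column assignment is a best guess.
BAGS = {
    "Cricket": ["sachin", "cricket", "tendulkar", "run", "wicket",
                "centre", "crowd", "cheered"],
    "Politics": ["sachin", "pilot", "political", "career", "drew", "crowds",
                 "elected", "year", "government", "elections"],
    "Music": ["rock", "band", "took", "crowd", "storm", "grammy"],
}


def classify(sentence, bags):
    """Return P(bag | sentence) under naive Bayes with equal priors."""
    vocabulary = {word for words in bags.values() for word in words}
    # Keep only words seen in at least one bag; unknown words are ignored.
    tokens = [t for t in re.findall(r"[a-z]+", sentence.lower()) if t in vocabulary]
    scores = {}
    for name, words in bags.items():
        score = Fraction(1, len(bags))  # equal prior for every bag
        for token in tokens:
            # Add-one (Laplace) smoothing: an unseen word never zeroes the product.
            score *= Fraction(words.count(token) + 1, len(words) + len(vocabulary))
        scores[name] = score
    total = sum(scores.values())
    return {name: float(score / total) for name, score in scores.items()}


TEST_SENTENCE = "Sachin's 10,000th run was highly appreciated by the crowd"
posteriors = classify(TEST_SENTENCE, BAGS)
best_bag = max(posteriors, key=posteriors.get)
```

Only "sachin", "run" and "crowd" from the test sentence occur in any bag. The Cricket bag contains all three, so it gets the highest posterior, while "Sachin" alone cannot separate cricket from politics.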
21. Other applications of probabilistic models
● Targeted Advertising – Demo
● POS Tagging
http://www.infogistics.com/posdemo.htm
● Named Entity Recognition
22. Neural Networks
● Demo video (5 minutes) – Machine learns how to
steer the vehicle by observing the driver
● After training, capable of steering the vehicle on
its own.
23. Uncertainties, Pitfalls
● Depends on the training set
● Training set may not be in line with the
application for which it is being used.
– POS tagging trained on the Wall Street Journal corpus
– Applying the same tagger to teen text full of “like” !!
– “dove makes my skin smooth.”
● Can you deal with “The Black Swan”!!
● Is inductive learning really needed ?
● Combine with other features/approaches ?
● More data better than a good algorithm.