Mathematical Concepts for IR Models

Mathematical Roadmap for IR

Introduction to the mathematical concepts
commonly used in IR/ST

- Abhay Shete, 42

Why bother about the maths ?

● Lots of open source libraries out there
● General approach for external software
– Include it
– Build it
– Forget it !!

IR is fuzzy
● Consider the following sentences
– India beat Pakistan
– Sachin has scored over 10,000 runs in Test cricket
– Sachin Tendulkar was unable to run to the non-striker's end
● How do we humans interpret this information ?
● How do machines interpret it ?

What excites me about IR ?

● It's all about indulging and flirting with sexy
models !!!!

Models of interest

● Vector space models
● Inductive models
– Probabilistic
– Neural networks, decision trees
● Hybrid – vector space and inductive
● Others...

Vector Space Model
● Recall junior college mathematics
● Dot product, cross product, projections, etc.

Vector Space Model

● A1, A2 are articles about animals
● P1 contains politics
● To find similarity scores between
– a) A1, A2 ==> assume score is S1
– b) A1, P1 ==> assume score is S2
● Intuitively S1 > S2
● How do you create a model around this ?

Vector Space Model
● A1 = {dog(1), cat(2), lion(1), rhino(1), tiger(2)}
● A2 = {eagle(1), cat(1), tiger(1), fox(1), hyena(1)}

N Dimensional Space !!!
● d1 = 1(dog) + 2(cat) + 1(lion) + 1(rhino) +
2(tiger) + 0(eagle) + 0(fox) + 0(hyena)
● d2 = 0(dog) + 1(cat) + 0(lion) + 0(rhino) +
1(tiger) + 1(eagle) + 1(fox) + 1(hyena)
● Similarity is measured by the angle between the
two vectors !!

Similarity measured by the angle between
two vectors !!

If X and Y are two n-dimensional vectors
<xi> and <yi>, the angle θ between them
satisfies:
X · Y = |X| |Y| cos θ
cos θ = (X · Y) / (|X| |Y|)
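
The cosine formula can be tried on the two animal articles from the earlier slide. The sketch below is only an illustration: the vocabulary order follows d1 and d2, while the politics article P1 and its word counts are hypothetical (the slides give no contents for P1).

```python
import math

# Word counts for A1 and A2 as given on the slide.
a1 = {"dog": 1, "cat": 2, "lion": 1, "rhino": 1, "tiger": 2}
a2 = {"eagle": 1, "cat": 1, "tiger": 1, "fox": 1, "hyena": 1}
# Hypothetical politics article; these counts are an assumption, not from the slides.
p1 = {"election": 2, "vote": 1, "party": 1}

# One dimension per distinct word across all documents.
vocab = sorted(set(a1) | set(a2) | set(p1))


def to_vector(counts):
    """Map a word-count dict onto the shared vocabulary axes."""
    return [counts.get(term, 0) for term in vocab]


def cosine_similarity(x, y):
    """cos(theta) = (X . Y) / (|X| |Y|)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)


s1 = cosine_similarity(to_vector(a1), to_vector(a2))  # A1 vs A2
s2 = cosine_similarity(to_vector(a1), to_vector(p1))  # A1 vs P1
print(f"S1 = {s1:.3f}, S2 = {s2:.3f}")
```

With these counts the A1/A2 dot product is 4 (cat and tiger overlap), so S1 = 4 / (√11 · √5), while A1 and the hypothetical P1 share no words and S2 is 0, matching the intuition that S1 > S2.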

Linear Algebra

● Vectors and matrices are closely related: stacking the document vectors gives a matrix
● Important Linear Algebra concepts:
– Singular Value Decomposition
– Eigenvectors
● Latent Semantic Indexing (LSI)
● Face recognition

Latent Semantic Indexing

● Exact keyword match not required
● No predefined semantic knowledge base
● Demo
– http://lsi.research.telcordia.com/lsi-bin/lsiQuery
● How does it work ?
– Fish Tank Analogy
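
One way to see why an exact keyword match is not required is a tiny LSI run. The sketch below is a minimal illustration using NumPy's SVD, not the method behind the demo above: the four toy documents, the choice of k = 2, and the helper names `fold_in` and `cosine` are all assumptions made for this example.

```python
import numpy as np

terms = ["car", "automobile", "engine", "election", "vote"]
# Hypothetical toy corpus: two documents about cars, two about politics.
docs = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["election", "vote"],
    ["election"],
]

# Term-document count matrix: one row per term, one column per document.
A = np.array([[doc.count(t) for doc in docs] for t in terms], dtype=float)

# Truncated SVD: keep only the k largest singular values ("latent topics").
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
doc_vecs = Vt_k.T  # one k-dimensional row per document


def fold_in(query_terms):
    """Project a bag-of-words query into the k-dimensional latent space."""
    q = np.array([query_terms.count(t) for t in terms], dtype=float)
    return (U_k.T @ q) / s_k


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


query = ["car"]
q_raw = np.array([query.count(t) for t in terms], dtype=float)
raw_scores = [cosine(q_raw, A[:, j]) for j in range(len(docs))]
lsi_scores = [cosine(fold_in(query), d) for d in doc_vecs]
print("keyword match:", np.round(raw_scores, 2))
print("LSI          :", np.round(lsi_scores, 2))
```

The second document never mentions "car", so its plain keyword score is zero, yet it shares "engine" with the first document and lands next to it in the latent space, so LSI scores it highly for the query.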

Uncertainties, Pitfalls

● Not suitable for fine-grained search
● Web Scale corpus !!
● Does anybody find Google suggestions really
helpful ?

Face recognition application

● Eigenvalues, covariance and SVD
● Intuitive Understanding
– Create a vector space from the pixels modelling the
face
– Find “average” face vector
– Find the specific features of every face by computing
the variance from the average vector
– Take the top N specific features of this vector space
and create the “classifier” vector
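
The steps above can be sketched with NumPy. This is a rough illustration under stated assumptions, not a working face recogniser: the "faces" are random 8x8 pixel arrays, the value of `top_n` is arbitrary, and matching by nearest projected neighbour is one simple choice of classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 6 "faces", each an 8x8 grayscale image flattened to 64 pixels.
faces = rng.random((6, 64))

# Step 1-2: treat each face as a vector and find the "average" face.
mean_face = faces.mean(axis=0)

# Step 3: what is specific to each face is its deviation from the average.
centered = faces - mean_face

# The SVD of the centered data gives the eigenvectors of its covariance matrix;
# the rows of vt are the principal directions, often called "eigenfaces".
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Step 4: keep only the top N directions.
top_n = 3
eigenfaces = vt[:top_n]


def project(face):
    """Describe a face by its weights along the top N eigenfaces."""
    return eigenfaces @ (face - mean_face)


known_weights = np.array([project(f) for f in faces])


def nearest_face(face):
    """Return the index of the known face closest in eigenface space."""
    distances = np.linalg.norm(known_weights - project(face), axis=1)
    return int(np.argmin(distances))


# A slightly noisy copy of face 2 should still be matched to face 2.
probe = faces[2] + 0.01 * rng.standard_normal(64)
match = nearest_face(probe)
print("probe matched to face", match)
```

Each face is reduced from 64 pixel values to 3 weights, and comparison happens in that much smaller space.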

Document Clustering

● Clusty.com demo
● Clustering approaches

Inductive Models

● Require prior information (Training set) to
extrapolate to future unseen cases.

● Create a “function” from the input training data
which is applied on test data to produce output.

(Figure: three bags of coloured balls, labelled Bag 1, Bag 2 and Bag 3)

A ball is drawn at random from a bag
and found to be white. What are the
chances that it was drawn from Bag 3?

(Figure: three bags of coloured balls, labelled Bag 1, Bag 2 and Bag 3)

A ball is drawn at random from a bag
and found to be green. What are
the chances that it is from Bag 2 ?

● Any way to model this ?
● Bayes Theorem !!!
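
A small worked example of Bayes' theorem for the bag question: P(bag | colour) = P(colour | bag) · P(bag) / P(colour). The ball counts in each bag are not recoverable from the slides, so the counts below are hypothetical, as is the assumption that each bag is equally likely to be picked.

```python
from fractions import Fraction

# Hypothetical contents; the slide's actual ball counts are not known.
bags = {
    "Bag 1": {"white": 1, "green": 3},
    "Bag 2": {"white": 2, "green": 2},
    "Bag 3": {"white": 3, "green": 1},
}

# Assumption: every bag is equally likely to be chosen.
prior = {name: Fraction(1, len(bags)) for name in bags}


def likelihood(colour, bag):
    """P(colour | bag): fraction of balls in the bag with that colour."""
    contents = bags[bag]
    return Fraction(contents.get(colour, 0), sum(contents.values()))


def posterior(colour):
    """P(bag | colour) for every bag, via Bayes' theorem."""
    joint = {bag: prior[bag] * likelihood(colour, bag) for bag in bags}
    evidence = sum(joint.values())  # P(colour), summed over all bags
    return {bag: joint[bag] / evidence for bag in bags}


p_bag3_given_white = posterior("white")["Bag 3"]
p_bag2_given_green = posterior("green")["Bag 2"]
print("P(Bag 3 | white) =", p_bag3_given_white)
print("P(Bag 2 | green) =", p_bag2_given_green)
```

With these assumed counts, seeing a white ball raises the probability of Bag 3 from the prior of 1/3 to 1/2, because Bag 3 holds the most white balls.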

Cricket Bag: Sachin, cricket, Tendulkar, drew, crowds, run, wicket, centre, crowd, cheered

Politics Bag: Sachin, Pilot, political, career, elected, year, government, elections

Music Bag: Rock, band, took, crowd, storm, grammy

Test sentence:- Sachin's 10,000th run was
highly appreciated by the crowd

What are the chances that this is from the “Cricket” bag ?
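
The same Bayes reasoning extends from balls to words. The sketch below treats each bag as a tiny training set and scores the test sentence with a naive Bayes classifier. Several choices here are assumptions made for illustration: the word lists are as read from the slide, the priors are equal, add-one (Laplace) smoothing is used, and words that appear in no bag are ignored.

```python
import re
from fractions import Fraction

# Word bags as read from the slide (lowercased).
bags = {
    "cricket": ["sachin", "cricket", "tendulkar", "drew", "crowds",
                "run", "wicket", "centre", "crowd", "cheered"],
    "politics": ["sachin", "pilot", "political", "career", "elected",
                 "year", "government", "elections"],
    "music": ["rock", "band", "took", "crowd", "storm", "grammy"],
}
vocabulary = {word for words in bags.values() for word in words}


def tokenize(text):
    """Lowercase, drop possessive 's, and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower().replace("'s", ""))


def classify(text):
    """Return P(bag | text) for every bag under the naive Bayes assumption."""
    # Assumption: words outside the vocabulary carry no evidence and are skipped.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    prior = Fraction(1, len(bags))  # assumption: equal priors
    scores = {}
    for label, words in bags.items():
        score = prior
        for token in tokens:
            # Add-one smoothing so an unseen word does not zero out the bag.
            score *= Fraction(words.count(token) + 1,
                              len(words) + len(vocabulary))
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


sentence = "Sachin's 10,000th run was highly appreciated by the crowd"
posterior = classify(sentence)
best = max(posterior, key=posterior.get)
for label, p in posterior.items():
    print(f"P({label} | sentence) = {float(p):.2f}")
```

Only "sachin", "run" and "crowd" carry evidence here. "Sachin" alone is ambiguous between cricket and politics, and "crowd" between cricket and music, but the cricket bag is the only one containing all three words.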

Other applications on probabilistic
models
● Targeted Advertising – Demo
● POS Tagging
http://www.infogistics.com/posdemo.htm
● Named Entity Recognition

Neural Networks

● Demo video (5 minutes) – Machine learns how to
steer the vehicle by observing the driver
● After training, capable of steering the vehicle on
its own.

Uncertainties, Pitfalls
● Depends on the training set
● Training set may not be in line with the
application for which it is being used.
– POS tagger trained on the Wall Street Journal corpus
– Applying the same tagger to, “like”, teen text, where “like” is a filler word !!
– “dove makes my skin smooth.”
● Can you deal with “The Black Swan”!!
● Is inductive learning really needed ?
● Combine with other features/approaches ?
● More data better than a good algorithm.
