Information Retrieval
1. Information Retrieval
Richard Chbeir, Ph.D.
Laboratoire d'Informatique de l'Université de Pau et des Pays de l'Adour (LIUPPA)
http://liuppa.univ-pau.fr
8. Information Retrieval Process
[Diagram: the user's query and the document collection each go through a processing step, supported by a knowledge base, producing a query representation and document representations; an indexing step stores the document representations; retrieval performs a similarity comparison between the query representation and the document representations; the retrieved documents are returned to the user for relevance evaluation and feedback.]
9. Query/Document processing
• Strip unwanted characters/markup (e.g., HTML tags, punctuation, numbers, etc.)
• Break into tokens (keywords) on whitespace
• Remove common stopwords (using a KB or a list)
  – e.g., a, the, it, www, what, when, from, etc.
• Normalize keywords (using a KB)
  – Using one index entry for synonyms
  – Stem tokens to "root" words (using a KB)
    • computational → comput
• Detect common phrases (using a KB)
• Build the corresponding representation(s)
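As a concrete illustration, here is a minimal Python sketch of this pipeline, assuming a tiny hand-made stopword list and a crude suffix-stripping rule in place of a real KB-based stemmer (the list and the rule are illustrative only):

    import re

    STOPWORDS = {"a", "the", "it", "www", "what", "when", "from", "is"}

    def preprocess(text):
        # Strip unwanted markup and characters (HTML tags, punctuation, digits)
        text = re.sub(r"<[^>]+>", " ", text)
        text = re.sub(r"[^a-zA-Z\s]", " ", text)
        # Break into tokens on whitespace
        tokens = text.lower().split()
        # Remove common stopwords
        tokens = [t for t in tokens if t not in STOPWORDS]
        # Naive "stemming": strip a few common suffixes (illustration only)
        tokens = [re.sub(r"(ational|tion|ing|ed|s)$", "", t) for t in tokens]
        return tokens

    print(preprocess("<p>Computational retrieval of 10 documents from the Web</p>"))
    # ['comput', 'retrieval', 'of', 'document', 'web']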
15. Query/Document processing
• Inverted index
An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.
• Given D0 = "it is what it is", D1 = "what is it" and D2 = "it is a banana", the related record-level inverted index (or inverted file index, or just inverted file) is:
    "a": {2}
    "banana": {2}
    "is": {0, 1, 2}
    "it": {0, 1, 2}
    "what": {0, 1}
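A minimal sketch of building this record-level index from the slide's three documents:

    from collections import defaultdict

    docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

    # Record-level inverted index: term -> set of document ids
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    print({t: sorted(ids) for t, ids in sorted(index.items())})
    # {'a': [2], 'banana': [2], 'is': [0, 1, 2], 'it': [0, 1, 2], 'what': [0, 1]}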
16. But also…
Result: (Université, Bourgogne, Professeur, Institut)
And each term has its importance according to its weight.
17. Query/Document processing
• Inverted index (same definition as on slide 15)
• Given D0 = "it is what it is", D1 = "what is it" and D2 = "it is a banana", the related word-level inverted index (or full inverted index, or inverted list), which maps each term to (document, position) pairs, is:
    "a": {(2, 2)}
    "banana": {(2, 3)}
    "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
    "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
    "what": {(0, 2), (1, 0)}
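And a sketch of this word-level (positional) variant, which records (document, position) pairs instead of bare document ids:

    from collections import defaultdict

    docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

    # Word-level inverted index: term -> set of (doc_id, position) pairs
    pos_index = defaultdict(set)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split()):
            pos_index[term].add((doc_id, position))

    print({t: sorted(p) for t, p in sorted(pos_index.items())})
    # {'a': [(2, 2)], 'banana': [(2, 3)],
    #  'is': [(0, 1), (0, 4), (1, 1), (2, 1)],
    #  'it': [(0, 0), (0, 3), (1, 2), (2, 0)],
    #  'what': [(0, 2), (1, 0)]}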
18. IR representation models
• Boolean retrieval model
• Statistical models
  – Vector space retrieval model
  – Probabilistic retrieval models
19. IR representation models
• Boolean retrieval model
  – A document is represented as a set of keywords: t1, t2, t3, …, tn
  – Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope
    • e.g., [[t1 & t2] | [t3 & t4]] & t5 & !t6
  – Output: a document is either relevant or not relevant. No partial matches or ranking.
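Since Boolean retrieval reduces to set operations on the posting sets of the inverted index, a minimal sketch (query parsing is omitted, and the query itself is illustrative, not from the slides):

    from collections import defaultdict

    docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    # AND -> intersection, OR -> union, NOT -> set difference
    result = (index["what"] & index["is"]) - index["banana"]
    print(sorted(result))   # [0, 1]: documents matching "what AND is AND NOT banana"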
20. IR representation models
• Boolean retrieval model
  – A popular retrieval model
    • Easy to understand for simple queries
    • Clean formalism
  – Boolean models can be extended to include ranking
    • t1 ADJ t2: means the two terms must occur adjacently in the retrieved tuples
    • t1 NEAR t2: means the two terms must occur in a common sentence
  – Reasonably efficient implementations are possible for normal queries
21. IR representation models
• Boolean retrieval model – problems
  – Very rigid: AND means all; OR means any
  – Difficult to express complex user requests
  – Difficult to control the number of documents retrieved: all matched documents will be returned
  – Difficult to rank output: all matched documents logically satisfy the query
  – Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?
22. IR representation models
• Statistical models
  – A document is typically represented by a bag of words (unordered words with frequencies)
  – Bag = a set that allows multiple occurrences of the same element
  – The user specifies a set of desired terms with optional weights:
    • Weighted query terms: Q = <database 0.5; text 0.8; information 0.2>
    • Unweighted query terms: Q = <database; text; information>
    • No Boolean conditions are specified in the query
23. IR representation models
• Statistical models
  – Support "natural language" queries
  – Treat documents and queries the same
  – Retrieval is based on similarity between query and documents
  – Output documents are ranked according to similarity to the query
  – A similarity threshold can be defined
  – Similarity is based on occurrence frequencies of keywords in query and document
  – Automatic relevance feedback can be supported:
    • Relevant documents are "added" to the query
    • Irrelevant documents are "subtracted" from the query
24. IR representation models
• Statistical models – Vector space retrieval model
  – Assume t distinct terms (index terms or vocabulary) remain after preprocessing
  – These "orthogonal" terms form a vector space: dimension = t = |vocabulary|
  – Each term i in a document or query j is given a real-valued weight w_ij
  – Both documents and queries are expressed as t-dimensional vectors:
      d_j = (w_1j, w_2j, …, w_tj)
25. IR representation models
• Statistical models – Vector space retrieval model
  Example:
    D1 = 2T1 + 3T2 + 5T3
    D2 = 3T1 + 7T2 + T3
    Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q plotted as vectors in the three-dimensional term space (T1, T2, T3)]
  • Is D1 or D2 more similar to Q?
  • How to measure the degree of similarity? Distance? Angle? Projection?
26. IR representation models
• Statistical models – Vector space retrieval model
• Document collection
  – A collection of n documents can be represented in the vector space model by a term-document matrix
  – An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document

          T1    T2    …    Tt
    D1    w11   w21   …    wt1
    D2    w12   w22   …    wt2
    :     :     :          :
    Dn    w1n   w2n   …    wtn
27. IR representation models
• Statistical models – Vector space retrieval model
• Term weights: term frequency
  – More frequent terms in a document are more important, i.e. more indicative of the topic
      f_ij = frequency of term i in document j
  – May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
      tf_ij = f_ij / max_i{f_ij}
28. IR representation models
• Statistical models – Vector space retrieval model
• Term weights: inverse document frequency
  – Terms that appear in many different documents are less indicative of the overall topic
      df_i = document frequency of term i = number of documents containing term i
      idf_i = inverse document frequency of term i = log2(N / df_i)
      (N: total number of documents)
29. IR representation models
• Statistical models – Vector space retrieval model
• TF-IDF weighting
  – A typical combined term importance indicator is tf-idf weighting:
      w_ij = tf_ij · idf_i = tf_ij · log2(N / df_i)
  – A term occurring frequently in the document but rarely in the rest of the collection is given high weight
  – Many other ways of determining term weights have been proposed
  – Experimentally, tf-idf has been found to work well
30. IR representation models
• Statistical models – Vector space retrieval model
• TF-IDF weighting – example
  Given a document containing terms with the frequencies A(3), B(2), C(1).
  Assume the collection contains 10,000 documents and the document frequencies of these terms are A(50), B(1300), C(250). Then:
    A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
    B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
    C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
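A quick sketch reproducing the slide's numbers, using only the Python standard library:

    from math import log2

    N = 10_000                              # total documents in the collection
    f = {"A": 3, "B": 2, "C": 1}            # term frequencies in the document
    df = {"A": 50, "B": 1300, "C": 250}     # document frequencies in the collection

    max_f = max(f.values())
    for term in f:
        tf = f[term] / max_f                # normalized term frequency
        idf = log2(N / df[term])            # inverse document frequency
        print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
    # A 1.0 7.6 7.6
    # B 0.67 2.9 2.0
    # C 0.33 5.3 1.8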
31. IR representation models
• Statistical models – Vector space retrieval model
• Queries
  – A query is treated as a document and is also tf-idf weighted
  – An alternative is for the user to supply weights for the given query terms
32. IR representation models
• Statistical models – Vector space retrieval model
• Similarity measure – inner product
  – Similarity between the vector for document d_j and query q can be computed as the vector inner product (a.k.a. dot product):
      sim(d_j, q) = d_j • q = Σ_{i=1..t} w_ij · w_iq
    where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query
  – For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection)
  – For weighted term vectors, it is the sum of the products of the weights of the matched terms
33. IR representation models
• Statistical models – Vector space retrieval model
• Similarity measure – inner product properties
  – Unbounded
  – Favors long documents with a large number of unique terms
  – Measures how many terms matched, but not how many terms are not matched
34. IR representation models
• Statistical models – Vector space retrieval model
• Similarity measure – inner product example
  Binary (size of vector = size of vocabulary = 7; 0 means the corresponding term was not found in the document or query):
    D = (1, 1, 1, 0, 1, 1, 0)
    Q = (1, 0, 1, 0, 0, 1, 1)
    sim(D, Q) = 3
  Weighted:
    D1 = 2T1 + 3T2 + 5T3
    D2 = 3T1 + 7T2 + 1T3
    Q = 0T1 + 0T2 + 2T3
    sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
    sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
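A minimal sketch of both variants of the inner product from this slide:

    def dot(u, v):
        # Inner product; for 0/1 vectors this counts matched query terms
        return sum(a * b for a, b in zip(u, v))

    # Binary example
    D = (1, 1, 1, 0, 1, 1, 0)
    Q = (1, 0, 1, 0, 0, 1, 1)
    print(dot(D, Q))                 # 3

    # Weighted example
    D1, D2, Qw = (2, 3, 5), (3, 7, 1), (0, 0, 2)
    print(dot(D1, Qw), dot(D2, Qw))  # 10 2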
35. IR representation models
• Statistical models – Vector space retrieval model
• Similarity measure – cosine
[Figure: D1, D2 and Q drawn as vectors in the term space (t1, t2, t3); Θ1 is the angle between D1 and Q, Θ2 the angle between D2 and Q]
    CosSim(d_j, q) = (d_j • q) / (|d_j| · |q|) = Σ_i w_ij·w_iq / √(Σ_i w_ij² · Σ_i w_iq²)
    D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)·(0+0+4)) = 0.81
    D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)·(0+0+4)) = 0.13
    Q = 0T1 + 0T2 + 2T3
  D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
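And the cosine measure on the same weighted vectors, a sketch reproducing the slide's values:

    from math import sqrt

    def cos_sim(u, v):
        # Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)
        dot = sum(a * b for a, b in zip(u, v))
        return dot / sqrt(sum(a * a for a in u) * sum(b * b for b in v))

    D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
    print(round(cos_sim(D1, Q), 2))   # 0.81
    print(round(cos_sim(D2, Q), 2))   # 0.13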
36. IR representation models
• Statistical models – Vector space retrieval model properties
  – Simple, mathematically based approach
  – Considers both local (tf) and global (idf) word occurrence frequencies
  – Provides partial matching and ranked results
  – Tends to work quite well in practice despite obvious weaknesses
  – Allows efficient implementation for large document collections
37. IR representation models
• Statistical models – Vector space retrieval model problems
  – Missing semantic information (e.g., word sense)
  – Missing syntactic information (e.g., phrase structure, word order, proximity information)
  – Assumes term independence (e.g., ignores synonymy) and that term dimensions are orthogonal
  – Does not consider hot topics and documents
  – Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
    • Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently
38. Knowledge Base
• A knowledge base is a semantic network with a set of concepts representing groups of words/expressions, and a set of links connecting the concepts, representing semantic relations
• Used in:
  – Indexing normalization
  – Query rewriting
[Figure: an example semantic network. Concepts (synonym sets) include Government, State, Legislature, Executive, Congress, Senate, Difficulty, Crisis = {Emergency, Pinch, Exigency}, {Ease, relief}, {Aid, help}, Loan, Bailout, and {US president, Chief executive} with the instance {Bush, George W Bush, George Walker Bush}. Links represent hyponym/hypernym relations, meronym/holonym relations, and antonymy (×).]
40. Similarity and ranking
• Ranking is related to:
  – Similarity score
  – Popularity of the requested documents
  – Popularity of query topics
  – External parameters (events)
  – User preferences
41. Relevance Feedback
• Relevance is a subjective judgment and may include:
  – Being on the proper subject
  – Being timely (recent information)
  – Being authoritative (from a trusted source)
  – Satisfying the goals of the user and his/her intended use of the information (information need)
42. Relevance Feedback
• Query modification
• Document modification
  – To allow other users to benefit from previous feedback
43. Relevance Feedback
• Query modification
  – Terms occurring in documents previously identified as relevant are added to the original query, or the weight of such terms is increased
  – Terms occurring in documents previously identified as irrelevant are deleted from the original query, or the weight of such terms is reduced

      Q_{i+1} = Q_i + (1/r) · Σ_{k=1..r} d_rel^(k) − (1/n) · Σ_{k=1..n} d_irrel^(k)

    where r is the number of relevant documents, n the number of irrelevant documents, and d_rel^(k), d_irrel^(k) their vectors
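A minimal sketch of this Rocchio-style query modification on toy 3-term vectors, following the running D1/D2/Q example (treating D1 as relevant and D2 as irrelevant is an illustrative assumption):

    def modify_query(query, relevant, irrelevant):
        # New query = old query + mean of relevant vectors - mean of irrelevant vectors
        new_q = list(query)
        for i in range(len(query)):
            if relevant:
                new_q[i] += sum(d[i] for d in relevant) / len(relevant)
            if irrelevant:
                new_q[i] -= sum(d[i] for d in irrelevant) / len(irrelevant)
        return new_q

    q = [0.0, 0.0, 2.0]
    print(modify_query(q, relevant=[[2, 3, 5]], irrelevant=[[3, 7, 1]]))
    # [-1.0, -4.0, 6.0]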
44. Relevance Feedback
• Document modification
  – Terms in the query are added to the document index list with an initial weight
  – Weights of index terms that are in both the query and relevant documents are increased
  – Weights of index terms not in the query but in relevant documents are decreased
45. Performance measurement
• What can be measured that reflects users' ability to use the system?
  – Coverage of information
  – Form of presentation effectiveness
  – Effort required / ease of use
  – Time and space efficiency
  – Recall: the proportion of relevant material actually retrieved
  – Precision: the proportion of retrieved material actually relevant
46. Performance measurement
• In what ways can a document be relevant to a query?
  – Answer a precise question precisely
  – Partially answer a question
  – Suggest a source for more information
  – Give background information
  – Remind the user of other knowledge
  – Others...
47. Performance measurement
• Standard IR evaluation
[Figure: the retrieved documents shown as a subset of the whole collection]
• Precision = (# relevant retrieved) / (# retrieved)
• Recall = (# relevant retrieved) / (# relevant in collection)
48. Performance measurement
• Standard IR evaluation
[Figure: the retrieved documents shown as a subset of the whole collection]
• Noise = (# irrelevant retrieved) / (# retrieved)
• Silence = (# relevant not retrieved) / (# relevant in collection)
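A minimal sketch computing all four measures from sets of document ids (the ids and judgments are illustrative):

    retrieved = {1, 2, 3, 4, 5}
    relevant = {2, 3, 7, 9}          # relevant documents in the collection

    rel_retrieved = retrieved & relevant
    precision = len(rel_retrieved) / len(retrieved)        # 2/5 = 0.4
    recall = len(rel_retrieved) / len(relevant)            # 2/4 = 0.5
    noise = len(retrieved - relevant) / len(retrieved)     # 3/5 = 0.6 (= 1 - precision)
    silence = len(relevant - retrieved) / len(relevant)    # 2/4 = 0.5 (= 1 - recall)
    print(precision, recall, noise, silence)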
49. Performance measurement
• Standard IR evaluation – P/R curves
  – There is a tradeoff between precision and recall
  – So measure precision at different levels of recall
[Figure: two precision-recall curves A and B; B's curve lies above A's, so B performs better]
50. Performance measurement
• Standard IR evaluation – P/R curves
  – It is difficult to determine which of these two hypothetical results is better
[Figure: two hypothetical precision-recall curves that cross, making the comparison ambiguous]
51. Performance measurement
• Document cutoff levels
  – Another way to evaluate: fix the number of documents retrieved at several levels (top 5, top 10, top 20, top 50, top 100, top 500), measure precision at each of these levels, and take a (weighted) average over the results
  – This is a way to focus on high precision
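A sketch of precision at fixed cutoff levels over a ranked result list (the ranking and relevance judgments are illustrative):

    ranked = [3, 8, 7, 1, 9, 4, 2, 6, 5, 10]   # document ids, best first
    relevant = {3, 7, 9, 10}

    def precision_at(k, ranked, relevant):
        # Precision over the top-k retrieved documents
        return sum(1 for d in ranked[:k] if d in relevant) / k

    for k in (5, 10):
        print(k, precision_at(k, ranked, relevant))
    # 5 0.6   (documents 3, 7 and 9 are relevant in the top 5)
    # 10 0.4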
52. Performance measurement
• F-measure
  – A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score:
      F = 2 · P · R / (P + R)
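Continuing the sketch, the balanced F-score for the precision/recall values computed above:

    def f_measure(precision, recall):
        # Harmonic mean of precision and recall (balanced F-score)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(round(f_measure(0.4, 0.5), 3))   # 0.444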
53. Web Search System
[Diagram: a spider crawls the Web to build the document corpus; the IR system matches the user's query string against the corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, …)]
54. Web search engines vs. IRS
• Differences:
  – Must assemble the document corpus by spidering the web
  – Can exploit the structural layout information in HTML (XML)
  – Documents are distributed
  – Documents change uncontrollably
  – Can exploit the link structure of the web
55. Web search engines – crawling
• Spider, wanderer, crawler, walker, ant
  – "A computer program that browses the World Wide Web in a methodical, automated manner." (Wikipedia)
• Utilities:
  – Supports a search engine
  – Gathers pages from the Web
    • Starting from one page, determine which page(s) to go to next
  – Builds an index
• Objects:
  – Text, video, images and so on
  – Link structure
56. Web search engines – crawling
• Current crawlers:
  – GoogleBot
  – Nutch
  – Big Brother
  – Calif
  – CoolBot
  – Download Express
  – Get URL
  – …
57. Web search engines – crawling
• How to parse web pages?
[Diagram: a URL is downloaded over HTTP; a hyperlink parser extracts hyperlinks (video URLs, image URLs, HTML URLs) into a URL buffer; a diff operator (A − B) against the set B of already visited pages yields the new URLs/URIs to crawl]
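A minimal sketch of this crawl loop using only the Python standard library (the seed URL is a placeholder; a real crawler would also handle errors, content types, and the politeness and robots.txt policies discussed on the following slides):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # Hyperlink parser: collects href values from <a> tags
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    frontier = deque(["http://example.com/"])   # URL buffer, seeded with one page
    visited = set()                             # already visited pages (the "B" set)

    while frontier and len(visited) < 10:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        parser = LinkExtractor()
        parser.feed(html)
        # Diff operator: only enqueue links not already visited
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)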
59. Web search engines – crawling
• Policies
  – Selection policy
    • PageRank
    • Path ascending
    • Focused crawling
  – Re-visit policy
    • Freshness
    • Age
  – Politeness
    • So that crawlers don't overload web servers
    • Set a delay between GET requests
    • Which pages can be crawled, and which cannot
60. Web search engines – crawling
• Policies – robot rules, by Isaac Asimov in 1942:
  1. A robot may not injure a human being, or, through inaction, allow a human being to come to harm
  2. A robot must obey orders given to it by human beings, except where such orders would conflict with the First Law
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law
http://www.asimovonline.com/
61. Web search engines – crawling
• Policies – web robot rules
  – A web robot must show identification
  – A web robot must not control resources
  – A web robot must report errors
  – A web robot must obey the exclusion standard ("/robots.txt" or the META tags)

  Directive syntax:
    User-Agent: {value}    (robot name; the value is a name or "*")
    {Option}: {value}      (e.g., Disallow; the value is a URL, "/", etc.)
62. Web search engines – crawling
• Policies – web robot rules
  – A web robot must obey the exclusion standard ("/robots.txt" or the META tags)
  – Example 1: deny access to all robots

      User-agent: *
      Disallow: /
63. Web search engines – crawling
• Policies – web robot rules
  – A web robot must obey the exclusion standard ("/robots.txt" or the META tags)
  – Example 2: deny access for all robots to the given folder and file, except GoogleBot

      User-agent: *
      Disallow: /u-bourgogne/LE2I/
      Disallow: /MR.html

      User-agent: GoogleBot
      Disallow:
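A sketch of checking these Example 2 rules with Python's standard urllib.robotparser module (the robot name "SomeBot" and the test paths other than the slide's are illustrative):

    from urllib.robotparser import RobotFileParser

    rules = """\
    User-agent: *
    Disallow: /u-bourgogne/LE2I/
    Disallow: /MR.html

    User-agent: GoogleBot
    Disallow:
    """

    rp = RobotFileParser()
    rp.parse(rules.splitlines())

    print(rp.can_fetch("SomeBot", "/u-bourgogne/LE2I/page.html"))  # False
    print(rp.can_fetch("SomeBot", "/index.html"))                  # True
    print(rp.can_fetch("GoogleBot", "/MR.html"))                   # True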
64. Web search engines – crawling
• Policies – web robot rules
  – A web robot must obey exclusion directives given in the HTML META tag, e.g. one of:

      <meta name="robots" content="index,follow">
      <meta name="robots" content="noindex,follow">
      <meta name="robots" content="index,nofollow">
      <meta name="robots" content="noindex,nofollow">
65. Web search engines – crawling
• Policies
  – Parallelization
    • Distributed web crawling
    • To maximize the download rate
  – Robustness: spider traps
    • Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
    • Pages filled with a large number of characters
66. Web search engines – crawling
• Policies – a good robot must:
  – Get the header (when possible)
  – Crawl during convenient periods
  – Avoid loops
  – Download moderately
    • Adopt rotation-based requests among servers
    • Delay requests to the same server
  – Ignore forms
67. Intelligent IRS
• Taking into account the meaning of the words used
• Taking into account the order of words in the query
• Taking into account the authority of the source
• Adapting to the user based on direct or indirect feedback