Information Retrieval
1. Information Retrieval
Richard Chbeir, Ph.D.
Laboratoire d'Informatique de l'Université de Pau et des Pays de l'Adour (LIUPPA)
http://liuppa.univ-pau.fr
8. Information Retrieval Process
[Diagram: the user's query and the document collection each go through a processing step, supported by a knowledge base, producing a query representation and document representations; an indexing step stores the document representations; retrieval performs a similarity comparison between the query representation and the document representations; the retrieved documents are returned to the user for relevance evaluation and feedback.]
9. Query/Document processing
• Strip unwanted characters/markup (e.g., HTML tags, punctuation, numbers, etc.)
• Break into tokens (keywords) on whitespace
• Remove common stopwords (using a KB or a list)
  – e.g., a, the, it, www, what, when, from, etc.
• Normalize keywords (using a KB)
  – Using one index entry for synonyms
  – Stem tokens to "root" words (using a KB)
    • computational → comput
• Detect common phrases (using a KB)
• Build the corresponding representation(s)
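As a concrete illustration, here is a minimal Python sketch of this pipeline, assuming a tiny hand-made stopword list and a crude suffix-stripping rule in place of a real KB-based stemmer (the list and the rule are illustrative only):

    import re

    STOPWORDS = {"a", "the", "it", "www", "what", "when", "from", "is"}

    def preprocess(text):
        # Strip unwanted markup and characters (HTML tags, punctuation, digits)
        text = re.sub(r"<[^>]+>", " ", text)
        text = re.sub(r"[^a-zA-Z\s]", " ", text)
        # Break into tokens on whitespace
        tokens = text.lower().split()
        # Remove common stopwords
        tokens = [t for t in tokens if t not in STOPWORDS]
        # Naive "stemming": strip a few common suffixes (illustration only)
        tokens = [re.sub(r"(ational|tion|ing|ed|s)$", "", t) for t in tokens]
        return tokens

    print(preprocess("<p>Computational retrieval of 10 documents from the Web</p>"))
    # ['comput', 'retrieval', 'of', 'document', 'web']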
15. Query/Document processing
• Inverted index
An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.
• Given D0 = "it is what it is", D1 = "what is it" and D2 = "it is a banana", the related record-level inverted index (or inverted file index, or just inverted file) is:
    "a": {2}
    "banana": {2}
    "is": {0, 1, 2}
    "it": {0, 1, 2}
    "what": {0, 1}
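A minimal sketch of building this record-level index from the slide's three documents:

    from collections import defaultdict

    docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

    # Record-level inverted index: term -> set of document ids
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    print({t: sorted(ids) for t, ids in sorted(index.items())})
    # {'a': [2], 'banana': [2], 'is': [0, 1, 2], 'it': [0, 1, 2], 'what': [0, 1]}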
16. But also…
Result: (Université, Bourgogne, Professeur, Institut)
And each term has its importance according to its weight.
17. Query/Document processing
• Inverted index (same definition as on slide 15)
• Given D0 = "it is what it is", D1 = "what is it" and D2 = "it is a banana", the related word-level inverted index (or full inverted index, or inverted list), which maps each term to (document, position) pairs, is:
    "a": {(2, 2)}
    "banana": {(2, 3)}
    "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
    "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
    "what": {(0, 2), (1, 0)}
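And a sketch of this word-level (positional) variant, which records (document, position) pairs instead of bare document ids:

    from collections import defaultdict

    docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

    # Word-level inverted index: term -> set of (doc_id, position) pairs
    pos_index = defaultdict(set)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split()):
            pos_index[term].add((doc_id, position))

    print({t: sorted(p) for t, p in sorted(pos_index.items())})
    # {'a': [(2, 2)], 'banana': [(2, 3)],
    #  'is': [(0, 1), (0, 4), (1, 1), (2, 1)],
    #  'it': [(0, 0), (0, 3), (1, 2), (2, 0)],
    #  'what': [(0, 2), (1, 0)]}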
18. IR representation models
• Boolean retrieval model
• Statistical models
  – Vector space retrieval model
  – Probabilistic retrieval models
19. IR representation models
• Boolean retrieval model
  – A document is represented as a set of keywords: t1, t2, t3, …, tn
  – Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope
    • e.g., [[t1 & t2] | [t3 & t4]] & t5 & !t6
  – Output: a document is either relevant or not relevant. No partial matches or ranking.
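Since Boolean retrieval reduces to set operations on the posting sets of the inverted index, a minimal sketch (query parsing is omitted, and the query itself is illustrative, not from the slides):

    from collections import defaultdict

    docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    # AND -> intersection, OR -> union, NOT -> set difference
    result = (index["what"] & index["is"]) - index["banana"]
    print(sorted(result))   # [0, 1]: documents matching "what AND is AND NOT banana"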
20. IR representation models
• Boolean retrieval model
  – A popular retrieval model
    • Easy to understand for simple queries
    • Clean formalism
  – Boolean models can be extended to include ranking
    • t1 ADJ t2: means the two terms must occur adjacently in the retrieved tuples
    • t1 NEAR t2: means the two terms must occur in a common sentence
  – Reasonably efficient implementations are possible for normal queries
21. IR representation models
• Boolean retrieval model – problems
  – Very rigid: AND means all; OR means any
  – Difficult to express complex user requests
  – Difficult to control the number of documents retrieved: all matched documents will be returned
  – Difficult to rank output: all matched documents logically satisfy the query
  – Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?
22. IR representation models
• Statistical models
  – A document is typically represented by a bag of words (unordered words with frequencies)
  – Bag = a set that allows multiple occurrences of the same element
  – The user specifies a set of desired terms with optional weights:
    • Weighted query terms: Q = <database 0.5; text 0.8; information 0.2>
    • Unweighted query terms: Q = <database; text; information>
    • No Boolean conditions are specified in the query
23. IR representation models
• Statistical models
  – Support "natural language" queries
  – Treat documents and queries the same
  – Retrieval is based on similarity between query and documents
  – Output documents are ranked according to similarity to the query
  – A similarity threshold can be defined
  – Similarity is based on occurrence frequencies of keywords in query and document
  – Automatic relevance feedback can be supported:
    • Relevant documents are "added" to the query
    • Irrelevant documents are "subtracted" from the query
24. IR representation models
• Statistical models – Vector space retrieval model
  – Assume t distinct terms (index terms or vocabulary) remain after preprocessing
  – These "orthogonal" terms form a vector space: dimension = t = |vocabulary|
  – Each term i in a document or query j is given a real-valued weight w_ij
  – Both documents and queries are expressed as t-dimensional vectors:
      d_j = (w_1j, w_2j, …, w_tj)
25. IR representation models
• Statistical models – Vector space retrieval model
  Example:
    D1 = 2T1 + 3T2 + 5T3
    D2 = 3T1 + 7T2 + T3
    Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q plotted as vectors in the three-dimensional term space (T1, T2, T3)]
  • Is D1 or D2 more similar to Q?
  • How to measure the degree of similarity? Distance? Angle? Projection?
26. IR representation models
• Statistical models – Vector space retrieval model
• Document collection
  – A collection of n documents can be represented in the vector space model by a term-document matrix
  – An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document

          T1    T2    …    Tt
    D1    w11   w21   …    wt1
    D2    w12   w22   …    wt2
    :     :     :          :
    Dn    w1n   w2n   …    wtn
27. IR representation models
• Statistical models – Vector space retrieval model
• Term weights: term frequency
  – More frequent terms in a document are more important, i.e. more indicative of the topic
      f_ij = frequency of term i in document j
  – May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
      tf_ij = f_ij / max_i{f_ij}
28. IR representation models
• Statistical models – Vector space retrieval model
• Term weights: inverse document frequency
  – Terms that appear in many different documents are less indicative of the overall topic
      df_i = document frequency of term i = number of documents containing term i
      idf_i = inverse document frequency of term i = log2(N / df_i)
      (N: total number of documents)
29. IR representation models
• Statistical models – Vector space retrieval model
• TF-IDF weighting
  – A typical combined term importance indicator is tf-idf weighting:
      w_ij = tf_ij · idf_i = tf_ij · log2(N / df_i)
  – A term occurring frequently in the document but rarely in the rest of the collection is given high weight
  – Many other ways of determining term weights have been proposed
  – Experimentally, tf-idf has been found to work well
30. IR representation models
• Statistical models – Vector space retrieval model
• TF-IDF weighting – example
  Given a document containing terms with the frequencies A(3), B(2), C(1).
  Assume the collection contains 10,000 documents and the document frequencies of these terms are A(50), B(1300), C(250). Then:
    A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
    B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
    C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
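A quick sketch reproducing the slide's numbers, using only the Python standard library:

    from math import log2

    N = 10_000                              # total documents in the collection
    f = {"A": 3, "B": 2, "C": 1}            # term frequencies in the document
    df = {"A": 50, "B": 1300, "C": 250}     # document frequencies in the collection

    max_f = max(f.values())
    for term in f:
        tf = f[term] / max_f                # normalized term frequency
        idf = log2(N / df[term])            # inverse document frequency
        print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
    # A 1.0 7.6 7.6
    # B 0.67 2.9 2.0
    # C 0.33 5.3 1.8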
31. IR representation models
• Statistical models – Vector space retrieval model
• Queries
  – A query is treated as a document and is also tf-idf weighted
  – An alternative is for the user to supply weights for the given query terms
32. IR representation models
• Statistical models – Vector space retrieval model
• Similarity measure – inner product
  – Similarity between the vector for document d_j and query q can be computed as the vector inner product (a.k.a. dot product):
      sim(d_j, q) = d_j • q = Σ_{i=1..t} w_ij · w_iq
    where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query
  – For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection)
  – For weighted term vectors, it is the sum of the products of the weights of the matched terms
33. IR representation models
• Statistical models – Vector space retrieval model
• Similarity measure – inner product properties
  – Unbounded
  – Favors long documents with a large number of unique terms
  – Measures how many terms matched, but not how many terms are not matched
34. IR representation models
• Statistical models – Vector space retrieval model
• Similarity measure – inner product example
  Binary (size of vector = size of vocabulary = 7; 0 means the corresponding term was not found in the document or query):
    D = (1, 1, 1, 0, 1, 1, 0)
    Q = (1, 0, 1, 0, 0, 1, 1)
    sim(D, Q) = 3
  Weighted:
    D1 = 2T1 + 3T2 + 5T3
    D2 = 3T1 + 7T2 + 1T3
    Q = 0T1 + 0T2 + 2T3
    sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
    sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
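A minimal sketch of both variants of the inner product from this slide:

    def dot(u, v):
        # Inner product; for 0/1 vectors this counts matched query terms
        return sum(a * b for a, b in zip(u, v))

    # Binary example
    D = (1, 1, 1, 0, 1, 1, 0)
    Q = (1, 0, 1, 0, 0, 1, 1)
    print(dot(D, Q))                 # 3

    # Weighted example
    D1, D2, Qw = (2, 3, 5), (3, 7, 1), (0, 0, 2)
    print(dot(D1, Qw), dot(D2, Qw))  # 10 2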
35. IR representation models
• Statistical models – Vector space retrieval model
• Similarity measure – cosine
[Figure: D1, D2 and Q drawn as vectors in the term space (t1, t2, t3); Θ1 is the angle between D1 and Q, Θ2 the angle between D2 and Q]
    CosSim(d_j, q) = (d_j • q) / (|d_j| · |q|) = Σ_i w_ij·w_iq / √(Σ_i w_ij² · Σ_i w_iq²)
    D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)·(0+0+4)) = 0.81
    D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)·(0+0+4)) = 0.13
    Q = 0T1 + 0T2 + 2T3
  D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
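And the cosine measure on the same weighted vectors, a sketch reproducing the slide's values:

    from math import sqrt

    def cos_sim(u, v):
        # Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)
        dot = sum(a * b for a, b in zip(u, v))
        return dot / sqrt(sum(a * a for a in u) * sum(b * b for b in v))

    D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
    print(round(cos_sim(D1, Q), 2))   # 0.81
    print(round(cos_sim(D2, Q), 2))   # 0.13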
36. IR representation models
• Statistical models – Vector space retrieval model properties
  – Simple, mathematically based approach
  – Considers both local (tf) and global (idf) word occurrence frequencies
  – Provides partial matching and ranked results
  – Tends to work quite well in practice despite obvious weaknesses
  – Allows efficient implementation for large document collections
37. IR representation models
• Statistical models – Vector space retrieval model problems
  – Missing semantic information (e.g., word sense)
  – Missing syntactic information (e.g., phrase structure, word order, proximity information)
  – Assumes term independence (e.g., ignores synonymy) and that term dimensions are orthogonal
  – Does not consider hot topics and documents
  – Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
    • Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently
38. Knowledge Base
• A knowledge base is a semantic network with a set of concepts representing groups of words/expressions, and a set of links connecting the concepts, representing semantic relations
• Used in:
  – Indexing normalization
  – Query rewriting
[Figure: an example semantic network. Concepts (synonym sets) include Government, State, Legislature, Executive, Congress, Senate, Difficulty, Crisis = {Emergency, Pinch, Exigency}, {Ease, relief}, {Aid, help}, Loan, Bailout, and {US president, Chief executive} with the instance {Bush, George W Bush, George Walker Bush}. Links represent hyponym/hypernym relations, meronym/holonym relations, and antonymy (×).]
40. Similarity and ranking
• Ranking is related to:
  – Similarity score
  – Popularity of the requested documents
  – Popularity of query topics
  – External parameters (events)
  – User preferences
41. Relevance Feedback
• Relevance is a subjective judgment and may include:
  – Being on the proper subject
  – Being timely (recent information)
  – Being authoritative (from a trusted source)
  – Satisfying the goals of the user and his/her intended use of the information (information need)
42. Relevance Feedback
• Query modification
• Document modification
  – To allow other users to benefit from previous feedback
43. Relevance Feedback
• Query modification
  – Terms occurring in documents previously identified as relevant are added to the original query, or the weight of such terms is increased
  – Terms occurring in documents previously identified as irrelevant are deleted from the original query, or the weight of such terms is reduced

      Q_{i+1} = Q_i + (1/r) · Σ_{k=1..r} d_rel^(k) − (1/n) · Σ_{k=1..n} d_irrel^(k)

    where r is the number of relevant documents, n the number of irrelevant documents, and d_rel^(k), d_irrel^(k) their vectors
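A minimal sketch of this Rocchio-style query modification on toy 3-term vectors, following the running D1/D2/Q example (treating D1 as relevant and D2 as irrelevant is an illustrative assumption):

    def modify_query(query, relevant, irrelevant):
        # New query = old query + mean of relevant vectors - mean of irrelevant vectors
        new_q = list(query)
        for i in range(len(query)):
            if relevant:
                new_q[i] += sum(d[i] for d in relevant) / len(relevant)
            if irrelevant:
                new_q[i] -= sum(d[i] for d in irrelevant) / len(irrelevant)
        return new_q

    q = [0.0, 0.0, 2.0]
    print(modify_query(q, relevant=[[2, 3, 5]], irrelevant=[[3, 7, 1]]))
    # [-1.0, -4.0, 6.0]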
44. Relevance Feedback
• Document modification
  – Terms in the query are added to the document index list with an initial weight
  – Weights of index terms that are in both the query and relevant documents are increased
  – Weights of index terms not in the query but in relevant documents are decreased
45. Performance measurement
• What can be measured that reflects users' ability to use the system?
  – Coverage of information
  – Form of presentation effectiveness
  – Effort required / ease of use
  – Time and space efficiency
  – Recall: the proportion of relevant material actually retrieved
  – Precision: the proportion of retrieved material actually relevant
46. Performance measurement
• In what ways can a document be relevant to a query?
  – Answer a precise question precisely
  – Partially answer a question
  – Suggest a source for more information
  – Give background information
  – Remind the user of other knowledge
  – Others...
47. Performance measurement
• Standard IR evaluation
[Figure: the retrieved documents shown as a subset of the whole collection]
• Precision = (# relevant retrieved) / (# retrieved)
• Recall = (# relevant retrieved) / (# relevant in collection)
48. Performance measurement
• Standard IR evaluation
[Figure: the retrieved documents shown as a subset of the whole collection]
• Noise = (# irrelevant retrieved) / (# retrieved)
• Silence = (# relevant not retrieved) / (# relevant in collection)
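A minimal sketch computing all four measures from sets of document ids (the ids and judgments are illustrative):

    retrieved = {1, 2, 3, 4, 5}
    relevant = {2, 3, 7, 9}          # relevant documents in the collection

    rel_retrieved = retrieved & relevant
    precision = len(rel_retrieved) / len(retrieved)        # 2/5 = 0.4
    recall = len(rel_retrieved) / len(relevant)            # 2/4 = 0.5
    noise = len(retrieved - relevant) / len(retrieved)     # 3/5 = 0.6 (= 1 - precision)
    silence = len(relevant - retrieved) / len(relevant)    # 2/4 = 0.5 (= 1 - recall)
    print(precision, recall, noise, silence)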
49. Performance measurement
• Standard IR evaluation – P/R curves
  – There is a tradeoff between precision and recall
  – So measure precision at different levels of recall
[Figure: two precision-recall curves A and B; B's curve lies above A's, so B performs better]
50. Performance measurement
• Standard IR evaluation – P/R curves
  – It is difficult to determine which of these two hypothetical results is better
[Figure: two hypothetical precision-recall curves that cross, making the comparison ambiguous]
51. Performance measurement
• Document cutoff levels
  – Another way to evaluate: fix the number of documents retrieved at several levels (top 5, top 10, top 20, top 50, top 100, top 500), measure precision at each of these levels, and take a (weighted) average over the results
  – This is a way to focus on high precision
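A sketch of precision at fixed cutoff levels over a ranked result list (the ranking and relevance judgments are illustrative):

    ranked = [3, 8, 7, 1, 9, 4, 2, 6, 5, 10]   # document ids, best first
    relevant = {3, 7, 9, 10}

    def precision_at(k, ranked, relevant):
        # Precision over the top-k retrieved documents
        return sum(1 for d in ranked[:k] if d in relevant) / k

    for k in (5, 10):
        print(k, precision_at(k, ranked, relevant))
    # 5 0.6   (documents 3, 7 and 9 are relevant in the top 5)
    # 10 0.4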
52. Performance measurement
• F-measure
  – A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score:
      F = 2 · P · R / (P + R)
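Continuing the sketch, the balanced F-score for the precision/recall values computed above:

    def f_measure(precision, recall):
        # Harmonic mean of precision and recall (balanced F-score)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(round(f_measure(0.4, 0.5), 3))   # 0.444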
53. Web Search System
[Diagram: a spider crawls the Web to build the document corpus; the IR system matches the user's query string against the corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, …)]
54. Web search engines vs. IRS
• Differences:
  – Must assemble the document corpus by spidering the web
  – Can exploit the structural layout information in HTML (XML)
  – Documents are distributed
  – Documents change uncontrollably
  – Can exploit the link structure of the web
55. Web search engines – crawling
• Spider, wanderer, crawler, walker, ant
  – "A computer program that browses the World Wide Web in a methodical, automated manner." (Wikipedia)
• Utilities:
  – Supports a search engine
  – Gathers pages from the Web
    • Starting from one page, determine which page(s) to go to next
  – Builds an index
• Objects:
  – Text, video, images and so on
  – Link structure
56. Web search engines – crawling
• Current crawlers:
  – GoogleBot
  – Nutch
  – Big Brother
  – Calif
  – CoolBot
  – Download Express
  – Get URL
  – …
57. Web search engines – crawling
• How to parse web pages?
[Diagram: a URL is downloaded over HTTP; a hyperlink parser extracts hyperlinks (video URLs, image URLs, HTML URLs) into a URL buffer; a diff operator (A − B) against the set B of already visited pages yields the new URLs/URIs to crawl]
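A minimal sketch of this crawl loop using only the Python standard library (the seed URL is a placeholder; a real crawler would also handle errors, content types, and the politeness and robots.txt policies discussed on the following slides):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # Hyperlink parser: collects href values from <a> tags
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    frontier = deque(["http://example.com/"])   # URL buffer, seeded with one page
    visited = set()                             # already visited pages (the "B" set)

    while frontier and len(visited) < 10:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        parser = LinkExtractor()
        parser.feed(html)
        # Diff operator: only enqueue links not already visited
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)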
59. Web search engines – crawling
• Policies
  – Selection policy
    • PageRank
    • Path ascending
    • Focused crawling
  – Re-visit policy
    • Freshness
    • Age
  – Politeness
    • So that crawlers don't overload web servers
    • Set a delay between GET requests
    • Which pages can be crawled, and which cannot
60. Web search engines – crawling
• Policies – robot rules, by Isaac Asimov in 1942:
  1. A robot may not injure a human being, or, through inaction, allow a human being to come to harm
  2. A robot must obey orders given to it by human beings, except where such orders would conflict with the First Law
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law
http://www.asimovonline.com/
61. Web search engines – crawling
• Policies – web robot rules
  – A web robot must show identification
  – A web robot must not control resources
  – A web robot must report errors
  – A web robot must obey the exclusion standard ("/robots.txt" or the META tags)

  Directive syntax:
    User-Agent: {value}    (robot name; the value is a name or "*")
    {Option}: {value}      (e.g., Disallow; the value is a URL, "/", etc.)
62. Web search engines – crawling
• Policies – web robot rules
  – A web robot must obey the exclusion standard ("/robots.txt" or the META tags)
  – Example 1: deny access to all robots

      User-agent: *
      Disallow: /
63. Web search engines – crawling
• Policies – web robot rules
  – A web robot must obey the exclusion standard ("/robots.txt" or the META tags)
  – Example 2: deny access for all robots to the given folder and file, except GoogleBot

      User-agent: *
      Disallow: /u-bourgogne/LE2I/
      Disallow: /MR.html

      User-agent: GoogleBot
      Disallow:
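A sketch of checking these Example 2 rules with Python's standard urllib.robotparser module (the robot name "SomeBot" and the test paths other than the slide's are illustrative):

    from urllib.robotparser import RobotFileParser

    rules = """\
    User-agent: *
    Disallow: /u-bourgogne/LE2I/
    Disallow: /MR.html

    User-agent: GoogleBot
    Disallow:
    """

    rp = RobotFileParser()
    rp.parse(rules.splitlines())

    print(rp.can_fetch("SomeBot", "/u-bourgogne/LE2I/page.html"))  # False
    print(rp.can_fetch("SomeBot", "/index.html"))                  # True
    print(rp.can_fetch("GoogleBot", "/MR.html"))                   # True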
64. Web search engines – crawling
• Policies – web robot rules
  – A web robot must obey exclusion directives given in the HTML META tag, e.g. one of:

      <meta name="robots" content="index,follow">
      <meta name="robots" content="noindex,follow">
      <meta name="robots" content="index,nofollow">
      <meta name="robots" content="noindex,nofollow">
65. Web search engines – crawling
• Policies
  – Parallelization
    • Distributed web crawling
    • To maximize the download rate
  – Robustness: spider traps
    • Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
    • Pages filled with a large number of characters
66. Web search engines – crawling
• Policies – a good robot must:
  – Get the header (when possible)
  – Crawl during convenient periods
  – Avoid loops
  – Download moderately
    • Adopt rotation-based requests among servers
    • Delay requests to the same server
  – Ignore forms
67. Intelligent IRS
• Taking into account the meaning of the words used
• Taking into account the order of words in the query
• Taking into account the authority of the source
• Adapting to the user based on direct or indirect feedback