LABORATOIRE D'INFORMATIQUE DE L'UNIVERSITE DE PAU ET DES PAYS DE L'ADOUR

Information Retrieval

Richard Chbeir, Ph.D.
http://liuppa.univ-pau.fr
  
Where I am coming from …

Some photos of Anglet/Biarritz
Definition

•  Information Retrieval
–  …
Information Retrieval Process

[Diagram: the user formulates a query; the query and the documents each go through processing (indexing), producing a query representation and document representations, both supported by a Knowledge Base. Retrieval then performs a similarity comparison between the two representations and returns the retrieved documents, which feed relevance evaluation and feedback back into the process.]
Query/Document processing

•  Strip unwanted characters/markup (e.g., HTML tags, punctuation, numbers, etc.)
•  Break into tokens (keywords) on whitespace
•  Remove common stopwords (using a KB or a list)
–  e.g., a, the, it, www, what, when, from, etc.
•  Normalize keywords (using a KB)
–  Using one index for synonyms
–  Stem tokens to "root" words (using a KB)
•  computational → comput
•  Detect common phrases (using a KB)
•  Build the corresponding representation(s), as sketched below
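A minimal sketch of this pipeline in Python; the stopword list and the suffix-stripping stemmer are illustrative stand-ins for a real knowledge base and a Porter-style stemmer:

import re

STOPWORDS = {"a", "the", "it", "is", "www", "what", "when", "from"}  # illustrative list

def crude_stem(token):
    # Stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ational", "ation", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML markup
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # strip punctuation and numbers
    tokens = text.lower().split()               # break into tokens on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

print(preprocess("<p>Computational retrieval, from 2013!</p>"))
# ['comput', 'retrieval'] -- 'from' is removed as a stopword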
Example

Identifying important blocks

Eliminating stop words

Identifying Concepts:
Université Bourgogne, Institut, Professeur

Result: (Université, Bourgogne, Professeur, Institut)
Query/Document processing

•  Inverted index
An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents

•  Given D0 = "it is what it is", D1 = "what is it" and D2 = "it is a banana", the related record-level inverted index (or inverted file index, or just inverted file) is

"a":       {2}
"banana":  {2}
"is":      {0, 1, 2}
"it":      {0, 1, 2}
"what":    {0, 1}
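A compact sketch of building this record-level index in Python (tokenization here is a bare split, just enough for the example):

from collections import defaultdict

docs = ["it is what it is", "what is it", "it is a banana"]

index = defaultdict(set)
for doc_id, text in enumerate(docs):
    for token in text.split():       # record level: keep document ids only
        index[token].add(doc_id)

for term in sorted(index):
    print(term, sorted(index[term]))
# a [2]   banana [2]   is [0, 1, 2]   it [0, 1, 2]   what [0, 1]

A conjunctive query is then a set intersection, e.g. index["what"] & index["is"] gives {0, 1}.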
But also….

Result: (Université, Bourgogne, Professeur, Institut)

And each term has an importance given by its weight
Query/Document processing

•  Inverted index
An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents

•  Given D0 = "it is what it is", D1 = "what is it" and D2 = "it is a banana", the related word-level inverted index (or full inverted index, or inverted list) is

"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
"what":   {(0, 2), (1, 0)}
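The same sketch extended to the word level, where each posting is a (document, position) pair:

from collections import defaultdict

docs = ["it is what it is", "what is it", "it is a banana"]

index = defaultdict(list)
for doc_id, text in enumerate(docs):
    for pos, token in enumerate(text.split()):   # word level: keep positions too
        index[token].append((doc_id, pos))

print(index["is"])   # [(0, 1), (0, 4), (1, 1), (2, 1)]

Keeping positions is what makes phrase and proximity operators (such as ADJ and NEAR below) answerable from the index alone.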
IR representation models

•  Boolean retrieval model
•  Statistical models
–  Vector space retrieval model
–  Probabilistic retrieval models
IR representation models

•  Boolean retrieval model
•  A document is represented as a set of keywords
–  t1, t2, t3, …, tn
•  Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope
–  [[t1 & t2] | [t3 & t4]] & t5 & !t6
•  Output: a document is either relevant or not; no partial matches or ranking (a sketch follows)
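Boolean retrieval then reduces to set operations over the posting sets of an inverted index; a sketch for the query (what AND is) NOT banana over the three example documents above:

index = {
    "a": {2}, "banana": {2},
    "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1},
}

# (what AND is) NOT banana
result = (index["what"] & index["is"]) - index["banana"]
print(result)   # {0, 1}: each document either matches or it doesn't, with no ranking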
IR representation models

•  Boolean retrieval model
•  Popular retrieval model
–  Easy to understand for simple queries
–  Clean formalism
•  Boolean models can be extended to include ranking
–  t1 ADJ t2: means that the two terms must occur adjacently in the retrieved documents
–  t1 NEAR t2: means that the two terms must occur in a common sentence
•  Reasonably efficient implementations possible for normal queries
IR representation models

•  Boolean retrieval model - problems
•  Very rigid: AND means all; OR means any
•  Difficult to express complex user requests
•  Difficult to control the number of documents retrieved
–  All matched documents will be returned
•  Difficult to rank output
–  All matched documents logically satisfy the query
•  Difficult to perform relevance feedback
–  If a document is identified by the user as relevant or irrelevant, how should the query be modified?
Page 21
IR	
  representa9on	
  models	
  
•  Sta9s9cal	
  models	
  
•  A document is typically represented by a bag of words
(unordered words with frequencies)
•  Bag = set that allows multiple occurrences of the same
element
•  User specifies a set of desired terms with optional weights:
–  Weighted query terms:
Q = < database 0.5; text 0.8; information 0.2 >
–  Unweighted query terms:
Q = < database; text; information >
–  No Boolean conditions specified in the query	
  

10/12/13
Page 22
IR	
  representa9on	
  models	
  
•  Sta9s9cal	
  models	
  
•  Support “natural language” queries
•  Treat documents and queries the same
•  Retrieval based on similarity between query and
documents
•  Output documents are ranked according to similarity
to query
•  A similarity threshold can be defined
•  Similarity based on occurrence frequencies of
keywords in query and document
•  Automatic relevance feedback can be supported:
–  Relevant documents “added” to query
–  Irrelevant documents “subtracted” from query
10/12/13
Page 23
IR	
  representa9on	
  models	
  
•  Sta9s9cal	
  models	
  
–  Vector	
  space	
  retrieval	
  model	
  
•  Assume t distinct terms (index terms or vocabulary) remain after
preprocessing
•  These “orthogonal” terms form a vector space
Dimension = t = |vocabulary|
•  Each term, i, in a document or query, j, is given a real-valued
weight, wij.
•  Both documents and queries are expressed as t-dimensional
vectors:

dj = (w1j, w2j, …, wtj)

10/12/13
Page 24
IR representation models

•  Statistical models
–  Vector space retrieval model

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q  = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q drawn as vectors in the three-dimensional term space (T1, T2, T3)]

•  Is D1 or D2 more similar to Q?
•  How to measure the degree of similarity? Distance? Angle? Projection?
IR representation models

•  Statistical models
–  Vector space retrieval model
•  Document collection
•  A collection of n documents can be represented in the vector space model by a term-document matrix
•  An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document

      T1    T2    …   Tt
D1    w11   w21   …   wt1
D2    w12   w22   …   wt2
:     :     :         :
Dn    w1n   w2n   …   wtn
IR representation models

•  Statistical models
–  Vector space retrieval model
•  Term Weights: Term Frequency
•  More frequent terms in a document are more important, i.e. more indicative of the topic

fij = frequency of term i in document j

•  May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:

tfij = fij / maxi{fij}
IR representation models

•  Statistical models
–  Vector space retrieval model
•  Term Weights: Inverse Document Frequency
–  Terms that appear in many different documents are less indicative of the overall topic

dfi = document frequency of term i = number of documents containing term i
idfi = inverse document frequency of term i = log2(N / dfi)
(N: total number of documents)
IR representation models

•  Statistical models
–  Vector space retrieval model
•  TF-IDF Weighting
•  A typical combined term importance indicator is tf-idf weighting:

wij = tfij · idfi = tfij · log2(N / dfi)

•  A term occurring frequently in the document but rarely in the rest of the collection is given high weight
•  Many other ways of determining term weights have been proposed
•  Experimentally, tf-idf has been found to work well
IR representation models

•  Statistical models
–  Vector space retrieval model
•  TF-IDF Weighting - Example
Given a document containing terms with the given frequencies:
A(3), B(2), C(1)
Assume the collection contains 10,000 documents and the document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
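The same arithmetic as a runnable check (tf is normalized by the document's most frequent term, here frequency 3):

import math

N = 10_000
freqs = {"A": 3, "B": 2, "C": 1}        # term frequencies in the document
dfs = {"A": 50, "B": 1300, "C": 250}    # document frequencies in the collection

max_f = max(freqs.values())
for term in freqs:
    tf = freqs[term] / max_f
    idf = math.log2(N / dfs[term])
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
# A 1.0 7.6 7.6
# B 0.67 2.9 2.0
# C 0.33 5.3 1.8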
IR representation models

•  Statistical models
–  Vector space retrieval model
•  Queries
•  Treated as a document and also tf-idf weighted
•  An alternative is for the user to supply weights for the given query terms
IR representation models

•  Statistical models
–  Vector space retrieval model
•  Similarity measure – Inner product
•  Similarity between the vectors for document dj and query q can be computed as the vector inner product (a.k.a. dot product):

sim(dj, q) = dj • q = Σ(i=1..t) wij · wiq

where wij is the weight of term i in document j and wiq is the weight of term i in the query
•  For binary vectors, the inner product is the number of matched query terms in the document (size of intersection)
•  For weighted term vectors, it is the sum of the products of the weights of the matched terms
IR representation models

•  Statistical models
–  Vector space retrieval model
•  Similarity measure – Inner product properties
•  Unbounded
•  Favors long documents with a large number of unique terms
•  Measures how many terms matched but not how many terms are not matched
IR representation models

•  Statistical models
–  Vector space retrieval model
•  Similarity measure – Inner product example

Binary (size of vector = size of vocabulary = 7; 0 means the corresponding term is not found in the document or query):
–  D = (1, 1, 1, 0, 1, 1, 0)
–  Q = (1, 0, 1, 0, 0, 1, 1)
sim(D, Q) = 3

Weighted:
D1 = 2T1 + 3T2 + 5T3        D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3

sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
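A one-line sketch of the weighted inner product applied to these vectors:

def inner_product(d, q):
    # Sum of products of matched term weights; unbounded, favors long documents
    return sum(wd * wq for wd, wq in zip(d, q))

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(inner_product(D1, Q), inner_product(D2, Q))   # 10 2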
  
IR representation models

•  Statistical models
–  Vector space retrieval model
•  Similarity measure – Cosine

CosSim(dj, q) = (dj • q) / (|dj| · |q|)

[Figure: D1, D2 and Q in term space (t1, t2, t3); Θ1 and Θ2 are the angles between Q and D1, D2 respectively]

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q  = 0T1 + 0T2 + 2T3

D1 is 6 times better than D2 using cosine similarity but only 5 times better using the inner product.
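And the cosine variant, which normalizes away vector length and reproduces the numbers above:

import math

def cosine_sim(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(round(cosine_sim(D1, Q), 2), round(cosine_sim(D2, Q), 2))   # 0.81 0.13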
IR representation models

•  Statistical models
–  Vector space retrieval model properties
•  Simple, mathematically based approach
•  Considers both local (tf) and global (idf) word occurrence frequencies
•  Provides partial matching and ranked results
•  Tends to work quite well in practice despite obvious weaknesses
•  Allows efficient implementation for large document collections
IR representation models

•  Statistical models
–  Vector space retrieval model problems
•  Missing semantic information (e.g., word sense)
•  Missing syntactic information (e.g., phrase structure, word order, proximity information)
•  Assumption of term independence (e.g., ignores synonymy) and of orthogonal term dimensions
•  Missing consideration of hot topics and documents
•  Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
–  Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently
Knowledge Base

•  is a semantic network with a set of concepts representing groups of words/expressions, and a set of links connecting the concepts, representing semantic relations
•  Used in
–  Indexing normalization
–  Query rewriting

[Figure: an example semantic network. Concepts (synonym sets) include Government, State, Legislature, Executive, Congress, Senate, {US president, Chief executive}, {Bush, George W Bush, George Walker Bush}, Difficulty, Crisis, {Emergency, Pinch, Exigency}, {Ease, relief}, {Aid, help}, Loan, and Bailout, connected by hyponym/hypernym relations (following direction), meronym/holonym relations (following direction), and antonymy (marked ×)]
Similarity and ranking

•  Range queries
•  KNN
Similarity and ranking

•  Ranking is related to:
–  Similarity score
–  Popularity of the requested documents
–  Popularity of query topics
–  External parameters (events)
–  User preferences
Relevance Feedback

•  Relevance is a subjective judgment and may include:
–  Being on the proper subject
–  Being timely (recent information)
–  Being authoritative (from a trusted source)
–  Satisfying the goals of the user and his/her intended use of the information (information need)
Relevance Feedback

•  Query modification
•  Document modification
–  To allow other users to benefit from previous feedback
Relevance Feedback

•  Query modification
–  Terms occurring in documents previously identified as relevant are added to the original query, or the weight of such terms is increased
–  Terms occurring in documents previously identified as irrelevant are deleted from the original query, or the weight of such terms is reduced

Q_{i+1} = Q_i + (1/r) Σ_{i=1}^{r} d_rel_i − (1/n) Σ_{i=1}^{n} d_irrel_i

where r and n are the numbers of documents identified as relevant and irrelevant, respectively
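A sketch of this update rule (a Rocchio-style reformulation with all coefficients set to 1, matching the formula above; queries and documents are equal-length weight vectors over the vocabulary):

def update_query(query, relevant, irrelevant):
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(ws) / len(vectors) for ws in zip(*vectors)]
    rel, irr = centroid(relevant), centroid(irrelevant)
    return [q + r - i for q, r, i in zip(query, rel, irr)]

q1 = update_query([0.0, 0.0, 2.0], relevant=[[2, 3, 5]], irrelevant=[[3, 7, 1]])
print(q1)   # [-1.0, -4.0, 6.0]: weights of terms from irrelevant documents drop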
Relevance Feedback

•  Document modification
–  Terms in the query are added to the document index list with an initial weight
–  Weights of index terms in both the query and relevant documents are increased
–  Weights of index terms not in the query in relevant documents are decreased
Performance measurement

•  What can be measured that reflects users' ability to use the system?
–  Coverage of information
–  Form of presentation
–  Effort required/ease of use
–  Time and space efficiency
–  Recall
•  proportion of relevant material actually retrieved
–  Precision
•  proportion of retrieved material actually relevant
(recall and precision together measure effectiveness)
Performance measurement

•  In what ways can a document be relevant to a query?
–  Answer a precise question precisely
–  Partially answer a question
–  Suggest a source for more information
–  Give background information
–  Remind the user of other knowledge
–  Others ...
Performance measurement

•  Standard IR Evaluation
•  Precision = # relevant retrieved / # retrieved
•  Recall = # relevant retrieved / # relevant in collection

[Figure: Venn diagram of the retrieved documents as a subset of the collection]
Performance measurement

•  Standard IR Evaluation
•  Noise = # irrelevant retrieved / # retrieved
•  Silence = # relevant not retrieved / # relevant in collection

[Figure: the same Venn diagram of retrieved documents within the collection]
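These four ratios in one small sketch (note that noise = 1 - precision and silence = 1 - recall):

def evaluate(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    return {
        "precision": len(hits) / len(retrieved),
        "recall": len(hits) / len(relevant),
        "noise": len(retrieved - relevant) / len(retrieved),
        "silence": len(relevant - retrieved) / len(relevant),
    }

print(evaluate(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7]))
# {'precision': 0.5, 'recall': 0.4, 'noise': 0.5, 'silence': 0.6}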
Performance measurement

•  Standard IR Evaluation
–  P/R curves
•  There is a tradeoff between Precision and Recall
•  So measure Precision at different levels of Recall

[Figure: precision/recall curves for two systems A and B; B's curve lies above A's, so B performs better]
Performance measurement

•  Standard IR Evaluation
–  P/R curves
•  Difficult to determine which of these two hypothetical results is better

[Figure: precision/recall points for two hypothetical result sets]
Performance measurement

•  Document Cutoff Levels

Another way to evaluate:
–  Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500
–  Measure precision at each of these levels
–  Take a (weighted) average over the results

This is a way to focus on high precision
Performance measurement

•  F-measure
–  A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score:

F = 2 · (precision · recall) / (precision + recall)
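Continuing the evaluation sketch above:

def f_measure(precision, recall):
    # Harmonic mean of precision and recall (balanced F-score)
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.5, 0.4), 2))   # 0.44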
Web Search System

[Diagram: a Spider crawls the Web into a Document corpus; the IR System takes a Query String and returns Ranked Documents (1. Page1, 2. Page2, 3. Page3, …)]
Web search engines vs. IRS

•  Differences:
–  Must assemble the document corpus by spidering the web
–  Can exploit the structural layout information in HTML (XML)
–  Documents are distributed
–  Documents change uncontrollably
–  Can exploit the link structure of the web
Web search engines crawling

•  Spider, wanderer, crawler, walker, ant
–  A computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)

•  Utilities:
–  Supports a search engine
–  Gathers pages from the Web
•  Starting from one page, determine which page(s) to go to next
–  Builds an index

•  Object:
–  Text, video, image and so on
–  Link structure
Web search engines crawling

•  Current crawlers
–  GoogleBot
–  Nutch
–  Big Brother
–  Calif
–  CoolBot
–  Download Express
–  Get URL
–  …
Web search engines crawling

•  How to parse web pages?

[Diagram: download a URL over HTTP; a hyperlink parser performs hyperlink extraction (video, image and HTML URLs); new URLs/URIs go into a URL buffer after a diff operator removes the already visited pages (A − B)]
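A toy version of this loop in Python; the standard-library HTMLParser stands in for the hyperlink parser, and a visited set plays the role of the diff against already seen pages:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed, max_pages=10):
    frontier, visited = deque([seed]), set()    # URL buffer and visited pages
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                            # skip pages that fail to download
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in visited:         # the diff operator: new URLs only
                frontier.append(absolute)
    return visited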
Web search engines crawling

•  Policies
–  Selection policy
–  Re-visit policy
–  Politeness policy
–  Parallelization policy
–  Robustness
Web search engines crawling

•  Policies
–  Selection policy
•  Pageranks
•  Path ascending
•  Focused crawling
–  Re-visit policy
•  Freshness
•  Age
–  Politeness
•  So that crawlers don't overload web servers
•  Set a delay between GET requests
•  Decide which pages can be crawled, and which cannot
Web search engines crawling

•  Policies
–  Robot rules by Isaac Asimov in 1942
1.  A robot may not injure a human being, or through inaction, allow a human being to come to harm
2.  A robot must obey orders given to it by human beings except where such orders would conflict with the First Law
3.  A robot must protect its own existence as long as such protection does not conflict with the First or Second Law

http://www.asimovonline.com/
Web search engines crawling

•  Policies
–  Web Robot rules
•  A web robot must show identifications
•  A web robot must not control resources
•  A web robot must report errors
•  A web robot must obey the exclusion standard ("/robots.txt" or the META tags)

User-Agent: {value}    Option {Value}
where User-Agent takes a robot name or *, and Option is Disallow, etc., with a URL, /, etc. as its value
Web search engines crawling

•  Policies
–  Web Robot rules
•  A web robot must obey the exclusion standard ("/robots.txt" or the META tags)
–  Example 1: deny access to all robots

User-Agent: *
Disallow: /
Web search engines crawling

•  Policies
–  Web Robot rules
•  A web robot must obey the exclusion standard ("/robots.txt" or the META tags)
–  Example 2: deny all robots access to the given folder and file, except GoogleBot

User-agent: *
Disallow: /u-bourgogne/LE2I/
Disallow: /MR.html

User-agent: GoogleBot
Disallow:
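Checking such rules from the crawler side can be done with Python's standard urllib.robotparser; the host below is only a placeholder:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # placeholder URL
rp.read()                                     # fetches and parses the rules

# Assuming the robots.txt of Example 2 above:
print(rp.can_fetch("GoogleBot", "http://example.com/MR.html"))   # True: GoogleBot is exempt
print(rp.can_fetch("*", "http://example.com/MR.html"))           # False for everyone else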
Web search engines crawling

•  Policies
–  Web Robot rules
•  A web robot must obey exclusion in the html tag META

<meta name="robots1" content="index,follow">
<meta name="robots2" content="noindex,follow">
<meta name="robots3" content="index,nofollow">
<meta name="robots4" content="noindex,nofollow">
Web search engines crawling

•  Policies
–  Parallelization
•  Distributed web crawling
•  To maximize the download rate

–  Robustness: spider traps
•  Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
•  Pages filled with a large number of characters
Web search engines crawling

•  Policies
–  A good robot must
•  Get the header (when possible)
•  Crawl during convenient periods
•  Avoid loops
•  Download moderately
–  Adopt rotation-based requests among servers
–  Delay requests on the same server
•  Ignore forms
Intelligent IRS

•  Taking into account the meaning of the words used
•  Taking into account the order of words in the query
•  Taking into account the authority of the source
•  Adapting to the user based on direct or indirect feedback
Adaptive IRS…
But…

•  What happens if you are looking for something like …
But…

•  With OS, dedicated multimedia functionalities are limited …
–  Manual annotation and categorization
–  Textual retrieval
–  …
But…

•  With DBMS, dedicated multimedia functionalities are limited …

Surveillance Camera

ID   A-Date     Photo
1    1/1/2006   [photo]
2    1/5/2007   [photo]
3    2/2/2007   [photo]
4    4/1/2007   [photo]

Create Table SURVEILLANCE (
  ID integer,
  A-Date date,
  Photo BLOB);

Select * from SURVEILLANCE
Where Photo = … ;
But…

•  With Image Processing …
What do you think?
Thank you
Richard.chbeir@univ-pau.fr
