Search engines
1. Web search engines
- Rooted in Information Retrieval (IR) systems
  - Prepare a keyword index for the corpus
  - Respond to keyword queries with a ranked list of documents
- ARCHIE
  - Earliest application of rudimentary IR systems to the Internet
  - Title search across sites serving files over FTP
2. Boolean queries: Examples
- Simple queries involving relationships between terms and documents
  - Documents containing the word Java
  - Documents containing the word Java but not the word coffee
- Proximity queries
  - Documents containing the phrase Java beans or the term API
  - Documents where Java and island occur in the same sentence
Mining the Web Chakrabarti and Ramakrishnan
3. Document preprocessing
- Tokenization
  - Filtering away tags
  - Tokens regarded as nonempty sequences of characters excluding spaces and punctuation
  - Each token represented by a suitable integer, tid, typically 32 bits
  - Optional: stemming/conflation of words
  - Result: document (did) transformed into a sequence of (tid, pos) pairs
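The preprocessing step above can be sketched in Python. This is a minimal illustration, not the book's implementation; the tag-stripping regex and the `lexicon` dict (mapping each distinct token to a tid) are assumptions for the example.

```python
import re

def tokenize(text, lexicon):
    """Map raw text to a sequence of (tid, pos) pairs.

    lexicon: dict assigning each distinct token a small integer id (tid).
    Tags are filtered away; tokens are maximal runs of letters/digits.
    """
    text = re.sub(r"<[^>]*>", " ", text).lower()  # filter away tags
    out = []
    for pos, tok in enumerate(re.findall(r"[a-z0-9]+", text)):
        tid = lexicon.setdefault(tok, len(lexicon))  # assign next free tid
        out.append((tid, pos))
    return out

lexicon = {}
seq = tokenize("<p>Java is fun. Java runs anywhere.</p>", lexicon)
```

Note how the same tid is reused wherever the token recurs, so the document becomes a compact integer sequence.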
4. Storing tokens
- Straightforward implementation using a relational database
  - Example figure
  - Space blows up to almost 10 times
- Accesses to the table show a common pattern
  - Reduce storage by mapping each tid to a lexicographically sorted buffer of (did, pos) tuples
- Indexing = transposing the document-term matrix
5. Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as "document/position") may be implemented using a B-tree or a hash table.
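The transposition from documents to postings can be sketched as follows; this is an in-memory toy (real indexes live on disk behind a B-tree or hash table, as the figure notes), and the whitespace tokenizer is an assumption.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Transpose {did: text} into term -> sorted list of (did, pos) postings."""
    index = defaultdict(list)
    for did, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].append((did, pos))
    for postings in index.values():
        postings.sort()  # lexicographic (did, pos) order
    return dict(index)

docs = {1: "java island coffee", 2: "java api"}
idx = build_inverted_index(docs)
```

The sorted (did, pos) buffers per term are exactly the "document/position" mapping of the figure.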
6. Storage
- For dynamic corpora
  - Berkeley DB2 storage manager
  - Can frequently add, modify and delete documents
- For static collections
  - Index compression techniques (to be discussed)
7. Stopwords
- Function words and connectives
- Appear in a large number of documents and are of little use in pinpointing documents
- Indexing stopwords
  - Stopwords not indexed
    - Reduces index space and improves performance
  - Replace stopwords with a placeholder (to remember the offset)
- Issues
  - Queries containing only stopwords are ruled out
  - Polysemous words that are stopwords in one sense but not in others
    - E.g., can as a verb vs. can as a noun
8. Stemming
- Conflating words to help match a query term with a morphological variant in the corpus
- Remove inflections that convey part of speech, tense and number
- E.g.: university and universal both stem to universe
- Techniques
  - Morphological analysis (e.g., Porter's algorithm)
  - Dictionary lookup (e.g., WordNet)
- Stemming may increase recall, but at the price of precision
  - Abbreviations, polysemy and names coined in the technical and commercial sectors
  - E.g.: stemming "ides" to "IDE", "SOCKS" to "sock", "gated" to "gate" may be bad!
9. Batch indexing and updates
- Incremental indexing
  - Time-consuming due to random disk I/O
  - High level of disk block fragmentation
- Simple sort-merges
  - To replace the indexed update of variable-length postings
- For a dynamic collection
  - A single document-level change may need to update hundreds to thousands of records
  - Solution: create an additional "stop-press" index
11. Stop-press index
- Collection of documents in flux
  - Model document modification as deletion followed by insertion
  - Documents in flux represented by a signed record (d, t, s)
  - s specifies whether d has been deleted or inserted
- Getting the final answer to a query
  - Main index returns a document set D0
  - Stop-press index returns two document sets
    - D+: documents not yet indexed in D0 matching the query
    - D-: documents matching the query removed from the collection since D0 was constructed
- Stop-press index getting too large
  - Rebuild the main index
    - Signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records
  - Stop-press index can then be emptied out
12. Index compression techniques
- Compressing the index so that much of it can be held in memory
  - Required for high-performance IR installations (as with Web search engines)
- Redundancy in index storage
  - Storage of document IDs
- Delta encoding
  - Sort doc IDs in increasing order
  - Store the first ID in full
  - Subsequently store only the difference (gap) from the previous ID
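The delta-encoding steps above can be sketched directly; a minimal round-trip example:

```python
def gaps(doc_ids):
    """Delta-encode a sorted posting list: first ID in full, then gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def ungaps(g):
    """Recover the original doc IDs by a running sum over the gaps."""
    ids, cur = [], 0
    for d in g:
        cur += d
        ids.append(cur)
    return ids

posting = [100, 105, 140, 141]
g = gaps(posting)
```

Because postings are sorted, the gaps are small positive integers, which is what makes the variable-length codes on the next slides pay off.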
13. Encoding gaps
- A small gap must cost far fewer bits than a full document ID
- Binary encoding
  - Optimal when all symbols are equally likely
- Unary code
  - Optimal if the probability of large gaps decays exponentially
14. Encoding gaps
- Gamma code
  - Represent gap x as:
    - Unary code for 1 + floor(log2 x), followed by
    - x - 2^floor(log2 x) represented in binary (floor(log2 x) bits)
- Golomb codes
  - Further enhancement
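The gamma code above can be written out as a short sketch (using the convention that the unary code for n is n-1 ones followed by a terminating zero; other conventions differ only in bit polarity):

```python
def unary(n):
    """Unary code: n >= 1 as (n - 1) ones followed by a terminating zero."""
    return "1" * (n - 1) + "0"

def gamma(x):
    """Gamma code for a gap x >= 1: unary(1 + floor(log2 x)) then the
    remainder x - 2^floor(log2 x) in floor(log2 x) binary bits."""
    l = x.bit_length() - 1      # floor(log2 x)
    if l == 0:
        return unary(1)         # x == 1
    body = x - (1 << l)         # strip the implicit leading 1 bit
    return unary(1 + l) + format(body, "b").zfill(l)
```

A gap x thus costs about 2*floor(log2 x) + 1 bits, far fewer than a fixed 32-bit document ID for small gaps.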
15. Lossy compression mechanisms
- Trading off space for time
- Collect documents into buckets
  - Construct an inverted index from terms to bucket IDs
  - Document IDs shrink to half their size
- Cost: time overheads
  - For each query, all documents in the matching bucket need to be scanned
- Solution: index documents in each bucket separately
  - E.g.: Glimpse (http://webglimpse.org/)
16. General dilemmas
- Messy updates vs. high compression rate
- Storage allocation vs. random I/Os
- Random I/O vs. large-scale implementation
17. Relevance ranking
- Keyword queries
  - In natural language
  - Not precise, unlike SQL
    - A Boolean yes/no decision per response is unacceptable
  - Solution
    - Rate each document for how likely it is to satisfy the user's information need
    - Sort in decreasing order of the score
    - Present results in a ranked list
- No algorithmic way of ensuring that the ranking strategy always favors the information need
  - The query is only a part of the user's information need
18. Responding to queries
- Set-valued response
  - Response set may be very large
    - (E.g., by recent estimates, over 12 million Web pages contain the word java.)
- Demanding a more selective query from the user
- Guessing the user's information need and ranking responses
- Evaluating rankings
19. Evaluation procedure
- Given benchmark
  - A corpus of n documents D
  - A set of queries Q
  - For each query q in Q, an exhaustive set D_q of relevant documents, D_q subset of D, identified manually
- Query q submitted to the system
  - System returns a ranked list of documents (d_1, d_2, ..., d_n)
  - Compute a 0/1 relevance list (r_1, r_2, ..., r_n)
    - r_i = 1 iff the retrieved d_i is in D_q, else r_i = 0
20. Recall and precision
- Recall at rank k
  - Fraction of all relevant documents included in (d_1, d_2, ..., d_k):
    recall(k) = (1 / |D_q|) * sum over 1 <= i <= k of r_i
- Precision at rank k
  - Fraction of the top k responses that are actually relevant:
    precision(k) = (1 / k) * sum over 1 <= i <= k of r_i
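The two formulas above translate directly into code; a minimal sketch over a 0/1 relevance list r:

```python
def recall_at(k, r, n_relevant):
    """recall(k) = (1/|D_q|) * sum of r_i for i <= k."""
    return sum(r[:k]) / n_relevant

def precision_at(k, r):
    """precision(k) = (1/k) * sum of r_i for i <= k."""
    return sum(r[:k]) / k

r = [1, 0, 1, 0, 0]   # relevance judgments for a ranked list, |D_q| = 2
```

Note the two measures share the same numerator at each k; only the normalizer differs.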
21. Other measures
- Average precision
  - Sum of precision at each relevant hit position in the response list, divided by the total number of relevant documents:
    avg.precision = (1 / |D_q|) * sum over 1 <= k <= |D| of r_k * precision(k)
  - avg.precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
- Interpolated precision
  - To combine precision values from multiple queries
  - Gives a precision-vs.-recall curve for the benchmark
    - For each query, take the maximum precision obtained for the query at any recall greater than or equal to the given recall level rho
    - Average these together over all queries
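The average-precision formula above can be checked with a small sketch:

```python
def average_precision(r, n_relevant):
    """avg.precision = (1/|D_q|) * sum of r_k * precision(k) over all k."""
    total, hits = 0.0, 0
    for k, rk in enumerate(r, start=1):
        hits += rk
        if rk:
            total += hits / k   # precision at this relevant hit position
    return total / n_relevant
```

The perfect-ranking case of the slide is easy to verify: if all |D_q| relevant documents come first, every contributing precision(k) is 1 and the sum equals |D_q|, so avg.precision = 1.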
22. Precision-recall tradeoff
- Interpolated precision cannot increase with recall
  - Interpolated precision at recall level 0 may be less than 1
- At rank k = 0
  - Precision (by convention) = 1, Recall = 0
- Inspecting more documents
  - Can increase recall
  - Precision may decrease
    - We will start encountering more and more irrelevant documents
- A search engine with a good ranking function will generally show a negative relation between recall and precision
  - The higher the curve, the better the engine
23. Precision and interpolated precision plotted against recall for the given relevance vector. Missing r_k are zeroes.
24. The vector space model
- Documents represented as vectors in a multi-dimensional Euclidean space
  - Each axis = a term (token)
- Coordinate of document d in the direction of term t determined by:
  - Term frequency TF(d, t)
    - Number of times term t occurs in document d, scaled in a variety of ways to normalize document length
  - Inverse document frequency IDF(t)
    - To scale down the coordinates of terms that occur in many documents
25. Term frequency
- Common normalizations:
    TF(d, t) = n(d, t) / sum over tau of n(d, tau)
    TF(d, t) = n(d, t) / max over tau of n(d, tau)
- The Cornell SMART system uses a smoothed version:
    TF(d, t) = 0                       if n(d, t) = 0
    TF(d, t) = 1 + log(1 + n(d, t))    otherwise
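A minimal sketch of the two normalizations above (function names are mine, not SMART's):

```python
from math import log

def tf_max_norm(n_dt, max_n):
    """TF(d,t) = n(d,t) / max over tau of n(d,tau)."""
    return n_dt / max_n

def tf_smart(n_dt):
    """Smoothed TF as given on the slide: 0 if absent, else 1 + log(1 + n)."""
    return 0.0 if n_dt == 0 else 1 + log(1 + n_dt)
```

The log damping means a term occurring 100 times contributes far less than 100 times the weight of a single occurrence, which is the point of the smoothing.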
26. Inverse document frequency
- Given
  - D is the document collection and D_t is the set of documents containing t
- Formulae
  - Mostly dampened functions of |D| / |D_t|
  - SMART:
    IDF(t) = log((1 + |D|) / |D_t|)
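The SMART formula above in code, to see the damping behavior:

```python
from math import log

def idf(n_docs, df_t):
    """IDF(t) = log((1 + |D|) / |D_t|), with df_t = |D_t|."""
    return log((1 + n_docs) / df_t)
```

A term appearing in every document gets weight near zero, while a term appearing in one document out of a thousand gets a large weight; this is what keeps ubiquitous terms from dominating the document vectors.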
27. Vector space model
- Coordinate of document d in axis t
  - d_t = TF(d, t) * IDF(t)
  - Document transformed to a vector d in the TFIDF-space
- Query q
  - Interpreted as a document
  - Transformed to a vector q in the same TFIDF-space as d
28. Measures of proximity
- Distance measure
  - Magnitude of the vector difference: |d - q|
  - Document vectors must be normalized to unit (L1 or L2) length
    - Else shorter documents dominate (since queries are short)
- Cosine similarity
  - Cosine of the angle between d and q
    - Shorter documents are penalized
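Cosine similarity over sparse TFIDF vectors can be sketched as follows (the dict-of-weights representation is an assumption for the example):

```python
from math import sqrt

def cosine(d, q):
    """Cosine of the angle between sparse vectors (dicts term -> weight)."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = sqrt(sum(w * w for w in d.values())) * sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

d = {"java": 2.0, "island": 1.0}
q = {"java": 1.0}
```

Dividing by both norms is what makes the measure length-invariant, addressing the domination-by-short-documents problem noted above.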
29. Relevance feedback
- Users learn how to modify queries
  - Response list must have at least some relevant documents
  - Relevance feedback
    - 'Correcting' the ranks to the user's taste
    - Automates the query refinement process
- Rocchio's method
  - Folds user feedback into the query vector q
  - Add a weighted sum of the vectors for relevant documents D+, and subtract a weighted sum for the irrelevant documents D-:
    q' = alpha * q + beta * sum over d in D+ of d - gamma * sum over d in D- of d
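Rocchio's update can be sketched over sparse dict vectors; the default alpha/beta/gamma values below are illustrative choices, not prescribed by the book:

```python
def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*sum(D+) - gamma*sum(D-), on sparse dict vectors.

    The weights alpha, beta, gamma here are illustrative defaults.
    """
    qp = {t: alpha * w for t, w in q.items()}
    for d in relevant:           # pull the query toward relevant documents
        for t, w in d.items():
            qp[t] = qp.get(t, 0.0) + beta * w
    for d in irrelevant:         # push it away from irrelevant ones
        for t, w in d.items():
            qp[t] = qp.get(t, 0.0) - gamma * w
    return qp

q2 = rocchio({"java": 1.0}, [{"java": 1.0, "api": 1.0}], [{"coffee": 1.0}])
```

Terms from relevant documents (like "api") enter the query vector even though the user never typed them, which is the refinement effect the slide describes.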
30. Relevance feedback (contd.)
- Pseudo-relevance feedback
  - D+ and D- generated automatically
    - E.g.: Cornell SMART system
    - Top 10 documents reported by the first round of query execution are included in D+
  - gamma typically set to 0; D- not used
- Not a commonly available feature
  - Web users want instant gratification
  - System complexity
    - Executing the second-round query is slower and expensive for major search engines
31. Ranking by odds ratio
- R: Boolean random variable which represents the relevance of document d w.r.t. query q
- Ranking documents by their odds ratio for relevance:
    Pr(R|q,d) / Pr(notR|q,d) = [Pr(R,q,d)/Pr(q,d)] / [Pr(notR,q,d)/Pr(q,d)]
                             = [Pr(R|q) * Pr(d|R,q)] / [Pr(notR|q) * Pr(d|notR,q)]
- Approximating the probability of d by the product of the probabilities of the individual terms in d:
    Pr(d|R,q) ~= product over t of Pr(x_t|R,q)
    Pr(d|notR,q) ~= product over t of Pr(x_t|notR,q)
- Approximately:
    Pr(R|q,d) / Pr(notR|q,d) is proportional to the product over t in (q intersect d) of [a_t,q * (1 - b_t,q)] / [b_t,q * (1 - a_t,q)]
32. Bayesian inferencing
Bayesian inference network for relevance ranking. A document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief in the node corresponding to the query.
Manual specification of mappings between terms to approximate concepts.
33. Bayesian inferencing (contd.)
- Four layers
  1. Document layer
  2. Representation layer
  3. Query concept layer
  4. Query
- Each node is associated with a random Boolean variable, reflecting belief
- Directed arcs signify that the belief of a node is a function of the belief of its immediate parents (and so on)
34. Bayesian inferencing systems
- Layers 2 & 3 are the same as in basic vector-space IR systems
- Verity's Search97
  - Allows administrators and users to define hierarchies of concepts in files
- Estimating the relevance of a document d w.r.t. the query q
  - Set the belief of the corresponding node to 1
  - Set all other document beliefs to 0
  - Compute the belief of the query
  - Rank documents in decreasing order of the belief they induce in the query
35. Other issues
- Spamming
  - Adding popular query terms to a page unrelated to those terms
  - E.g.: adding "Hawaii vacation rental" to a page about "Internet gambling"
  - Only a little setback thanks to hyperlink-based ranking
- Titles, headings, meta tags and anchor text
  - The TFIDF framework treats all terms the same
  - Meta search engines:
    - Assign weightage to text occurring in tags and meta-tags
  - Using anchor text on pages u which link to v
    - Anchor text on u offers valuable editorial judgment about v as well
36. Other issues (contd.)
- Including phrases to rank complex queries
  - Operators to specify word inclusions and exclusions
  - With operators and phrases, queries/documents can no longer be treated as ordinary points in vector space
- Dictionary of phrases
  - Could be cataloged manually
  - Could be derived from the corpus itself using statistical techniques
  - Two separate indices: one for single terms and another for phrases
37. Corpus-derived phrase dictionary
- Two terms t1 and t2
- Null hypothesis: occurrences of t1 and t2 are independent
- To the extent the pair violates the null hypothesis, it is likely to be a phrase
  - Measure the violation with the likelihood ratio of the hypothesis
  - Pick phrases that violate the null hypothesis with large confidence
- Contingency table built from counts (a bar denotes absence of the term):
    k11 = k(t1, t2)        k10 = k(t1, not t2)
    k01 = k(not t1, t2)    k00 = k(not t1, not t2)
38. Corpus-derived phrase dictionary
- Hypotheses
  - Null hypothesis (t1, t2 independent, with marginal probabilities p1, p2):
    H(p1, p2; k00, k01, k10, k11) proportional to ((1-p1)(1-p2))^k00 * ((1-p1)p2)^k01 * (p1(1-p2))^k10 * (p1 p2)^k11
  - Alternative hypothesis (unconstrained cell probabilities):
    H(p00, p01, p10, p11; k00, k01, k10, k11) proportional to p00^k00 * p01^k01 * p10^k10 * p11^k11
  - Likelihood ratio:
    lambda = [max over p in Theta0 of H(p; k)] / [max over p in Theta of H(p; k)]
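The likelihood-ratio test above can be sketched as follows. This computes -2 log lambda (the usual monotone transform of the ratio) using the maximum-likelihood estimates p1 = (k11+k10)/n and p2 = (k11+k01)/n for the null model; larger scores mean stronger evidence that (t1, t2) is a phrase.

```python
from math import log

def llr(k11, k10, k01, k00):
    """-2 log lambda for the 2x2 contingency table of (t1, t2) counts."""
    n = k11 + k10 + k01 + k00

    def loglik(ks, ps):  # log-likelihood of cell counts under cell probabilities
        return sum(k * log(p) for k, p in zip(ks, ps) if k)

    p1 = (k11 + k10) / n            # MLE of Pr(t1) under independence
    p2 = (k11 + k01) / n            # MLE of Pr(t2) under independence
    null = loglik((k00, k01, k10, k11),
                  ((1 - p1) * (1 - p2), (1 - p1) * p2, p1 * (1 - p2), p1 * p2))
    alt = loglik((k00, k01, k10, k11),
                 (k00 / n, k01 / n, k10 / n, k11 / n))
    return -2 * (null - alt)
```

When the observed cell counts match the independence prediction exactly, the score is zero; strong co-occurrence drives it up sharply.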
39. Approximate string matching
- Non-uniformity of word spellings
  - Dialects of English
  - Transliteration from other languages
- Two ways to reduce this problem:
  1. Aggressive conflation mechanism to collapse variant spellings into the same token
  2. Decompose terms into sequences of q-grams, i.e., sequences of q characters
40. Approximate string matching
1. Aggressive conflation mechanism to collapse variant spellings into the same token
   - E.g.: Soundex, which takes phonetics and pronunciation details into account
   - Used with great success in indexing and searching last names in census and telephone directory data
2. Decompose terms into sequences of q-grams (sequences of q characters)
   - Check for similarity in the q-grams (2 <= q <= 4)
   - Looking up the inverted index becomes a two-stage affair:
     - A smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms
     - These terms are then submitted to the regular index
   - Used by Google for spelling correction
   - The idea is also adopted for eliminating near-duplicate pages
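The q-gram decomposition can be sketched as follows; the `$` padding characters (so that word boundaries are represented too) and the Jaccard-style overlap measure are assumptions of this example:

```python
def qgrams(term, q=3):
    """Set of character q-grams of a term, padded at both ends."""
    padded = "$" * (q - 1) + term + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_sim(a, b, q=3):
    """Overlap of q-gram sets, as a Jaccard-style ratio in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)
```

Variant spellings share most of their q-grams even when no single token matches, which is what lets the q-gram index expand a misspelled query term into plausible corrections.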
41. Meta-search systems
- Take the search engine to the document
  - Forward queries to many geographically distributed repositories
    - Each has its own search service
  - Suit a single user query to many search engines with different query syntax
  - Consolidate their responses
- Advantages
  - Perform non-trivial query rewriting
  - Surprisingly small overlap between crawls
- Consolidating responses
  - Function goes beyond just eliminating duplicates
  - Search services do not provide standard ranks which can be combined meaningfully
42. Similarity search
- Cluster hypothesis
  - Documents similar to relevant documents are also likely to be relevant
- Handling "find similar" queries
  - Replication or duplication of pages
  - Mirroring of sites
43. Document similarity
- Jaccard coefficient of similarity between documents d1 and d2
  - T(d) = set of tokens in document d
    r'(d1, d2) = |T(d1) intersect T(d2)| / |T(d1) union T(d2)|
  - Symmetric, reflexive, not a metric
  - Forgives any number of occurrences and any permutation of the terms
  - 1 - r'(d1, d2) is a metric
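The Jaccard coefficient above in code, using whitespace token sets (the tokenizer is an assumption of the example):

```python
def jaccard(d1, d2):
    """r'(d1, d2) = |T(d1) & T(d2)| / |T(d1) | T(d2)| over token sets."""
    t1, t2 = set(d1.split()), set(d2.split())
    return len(t1 & t2) / len(t1 | t2)

a = "java island coffee"
b = "coffee from the java island"
```

Because it works on sets, repeating a token or permuting the text leaves the score unchanged, exactly as the slide notes.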
44. Estimating the Jaccard coefficient with random permutations
1. Generate a set of m random permutations {pi}
2. for each permutation pi do
3.   compute pi(T(d1)) and pi(T(d2))
4.   check if min pi(T(d1)) = min pi(T(d2))
5. end for
6. if equality was observed in k cases, estimate r'(d1, d2) = k / m
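The estimator above can be sketched by simulating each random permutation with a salted hash (an implementation convenience, not the book's construction; md5 is used only to get a deterministic hash):

```python
import hashlib

def minhash_estimate(t1, t2, m=400):
    """Estimate r'(t1, t2) as the fraction of m simulated random
    permutations under which both sets attain the same minimum."""
    k = 0
    for i in range(m):
        # Per-round salted hash stands in for a random permutation
        h = lambda x: hashlib.md5(f"{i}:{x}".encode()).digest()
        if min(map(h, t1)) == min(map(h, t2)):
            k += 1
    return k / m

t1 = set("abcdefgh")
t2 = set("abcdexyz")   # true Jaccard coefficient = 5/11
```

Each round, the probability that the two minima coincide is exactly the Jaccard coefficient, so k/m converges to it as m grows.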
45. Fast similarity search with random permutations
1. for each random permutation pi do
2.   create a file f_pi
3.   for each document d do
4.     write out <s = min pi(T(d)), d> to f_pi
5.   end for
6.   sort f_pi using key s -- this results in contiguous blocks with fixed s containing all associated d's
7.   create a file g_pi
8.   for each pair (d1, d2) within a run of f_pi having a given s do
9.     write out a document-pair record (d1, d2) to g_pi
10.  end for
11.  sort g_pi on key (d1, d2)
12. end for
13. merge g_pi for all pi in (d1, d2) order, counting the number of entries for each (d1, d2)
46. Eliminating near-duplicates via shingling
- The "find-similar" algorithm reports all duplicate/near-duplicate pages
- Eliminating duplicates
  - Maintain a checksum with every page in the corpus
- Eliminating near-duplicates
  - Represent each document as a set T(d) of q-grams (shingles)
  - Find the Jaccard similarity r'(d1, d2) between d1 and d2
  - Eliminate the pair from step 9 if it has similarity above a threshold
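Shingling can be sketched as follows, using word 4-grams as shingles and exact Jaccard similarity (in practice the minhash estimate of slide 44 replaces the exact computation); the threshold value is an illustrative assumption:

```python
def shingles(text, q=4):
    """Represent a document as its set T(d) of word q-grams (shingles)."""
    words = text.split()
    return {tuple(words[i:i + q]) for i in range(max(1, len(words) - q + 1))}

def near_duplicate(d1, d2, threshold=0.7):
    """Flag a pair whose shingle-set Jaccard similarity exceeds the threshold."""
    s1, s2 = shingles(d1), shingles(d2)
    return len(s1 & s2) / len(s1 | s2) >= threshold

page = "the quick brown fox jumps over the lazy dog " * 5
copy = page + "extra footer line"
```

Unlike a page checksum, which only catches byte-identical copies, shingle overlap survives small edits such as an added footer.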
47. Detecting locally similar sub-graphs of the Web
- Similarity search and duplicate elimination on the graph structure of the Web
- To improve the quality of hyperlink-assisted ranking
- Detecting mirrored sites
- Approach 1 [bottom-up approach]
  1. Start the process with textual duplicate detection
     - Cleaned URLs are listed and sorted to find duplicates/near-duplicates
     - Each set of equivalent URLs is assigned a unique token ID
     - Each page is stripped of all text and represented as a sequence of outlink IDs
  2. Continue using the link sequence representation
     - Until no further collapse of multiple URLs is possible
- Approach 2 [bottom-up approach]
  1. Identify single nodes which are near-duplicates (using text shingling)
  2. Extend single-node mirrors to two-node mirrors
  3. Continue on to larger and larger graphs which are likely mirrors of one another
48. Detecting mirrored sites (contd.)
- Approach 3 [step before fetching all pages]
  - Uses regularity in URL strings to identify host pairs which are mirrors
- Preprocessing
  - Hosts are represented as sets of positional bigrams
  - Convert host and path to all lowercase characters
  - Let any punctuation or digit sequence be a token separator
  - Tokenize the URL into a sequence of tokens (e.g., www6.infoseek.com gives www, infoseek, com)
  - Eliminate stop terms such as htm, html, txt, main, index, home, bin, cgi
  - Form positional bigrams from the token sequence
- Two hosts are said to be mirrors if
  - A large fraction of paths are valid on both web sites
  - These common paths link to pages that are near-duplicates
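The preprocessing pipeline above can be sketched end to end; the exact bigram representation (token pair plus position) follows the slide, while the regex-based tokenizer is an implementation assumption:

```python
import re

def positional_bigrams(url):
    """Represent a URL as a set of positional token bigrams.

    Lowercase, treat punctuation/digit runs as separators, drop stop terms,
    then pair up adjacent tokens with their position.
    """
    stop = {"htm", "html", "txt", "main", "index", "home", "bin", "cgi"}
    tokens = [t for t in re.split(r"[^a-z]+", url.lower()) if t and t not in stop]
    return {(a, b, i) for i, (a, b) in enumerate(zip(tokens, tokens[1:]))}
```

Because digit runs are separators, www6.infoseek.com and www.infoseek.com yield identical bigram sets, which is precisely the regularity this approach exploits to pair up mirror hosts before fetching their pages.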