SlideShare ist ein Scribd-Unternehmen logo
1 von 67
CSE 435/535
Information Retrieval
Fall 2013

Chapter 19: Web Search Basics
Brief (non-technical) history
īŽ

Early keyword-based engines
īŽ

īŽ

Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997

Sponsored search ranking: Goto.com (morphed
into Overture.com → Yahoo!)
īŽ

īŽ

Your search ranking depended on how much you
paid
Auction for keywords: casino was expensive!
Brief (non-technical) history
īŽ

1998+: Link-based ranking pioneered by Google
īŽ
īŽ

īŽ

īŽ

Blew away all early engines save Inktomi
Great user experience in search of a business
model
Meanwhile Goto/Overture’s annual revenues were
nearing $1 billion

Result: Google added paid-placement “ads” to
the side, independent of search results
īŽ

Yahoo followed suit, acquiring Overture (for paid
placement) and Inktomi (for search)
Ads

Algorithmic results.
Web search basics
Sponsored Links
CG Appliance Express
Discount Appliances (650) 756-3931
Same Day Certified Installation
www.cgappliance.com
San Francisco-Oakland-San Jose,
CA

User

Miele Vacuum Cleaners
Miele Vacuums- Complete Selection
Free Shipping!
www.vacuums.com
Miele Vacuum Cleaners
Miele-Free Air shipping!
All models. Helpful advice.
www.best-vacuum.com

Web

Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)

Miele, Inc -- Anything else is a compromise

Web spider

At the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances.
Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ...
www.miele.com/ - 20k - Cached - Similar pages

Miele
Welcome to Miele, the home of the very best appliances and kitchens in the world.
www.miele.co.uk/ - 3k - Cached - Similar pages

Miele - Deutscher Hersteller von Einbaugeräten, Hausgeräten ... - [ Translate this
page ]
Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit
...ein Leben lang. ... Wählen Sie die Miele Vertretung Ihres Landes.
www.miele.de/ - 10k - Cached - Similar pages

Herzlich willkommen bei Miele Österreich - [ Translate this page ]
Herzlich willkommen bei Miele Österreich Wenn Sie nicht automatisch
weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE ...
www.miele.at/ - 3k - Cached - Similar pages

Search

Indexer

The Web
Indexes

Ad indexes
User Needs
īŽ

Need [Brod02, RL04]
īŽ

īŽ

īŽ

Informational – want to learn about something (~40% / 65%)
Low hemoglobin
Navigational – want to go to that page (~25% / 15%)
United Airlines
Transactional – want to do something (web-mediated) (~35% / 20%)
īŽ Access a service
Seattle weather
Mars surface images
īŽ Downloads
īŽ

īŽ

Shop

Canon S410

Gray areas
īŽ
īŽ

Car rental Brasil
Find a good hub
Exploratory search “see what’s there”
How far do people look for results?

(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Users’ empirical evaluation of results
īŽ

Quality of pages varies widely
īŽ
īŽ

Relevance is not enough
Other desirable qualities (non IR!!)
īŽ
īŽ
īŽ

īŽ

Precision vs. recall
īŽ

īŽ

On the web, recall seldom matters

What matters
īŽ
īŽ

Precision at 1? Precision above the fold?
Comprehensiveness – must be able to deal with obscure
queries
īŽ

īŽ

Content: Trustworthy, diverse, non-duplicated, well maintained
Web readability: display correctly & fast
No annoyances: pop-ups, etc

Recall matters when the number of matches is very small

User perceptions may be unscientific, but are
significant over a large aggregate
Users’ empirical evaluation of engines
īŽ
īŽ
īŽ
īŽ
īŽ

Relevance and validity of results
UI – Simple, no clutter, error tolerant
Trust – Results are objective
Coverage of topics for polysemic queries
Pre/Post process tools provided
īŽ
īŽ
īŽ

īŽ

Mitigate user errors (auto spell check, search assist,â€Ļ)
Explicit: Search within results, more like this, refine ...
Anticipative: related searches

Deal with idiosyncrasies
īŽ

Web specific vocabulary
īŽ

īŽ
īŽ

Impact on stemming, spell-check, etc

Web addresses typed in the search box
â€Ļ
The Web document collection
īŽ
īŽ

īŽ

īŽ

īŽ

īŽ

The Web
īŽ

No design/co-ordination
Distributed content creation, linking,
democratization of publishing
Content includes truth, lies, obsolete
information, contradictions â€Ļ
Unstructured (text, html, â€Ļ), semistructured (XML, annotated photos),
structured (Databases)â€Ļ
Scale much larger than previous text
collections â€Ļ but corporate records
are catching up
Growth – slowed down from initial
“volume doubling every few months”
but still expanding
Content can be dynamically
generated
Spam

Search Engine Optimization
The trouble with sponsored search â€Ļ
īŽ
īŽ

It costs money. What’s the alternative?
Search Engine Optimization:
īŽ

īŽ
īŽ

īŽ

īŽ

“Tuning” your web page to rank highly in the
algorithmic search results for select keywords
Alternative to paying for placement
Thus, intrinsically a marketing function

Performed by companies, webmasters and
consultants (“Search engine optimizers”) for
their clients
Some perfectly legitimate, some very shady
Simplest forms
īŽ

First generation engines relied heavily on tf/idf
īŽ

īŽ

The top-ranked pages for the query maui resort were the
ones containing the most maui’s and resort’s

SEOs responded with dense repetitions of chosen
terms
īŽ
īŽ

e.g., maui resort maui resort maui resort
Often, the repetitions would be in the same color as the
background of the web page
īŽ
īŽ

Repeated terms got indexed by crawlers
But not visible to humans on browsers

Pure word density cannot
be trusted as an IR signal
Variants of keyword stuffing
īŽ
īŽ

Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks,
etc.

Meta-Tags =
“â€Ļ London hotels, hotel, holiday inn, hilton, discount,
booking, reservation, sex, mp3, britney spears, viagra, â€Ļ”
Search engine optimization (Spam)
īŽ

Motives
īŽ
īŽ

īŽ

Operators
īŽ
īŽ
īŽ

īŽ

Commercial, political, religious, lobbies
Promotion funded by advertising budget
Contractors (Search Engine Optimizers) for lobbies, companies
Web masters
Hosting services

Forums
īŽ

E.g., Web master world ( www.webmasterworld.com )
īŽ
īŽ

Search engine specific tricks
Discussions about academic papers īŠ
Synthetic content

Monetization

Random words
Well-formed
sentences
stitched
together
Links to keep
crawlers going
Features identifying synthetic
content
īŽ

Average word length
īŽ

īŽ

Word frequency distribution
īŽ

īŽ

Certain words (“the”, “a”, â€Ļ) appear more often
than others

N-gram frequency distribution
īŽ

īŽ

The mean word length for English prose is about 5
characters; but longer for some forms of keyword
stuffing

Some words are more likely to occur next to each
other than others

Grammatical well-formedness
īŽ

Natural-language parsing is expensive
Really good synthetic content
“Nigritude Ultramarine”:
An SEO competition

Links to keep
crawlers going
Grammatically
well-formed but
meaningless
sentences
Content “repurposing”
īŽ

Content repurposing: The practice of
incorporating all or portions of other (unaffiliated)
web pages
īŽ

īŽ

īŽ

A “convenient” way to machine generate pages
that contain human-authored content
Not even necessarily illegal â€Ļ

Two flavors:
īŽ
īŽ

Incorporate large portions of a single page
Incorporate snippets of multiple pages
Example of page-level
content “repurposing”
Example of phrase-level content
“repurposing”
How is
monetization
taking place?
Techniques for detecting content
repurposing
īŽ

Single-page flavor: Cluster pages into
equivalence classes of very similar pages
īŽ

īŽ

īŽ

If most pages on a site a very similar to pages on
other sites, raise a red flag
(There are legitimate replicated sites; e.g. mirrors
of Linux man pages)

Many-snippets flavor: Test if page consists
mostly of phrases that also occur somewhere
else
īŽ
īŽ

Computationally hard problem
Probabilistic technique that makes it tractable
(Fetterly et al SIGIR 2005 paper)
Cloaking
īŽ
īŽ

Serve fake content to search engine spider
DNS cloaking: Switch IP address. Impersonate

Y

SPAM

Is this a Search
Engine spider?

Cloaking

N

Real
Doc
More spam techniques
īŽ

Doorway pages
īŽ

īŽ

Link spamming
īŽ

īŽ

īŽ

īŽ

Pages optimized for a single keyword that re-direct to the real
target page
links between pages that are present for reasons other than
merit
Mutual admiration societies, hidden links, awards – more on
these later
Domain flooding: numerous domains that point or re-direct to a
target page

Robots
īŽ

Fake query stream – rank checking programs
īŽ

īŽ

“Curve-fit” ranking programs of search engines

Millions of submissions via Add-Url
The war against spam
īŽ

Quality signals - Prefer
authoritative pages based
on:

īŽ

Votes from authors (linkage
signals)
Votes from users (usage
signals)

īŽ

īŽ

īŽ

īŽ

īŽ
īŽ

īŽ

Anti robot test , i.e. not
submitted by a robot

Limits on meta-keywords
Robust link analysis
īŽ

īŽ

Ignore statistically implausible
linkage (or text)
Use link analysis to detect
spammers (guilt by association)

īŽ

īŽ

Training set based on
known spam

Family friendly filters
īŽ

Policing of URL
submissions
īŽ

Spam recognition by
machine learning

Linguistic analysis, general
classification techniques,
etc.
For images: flesh tone
detectors, source text
analysis, etc.

Editorial intervention
īŽ
īŽ
īŽ
īŽ

Blacklists
Top queries audited
Complaints addressed
Suspect pattern detection
More on spam
īŽ

Web search engines have policies on SEO
practices they tolerate/block
īŽ
īŽ

īŽ

īŽ

http://help.yahoo.com/help/us/ysearch/index.html
http://www.google.com/intl/en/webmasters/

Adversarial IR: the unending (technical) battle
between SEO’s and web search engines
Research http://airweb.cse.lehigh.edu/
Size of the web
What is the size of the web ?
īŽ

Issues
īŽ

The web is really infinite
īŽ
īŽ

īŽ

īŽ

īŽ

Dynamic content, e.g., calendar
Soft 404: www.yahoo.com/<anything> is a valid page

Static web contains syntactic duplication, mostly
due to mirroring (~30%)
Some servers are seldom connected

Who cares?
īŽ
īŽ
īŽ

Media, and consequently the user
Engine design
Engine crawl policy. Impact on recall.
What can we attempt to measure?
The relative sizes of search engines

īŽ

īŽ

īŽ

The notion of a page being indexed is still reasonably
well defined.
Already there are problems
īŽ

īŽ

Document extension: e.g. engines index pages not yet
crawled, by indexing anchortext.
Document restriction: All engines restrict what is indexed
(first n words, only relevant words, etc.)

The coverage of a search engine relative to
another particular crawling process.
īŽ
Graph Structure in the Web

http://www9.org/w9cdrom/160/160.html
Zipf’s Law on the Web
īŽ

Number of in-links/out-links to/from a page has a
Zipfian distribution.
īŽ

īŽ

Length of web pages has a Zipfian distribution.
īŽ

īŽ

A large number of docs have the same number of
in-links/out-links to/from a page
Certain page lengths are most common; i.e. a
huge number of docs with the same page length.

Number of hits to a web page has a Zipfian
distribution.
īŽ

A large number of pages will generate the same
number of hits; dramatic drop-off in number of
pages as the above frequency reduces
New definition?
īŽ

(IQ is whatever the IQ tests measure.)
īŽ

īŽ

Different engines have different preferences
īŽ

īŽ

The statically indexable web is whatever
search engines index.
max url depth, max count/host, anti-spam rules,
priority rules, etc.

Different engines index different things under the
same URL:
īŽ

frames, meta-keywords, document restrictions,
document extensions, ...
Relative Size from Overlap
Given two engines A and B
Sample URLs randomly from A
Check if contained in B and vice versa
A∊ B

A∊ B =
A∊ B =

(1/2) * Size A
(1/6) * Size B

(1/2)*Size A = (1/6)*Size B
∴

Size A / Size B =
(1/6)/(1/2) = 1/3

Each test involves: (i) Sampling (ii) Checking
Sampling URLs
īŽ

Ideal strategy: Generate a random URL and
check for containment in each index.

īŽ

Problem: Random URLs are hard to find!
Enough to generate a random URL contained in
a given Engine.
Approach 1: Generate a random URL contained
in a given engine

īŽ

īŽ

īŽ

īŽ

Pick a random query from search log; pick random page
returned as search result
Suffices for the estimation of relative size

Approach 2: Random walks / IP addresses
īŽ

In theory: might give us a true estimate of the size of the
web (as opposed to just relative sizes of indexes)
Statistical methods
īŽ

Approach 1
īŽ
īŽ

īŽ

*** Random queries: preferred method
Random searches

Approach 2
īŽ
īŽ

Random IP addresses
Random walks
Random URLs from random queries
īŽ

Generate random query: how?
īŽ

īŽ

Not an English
Lexicon: 400,000+ words from a web crawl dictionary

Conjunctive Queries: w1 and w2
e.g., vocalists AND rsi

īŽ
īŽ

īŽ

īŽ

Get 100 result URLs from engine A
Choose a random URL as the candidate to check for
presence in engine B (how do you do this?)
This distribution induces a probability weight W(p) for each
page.
Conjecture: W(SEA) / W(SEB) ~ |SEA| / |SEB|
Query Based Checking
īŽ

Strong Query to check whether an engine B has a
document D (corresponding to random URL):
īŽ
īŽ
īŽ

īŽ

Download D. Get list of words.
Use 8 low frequency words as AND query to B
Check if D is present in result set.

Problems:
īŽ
īŽ
īŽ
īŽ
īŽ

Near duplicates (not D, but something identical to D)
Frames
Redirects
Engine time-outs
Is 8-word query good enough?
Advantages & disadvantages
īŽ
īŽ

Statistically sound under the induced weight.
Biases induced by random query
īŽ

Query Bias: Favors content-rich pages in the language(s) of the
lexicon

īŽ
īŽ
īŽ

Ranking Bias: Solution: Use conjunctive queries & fetch all
Checking Bias: Duplicates, impoverished pages omitted
Document or query restriction bias: engine might not deal
properly with 8 words conjunctive query

īŽ
īŽ

Malicious Bias: Sabotage by engine
Operational Problems: Time-outs, failures, engine
inconsistencies, index modification.
Alt 1: Random searches
īŽ

īŽ

These are real queries based on real users,
rather than “random queries”
Choose random searches extracted from a
local log [Lawrence & Giles 97] or build
“random searches” [Notess]
īŽ
īŽ
īŽ

Use only queries with small results sets.
Count normalized URLs in result sets.
Use ratio statistics
Advantages & disadvantages
īŽ

Advantage
īŽ

īŽ

Might be a better reflection of the human
perception of coverage

Issues
īŽ
īŽ
īŽ

Samples are correlated with source of log
Duplicates
Technical statistical problems (must have nonzero results, ratio average not statistically sound)
Random searches
īŽ
īŽ
īŽ

575 & 1050 queries from the NEC RI employee logs
6 Engines in 1998, 11 in 1999
Implementation:
īŽ Restricted to queries with < 600 results in total
īŽ Counted URLs from each engine after verifying query
match
īŽ Computed size ratio & overlap for individual queries
īŽ Estimated index size ratio & overlap by averaging
over all queries
Queries from Lawrence and Giles study
īŽ
īŽ

īŽ
īŽ
īŽ

īŽ

īŽ
īŽ

īŽ
īŽ

adaptive access control
neighborhood preservation
topographic
hamiltonian structures
right linear grammar
pulse width modulation
neural
unbalanced prior
probabilities
ranked assignment method
internet explorer favourites
importing
karvel thornber
zili liu

īŽ
īŽ

īŽ
īŽ
īŽ
īŽ
īŽ
īŽ
īŽ
īŽ

softmax activation function
bose multidimensional
system theory
gamma mlp
dvi2pdf
john oliensis
rieke spikes exploring neural
video watermarking
counterpropagation network
fat shattering dimension
abelson amorphous
computing
Alt 2: Random IP addresses
īŽ
īŽ

Generate random IP addresses
Find a web server at the given address
īŽ

īŽ

If there’s one

Collect all pages from server
īŽ

From this, choose a page at random
Random IP addresses
īŽ

HTTP requests to random IP addresses
īŽ
īŽ

īŽ

Ignored: empty or authorization required or excluded
[Lawr99] Estimated 2.8 million IP addresses running
crawlable web servers (16 million total) from
observing 2500 servers.
OCLC using IP sampling found 8.7 M hosts in 2001
īŽ

īŽ

Netcraft [Netc02] accessed 37.2 million hosts in July 2002

[Lawr99] exhaustively crawled 2500 servers
and extrapolated
īŽ
īŽ

Estimated size of the web to be 800 million
Estimated use of metadata descriptors:
īŽ

Meta tags (keywords, description) in 34% of home pages,
Dublin core metadata in 0.3%
Advantages & disadvantages
īŽ

Advantages
īŽ
īŽ

īŽ

Clean statistics
Independent of crawling strategies

Disadvantages
īŽ
īŽ
īŽ

Doesn’t deal with duplication
Many hosts might share one IP, or not accept requests
No guarantee all pages are linked to root page.
īŽ

īŽ

Power law for # pages/hosts generates bias towards
sites with few pages.
īŽ

īŽ

Eg: employee pages

But bias can be accurately quantified IF underlying distribution
understood

Potentially influenced by spamming (multiple IP’s for
same server to avoid IP block)
Alt 3: Random walks
īŽ
īŽ

View the Web as a directed graph
Build a random walk on this graph
īŽ

Includes various “jump” rules back to visited sites
īŽ
īŽ

īŽ

Converges to a stationary distribution
īŽ
īŽ
īŽ

īŽ
īŽ

Does not get stuck in spider traps!
Can follow all links!
Must assume graph is finite and independent of the walk.
Conditions are not satisfied (cookie crumbs, flooding)
Time to convergence not really known

Sample from stationary distribution of walk
Use the “strong query” method to check coverage by SE
Advantages & disadvantages
īŽ

Advantages
īŽ
īŽ

īŽ

“Statistically clean” method at least in theory!
Could work even for infinite web (assuming
convergence) under certain metrics.

Disadvantages
īŽ
īŽ
īŽ

List of seeds is a problem.
Practical approximation might not be valid.
Non-uniform distribution
īŽ

Subject to link spamming
Conclusions
īŽ
īŽ
īŽ
īŽ

No sampling solution is perfect.
Lots of new ideas ...
....but the problem is getting harder
Quantitative studies are fascinating and a
good research problem
Duplicate detection
Duplicate documents
īŽ
īŽ

The web is full of duplicated content
Strict duplicate detection = exact match
īŽ

īŽ

Not as common

But many, many cases of near duplicates
īŽ

E.g., Last modified date the only difference
between two copies of a page
Duplicate/Near-Duplicate Detection
īŽ

īŽ

Duplication: Exact match can be detected with
fingerprints (64-bit), if fingerprints match, check
further
Near-Duplication: Approximate match
īŽ

Overview
īŽ

īŽ

Compute syntactic similarity with an edit-distance
measure
Use similarity threshold to detect near-duplicates
īŽ
īŽ

E.g., Similarity > 80% => Documents are “near duplicates”
Not transitive though sometimes used transitively
Computing Similarity
īŽ

īŽ

Features:
īŽ Segments of a document (natural or artificial breakpoints)
īŽ Shingles (Word N-Grams)
īŽ a rose is a rose is a rose →
a_rose_is_a
rose_is_a_rose
is_a_rose_is
a_rose_is_a
Similarity Measure between two docs (= sets of shingles)
īŽ Set intersection
īŽ Specifically (Size_of_Intersection / Size_of_Union)
Shingles + Set Intersection
Computing exact set intersection of shingles
between all pairs of documents is
expensive/intractable
īŽ

īŽ

Approximate using a cleverly chosen subset of
shingles from each (a sketch)

Estimate (size_of_intersection / size_of_union)
based on a short sketch
īŽ

Doc
Doc
A
A

Shingle set A

Doc
Doc
B
B

Shingle set A

Sketch A
Jaccard
Sketch A
Sketch of a document
īŽ

Create a “sketch vector” (of size ~200) for
each document
īŽ

īŽ

Documents that share â‰Ĩ t (say 80%)
corresponding vector elements are near
duplicates
For doc D, sketchD[ i ] is as follows:
īŽ

īŽ
īŽ

Let f map all shingles in the universe to 0..2m
(e.g., f = fingerprinting)
Let Ī€i be a random permutation on 0..2m
Pick MIN {Ī€i(f(s))} over all shingles s in D
Computing Sketch[i] for Doc1
Document 1
264

Start with 64-bit f(shingles)

264

Permute on the number line

2

64

264

with

Ī€i

Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Document 2

Document 1
264
264

264

264
A

264

264

2

64

B

Are these equal?
Test for 200 random permutations: Ī€1, Ī€2,â€Ļ Ī€200

264
Howeverâ€Ļ
Document 2

Document 1

264
264

264
264
A

264
264

B

264
264

A = B iff the shingle with the MIN value in the union of
Doc1 and Doc2 is common to both (i.e., lies in the
intersection)
Why?

Claim: This happens with probability
Size_of_intersection / Size_of_union
Set Similarity of sets Ci , Cj
Jaccard(Ci , C j ) =
īŽ

īŽ

Ci ī‰ C j
Ci ī• C j

View sets as columns of a matrix A; one row for each
element in the universe. aij = 1 indicates presence of
item i in set j
C1 C2
Example
0
1
1
0
1

1
0
1
0
1

Jaccard(C 1 ,C 2 ) = 2/5 = 0.4
Key Observation
īŽ

For columns Ci, Cj, four types of rows
Ci
A
B
C
D

īŽ
īŽ

1
1
0
0

Cj

1
0
1
0

Overload notation: A = # of rows of type A
Claim
A

Jaccard(Ci , C j ) =

A+B+C
“Min” Hashing
īŽ
īŽ

Randomly permute rows
Hash h(Ci) = index of first row with 1 in column Ci

īŽ

Surprising Property

īŽ

Why?

P [ h(Ci ) = h(C j ) ] = Jaccard ( Ci , C j )

īŽ
īŽ
īŽ

Both are A/(A+B+C)
Min: Look down columns Ci, Cj until first non-Type-D row
h(Ci) = h(Cj) īƒŸīƒ  type A row
īŽ

Because we chose a random permutation, prob of this occurring is
A/(A+B+C)
Min-Hash sketches
īŽ
īŽ

Pick P random row permutations
MinHash sketch
SketchD = list of P (usually, 200) indexes (i.e.
integers) of first rows with 1 in column C

īŽ

Similarity of signatures
īŽ

īŽ

Let sim[sketch(Ci),sketch(Cj)] = fraction of
permutations where MinHash values agree
Observe E[sim(sig(Ci),sig(Cj))] = Jaccard(Ci,Cj)
Example

R1
R2
R3
R4
R5

C1
1
0
1
1
0

C2
0
1
0
0
1

C3
1
1
0
1
0

Signatures (composed of min)
S1 S2 S3
Perm 1 = (12345) 1 2 1
Perm 2 = (54321) 4 5 4
Perm 3 = (34512) 3 5 4

Similarities
1-2
1-3
2-3
Col-Col 0.00 0.50 0.25
Sig-Sig 0.00 0.67 0.00
Good estimate
Implementation Trick
īŽ
īŽ

Permuting universe even once is prohibitive
Row Hashing
īŽ
īŽ

īŽ

Pick P hash functions hk: {1,â€Ļ,n}īƒ {1,â€Ļ,O(n)}
Ordering under hk gives random permutation of
rows

One-pass Implementation
īŽ

For each Ci and hk, keep “slot” for min-hash value

īŽ

Initialize all slot(Ci,hk) to infinity

īŽ

Scan rows in arbitrary order looking for 1’s
īŽ

Suppose row Rj has 1 in column Ci

īŽ

For each hk,
Example
R1
R2
R3
R4
R5

C1
1
0
1
1
0

C2
0
1
1
0
1

h(x) = x mod 5
g(x) = 2x+1 mod 5

h(1) = 1
g(1) = 3

C1 slots C2 slots
1
3
-

h(2) = 2
g(2) = 0

1
3

2
0

h(3) = 3
g(3) = 2

1
2

2
0

h(4) = 4
g(4) = 4
h(5) = 0
g(5) = 1

1
2
1
2

2
0
0
0
Comparing Signatures
īŽ

īŽ

Signature Matrix S
īŽ Rows = Hash Functions
īŽ Columns = Columns
īŽ Entries = Signatures
Can compute – Pair-wise similarity of
any pair of signature columns
All signature pairs
īŽ

īŽ

Now we have an extremely efficient method for
estimating a Jaccard coefficient for a single pair
of documents.
But we still have to estimate N2 coefficients where
N is the number of web pages.
īŽ

īŽ
īŽ

Still slow

One solution: locality sensitive hashing (LSH)
Another solution: sorting (Henzinger 2006)
More resources
īŽ

IIR Chapter 19

Weitere ähnliche Inhalte

Was ist angesagt?

Search Engine Marketing - Search Engines Background and SEO Introduction
Search Engine Marketing - Search Engines Background and SEO IntroductionSearch Engine Marketing - Search Engines Background and SEO Introduction
Search Engine Marketing - Search Engines Background and SEO IntroductionJon Rognerud Chaosmap Digital
 
Seo & Internet Marketing
Seo & Internet MarketingSeo & Internet Marketing
Seo & Internet Marketingalafrancis
 
Seo Manual
Seo ManualSeo Manual
Seo Manualimgaurav16
 
Introduction to SEO
Introduction to SEOIntroduction to SEO
Introduction to SEORand Fishkin
 
What you need to know about seo for Rocks Digital Conference by Barry Schwartz
What you need to know about seo for Rocks Digital Conference by Barry SchwartzWhat you need to know about seo for Rocks Digital Conference by Barry Schwartz
What you need to know about seo for Rocks Digital Conference by Barry SchwartzBarry Schwartz
 
Getting the most out of your web site.
Getting the most out of your web site.Getting the most out of your web site.
Getting the most out of your web site.Stephen Shaw
 
Onsite seo from the wizard of moz
Onsite seo from the  wizard of mozOnsite seo from the  wizard of moz
Onsite seo from the wizard of mozUmberto Tessitore
 
SEO: search Engine Optimization
SEO: search Engine OptimizationSEO: search Engine Optimization
SEO: search Engine Optimizationphoolchand yadav
 
Acting local in a global world (local SEO)
Acting local in a global world (local SEO)Acting local in a global world (local SEO)
Acting local in a global world (local SEO)Similarweb
 
Digital Discoveries: Strategies to End 2013 with Online Business Success
Digital Discoveries: Strategies to End 2013 with Online Business SuccessDigital Discoveries: Strategies to End 2013 with Online Business Success
Digital Discoveries: Strategies to End 2013 with Online Business SuccesstbkCreative
 
Emarketing [repaired]
Emarketing [repaired]Emarketing [repaired]
Emarketing [repaired]stephen Ward
 
Web analytics
Web analyticsWeb analytics
Web analyticsracerunner
 
Competitive Intelligence For Competitive SEOs
Competitive Intelligence For Competitive SEOsCompetitive Intelligence For Competitive SEOs
Competitive Intelligence For Competitive SEOsJoelBondorowsky
 
Emarketing
EmarketingEmarketing
Emarketingbavalekar
 
John Jantsch's Keynote Presentation at the eMarketing For Entrepreneurs Confe...
John Jantsch's Keynote Presentation at the eMarketing For Entrepreneurs Confe...John Jantsch's Keynote Presentation at the eMarketing For Entrepreneurs Confe...
John Jantsch's Keynote Presentation at the eMarketing For Entrepreneurs Confe...Corporate College
 

Was ist angesagt? (20)

Search Engine Marketing - Search Engines Background and SEO Introduction
Search Engine Marketing - Search Engines Background and SEO IntroductionSearch Engine Marketing - Search Engines Background and SEO Introduction
Search Engine Marketing - Search Engines Background and SEO Introduction
 
Seo & Internet Marketing
Seo & Internet MarketingSeo & Internet Marketing
Seo & Internet Marketing
 
Seo Manual
Seo ManualSeo Manual
Seo Manual
 
Introduction to SEO
Introduction to SEOIntroduction to SEO
Introduction to SEO
 
What you need to know about seo for Rocks Digital Conference by Barry Schwartz
What you need to know about seo for Rocks Digital Conference by Barry SchwartzWhat you need to know about seo for Rocks Digital Conference by Barry Schwartz
What you need to know about seo for Rocks Digital Conference by Barry Schwartz
 
SEO Workshop
SEO WorkshopSEO Workshop
SEO Workshop
 
Getting the most out of your web site.
Getting the most out of your web site.Getting the most out of your web site.
Getting the most out of your web site.
 
Onsite seo from the wizard of moz
Onsite seo from the  wizard of mozOnsite seo from the  wizard of moz
Onsite seo from the wizard of moz
 
SEO: search Engine Optimization
SEO: search Engine OptimizationSEO: search Engine Optimization
SEO: search Engine Optimization
 
Acting local in a global world (local SEO)
Acting local in a global world (local SEO)Acting local in a global world (local SEO)
Acting local in a global world (local SEO)
 
Digital Discoveries: Strategies to End 2013 with Online Business Success
Digital Discoveries: Strategies to End 2013 with Online Business SuccessDigital Discoveries: Strategies to End 2013 with Online Business Success
Digital Discoveries: Strategies to End 2013 with Online Business Success
 
Collins book (1) (1)
Collins book (1) (1)Collins book (1) (1)
Collins book (1) (1)
 
Seo ppt
Seo pptSeo ppt
Seo ppt
 
Emarketing
EmarketingEmarketing
Emarketing
 
Seo
SeoSeo
Seo
 
Emarketing [repaired]
Emarketing [repaired]Emarketing [repaired]
Emarketing [repaired]
 
Web analytics
Web analyticsWeb analytics
Web analytics
 
Competitive Intelligence For Competitive SEOs
Competitive Intelligence For Competitive SEOsCompetitive Intelligence For Competitive SEOs
Competitive Intelligence For Competitive SEOs
 
Emarketing
EmarketingEmarketing
Emarketing
 
John Jantsch's Keynote Presentation at the eMarketing For Entrepreneurs Confe...
John Jantsch's Keynote Presentation at the eMarketing For Entrepreneurs Confe...John Jantsch's Keynote Presentation at the eMarketing For Entrepreneurs Confe...
John Jantsch's Keynote Presentation at the eMarketing For Entrepreneurs Confe...
 

Ähnlich wie Cse535 chapter19-web search

SEO-HIGH TRAFFIC ROUTING
SEO-HIGH TRAFFIC ROUTINGSEO-HIGH TRAFFIC ROUTING
SEO-HIGH TRAFFIC ROUTINGBUDNET
 
(2004) Using internet search engines to market your services
(2004) Using internet search engines to market your services(2004) Using internet search engines to market your services
(2004) Using internet search engines to market your servicesOfficeFinder Network / TrueView360s
 
How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018
How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018
How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018Henning Spjelkavik
 
Introduction to search_marketing
Introduction to search_marketingIntroduction to search_marketing
Introduction to search_marketingBill Hunt
 
Search Enginesv2
Search Enginesv2Search Enginesv2
Search Enginesv2athiracyborg
 
Digital Marketing
Digital MarketingDigital Marketing
Digital MarketingDigipro India
 
Seo Beginners Slide Show
Seo Beginners Slide ShowSeo Beginners Slide Show
Seo Beginners Slide ShowTin180 VietNam
 
Vertical Search - challenges and opportunities
Vertical Search - challenges and opportunitiesVertical Search - challenges and opportunities
Vertical Search - challenges and opportunitiesAndy Black
 
Free Traffic Seo Smo 101 (Search Engine Social Media Optimization) Present...
Free Traffic  Seo Smo 101 (Search Engine   Social Media Optimization) Present...Free Traffic  Seo Smo 101 (Search Engine   Social Media Optimization) Present...
Free Traffic Seo Smo 101 (Search Engine Social Media Optimization) Present...jward5519
 
SEO Secrets for Online Retailers Webinar
SEO Secrets for Online Retailers Webinar SEO Secrets for Online Retailers Webinar
SEO Secrets for Online Retailers Webinar getelastic
 
BEST SEO TIPS - Tricks SEO
BEST SEO TIPS - Tricks SEO BEST SEO TIPS - Tricks SEO
BEST SEO TIPS - Tricks SEO vickybish
 
Vikram seo ppt
Vikram seo pptVikram seo ppt
Vikram seo pptvickybish
 
Binance Supports Number
Binance Supports NumberBinance Supports Number
Binance Supports Numberkanha lal
 

Ähnlich wie Cse535 chapter19-web search (20)

SEO-HIGH TRAFFIC ROUTING
SEO-HIGH TRAFFIC ROUTINGSEO-HIGH TRAFFIC ROUTING
SEO-HIGH TRAFFIC ROUTING
 
(2004) Using internet search engines to market your services
(2004) Using internet search engines to market your services(2004) Using internet search engines to market your services
(2004) Using internet search engines to market your services
 
How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018
How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018
How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018
 
Introduction to search_marketing
Introduction to search_marketingIntroduction to search_marketing
Introduction to search_marketing
 
Search Enginesv2
Search Enginesv2Search Enginesv2
Search Enginesv2
 
Digital Marketing
Digital MarketingDigital Marketing
Digital Marketing
 
seo (1).ppt
seo (1).pptseo (1).ppt
seo (1).ppt
 
seo.ppt
seo.pptseo.ppt
seo.ppt
 
Seo Beginners Slide Show
Seo Beginners Slide ShowSeo Beginners Slide Show
Seo Beginners Slide Show
 
Vertical Search - challenges and opportunities
Vertical Search - challenges and opportunitiesVertical Search - challenges and opportunities
Vertical Search - challenges and opportunities
 
SEO
SEOSEO
SEO
 
17 00 distil rami
17 00 distil rami17 00 distil rami
17 00 distil rami
 
Free Traffic Seo Smo 101 (Search Engine Social Media Optimization) Present...
Free Traffic  Seo Smo 101 (Search Engine   Social Media Optimization) Present...Free Traffic  Seo Smo 101 (Search Engine   Social Media Optimization) Present...
Free Traffic Seo Smo 101 (Search Engine Social Media Optimization) Present...
 
SEO Secrets for Online Retailers Webinar
SEO Secrets for Online Retailers Webinar SEO Secrets for Online Retailers Webinar
SEO Secrets for Online Retailers Webinar
 
BEST SEO TIPS - Tricks SEO
BEST SEO TIPS - Tricks SEO BEST SEO TIPS - Tricks SEO
BEST SEO TIPS - Tricks SEO
 
Vikram seo ppt
Vikram seo pptVikram seo ppt
Vikram seo ppt
 
Seo
SeoSeo
Seo
 
Binance Supports Number
Binance Supports NumberBinance Supports Number
Binance Supports Number
 
SEO PPT
SEO PPTSEO PPT
SEO PPT
 
Seo
SeoSeo
Seo
 

Mehr von Yihong Chen

Contributer
ContributerContributer
ContributerYihong Chen
 
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告Yihong Chen
 
上æĩˇé‡‘į‚ŗ2011嚴金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011嚴金äģˇéĸ„æĩ‹æŠĨ告上æĩˇé‡‘į‚ŗ2011嚴金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011嚴金äģˇéĸ„æĩ‹æŠĨ告Yihong Chen
 
上æĩˇé‡‘į‚ŗ2011åš´2月金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011åš´2月金äģˇéĸ„æĩ‹æŠĨ告上æĩˇé‡‘į‚ŗ2011åš´2月金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011åš´2月金äģˇéĸ„æĩ‹æŠĨ告Yihong Chen
 
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告BAK
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告BAK上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告BAK
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告BAKYihong Chen
 

Mehr von Yihong Chen (9)

0.9
0.90.9
0.9
 
0.5
0.5 0.5
0.5
 
0.3
0.30.3
0.3
 
0 0.3
0 0.30 0.3
0 0.3
 
Contributer
ContributerContributer
Contributer
 
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告
 
上æĩˇé‡‘į‚ŗ2011嚴金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011嚴金äģˇéĸ„æĩ‹æŠĨ告上æĩˇé‡‘į‚ŗ2011嚴金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011嚴金äģˇéĸ„æĩ‹æŠĨ告
 
上æĩˇé‡‘į‚ŗ2011åš´2月金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011åš´2月金äģˇéĸ„æĩ‹æŠĨ告上æĩˇé‡‘į‚ŗ2011åš´2月金äģˇéĸ„æĩ‹æŠĨ告
上æĩˇé‡‘į‚ŗ2011åš´2月金äģˇéĸ„æĩ‹æŠĨ告
 
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告BAK
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告BAK上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告BAK
上æĩˇé‡‘į‚ŗ2011嚴一å­ŖåēĻ金äģˇéĸ„æĩ‹æŠĨ告BAK
 

KÃŧrzlich hochgeladen

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
đŸŦ The future of MySQL is Postgres 🐘
đŸŦ  The future of MySQL is Postgres   🐘đŸŦ  The future of MySQL is Postgres   🐘
đŸŦ The future of MySQL is Postgres 🐘RTylerCroy
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

KÃŧrzlich hochgeladen (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
đŸŦ The future of MySQL is Postgres 🐘
đŸŦ  The future of MySQL is Postgres   🐘đŸŦ  The future of MySQL is Postgres   🐘
đŸŦ The future of MySQL is Postgres 🐘
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Cse535 chapter19-web search

  • 1. CSE 435/535 Information Retrieval Fall 2013 Chapter 19: Web Search Basics
  • 2. Brief (non-technical) history īŽ Early keyword-based engines īŽ īŽ Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997 Sponsored search ranking: Goto.com (morphed into Overture.com → Yahoo!) īŽ īŽ Your search ranking depended on how much you paid Auction for keywords: casino was expensive!
  • 3. Brief (non-technical) history īŽ 1998+: Link-based ranking pioneered by Google īŽ īŽ īŽ īŽ Blew away all early engines save Inktomi Great user experience in search of a business model Meanwhile Goto/Overture’s annual revenues were nearing $1 billion Result: Google added paid-placement “ads” to the side, independent of search results īŽ Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
  • 5. Web search basics Sponsored Links CG Appliance Express Discount Appliances (650) 756-3931 Same Day Certified Installation www.cgappliance.com San Francisco-Oakland-San Jose, CA User Miele Vacuum Cleaners Miele Vacuums- Complete Selection Free Shipping! www.vacuums.com Miele Vacuum Cleaners Miele-Free Air shipping! All models. Helpful advice. www.best-vacuum.com Web Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds) Miele, Inc -- Anything else is a compromise Web spider At the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ... www.miele.com/ - 20k - Cached - Similar pages Miele Welcome to Miele, the home of the very best appliances and kitchens in the world. www.miele.co.uk/ - 3k - Cached - Similar pages Miele - Deutscher Hersteller von Einbaugeräten, Hausgeräten ... - [ Translate this page ] Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit ...ein Leben lang. ... Wählen Sie die Miele Vertretung Ihres Landes. www.miele.de/ - 10k - Cached - Similar pages Herzlich willkommen bei Miele Österreich - [ Translate this page ] Herzlich willkommen bei Miele Österreich Wenn Sie nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE ... www.miele.at/ - 3k - Cached - Similar pages Search Indexer The Web Indexes Ad indexes
  • 6. User Needs īŽ Need [Brod02, RL04] īŽ īŽ īŽ Informational – want to learn about something (~40% / 65%) Low hemoglobin Navigational – want to go to that page (~25% / 15%) United Airlines Transactional – want to do something (web-mediated) (~35% / 20%) īŽ Access a service Seattle weather Mars surface images īŽ Downloads īŽ īŽ Shop Canon S410 Gray areas īŽ īŽ Car rental Brasil Find a good hub Exploratory search “see what’s there”
  • 7. How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
  • 8. Users’ empirical evaluation of results īŽ Quality of pages varies widely īŽ īŽ Relevance is not enough Other desirable qualities (non IR!!) īŽ īŽ īŽ īŽ Precision vs. recall īŽ īŽ On the web, recall seldom matters What matters īŽ īŽ Precision at 1? Precision above the fold? Comprehensiveness – must be able to deal with obscure queries īŽ īŽ Content: Trustworthy, diverse, non-duplicated, well maintained Web readability: display correctly & fast No annoyances: pop-ups, etc Recall matters when the number of matches is very small User perceptions may be unscientific, but are significant over a large aggregate
  • 9. Users’ empirical evaluation of engines īŽ īŽ īŽ īŽ īŽ Relevance and validity of results UI – Simple, no clutter, error tolerant Trust – Results are objective Coverage of topics for polysemic queries Pre/Post process tools provided īŽ īŽ īŽ īŽ Mitigate user errors (auto spell check, search assist,â€Ļ) Explicit: Search within results, more like this, refine ... Anticipative: related searches Deal with idiosyncrasies īŽ Web specific vocabulary īŽ īŽ īŽ Impact on stemming, spell-check, etc Web addresses typed in the search box â€Ļ
  • 10. The Web document collection īŽ īŽ īŽ īŽ īŽ īŽ The Web īŽ No design/co-ordination Distributed content creation, linking, democratization of publishing Content includes truth, lies, obsolete information, contradictions â€Ļ Unstructured (text, html, â€Ļ), semistructured (XML, annotated photos), structured (Databases)â€Ļ Scale much larger than previous text collections â€Ļ but corporate records are catching up Growth – slowed down from initial “volume doubling every few months” but still expanding Content can be dynamically generated
  • 12. The trouble with sponsored search â€Ļ īŽ īŽ It costs money. What’s the alternative? Search Engine Optimization: īŽ īŽ īŽ īŽ īŽ “Tuning” your web page to rank highly in the algorithmic search results for select keywords Alternative to paying for placement Thus, intrinsically a marketing function Performed by companies, webmasters and consultants (“Search engine optimizers”) for their clients Some perfectly legitimate, some very shady
  • 13. Simplest forms īŽ First generation engines relied heavily on tf/idf īŽ īŽ The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s SEOs responded with dense repetitions of chosen terms īŽ īŽ e.g., maui resort maui resort maui resort Often, the repetitions would be in the same color as the background of the web page īŽ īŽ Repeated terms got indexed by crawlers But not visible to humans on browsers Pure word density cannot be trusted as an IR signal
  • 14. Variants of keyword stuffing īŽ īŽ Misleading meta-tags, excessive repetition Hidden text with colors, style sheet tricks, etc. Meta-Tags = “â€Ļ London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, â€Ļ”
  • 15. Search engine optimization (Spam) īŽ Motives īŽ īŽ īŽ Operators īŽ īŽ īŽ īŽ Commercial, political, religious, lobbies Promotion funded by advertising budget Contractors (Search Engine Optimizers) for lobbies, companies Web masters Hosting services Forums īŽ E.g., Web master world ( www.webmasterworld.com ) īŽ īŽ Search engine specific tricks Discussions about academic papers īŠ
  • 17. Features identifying synthetic content īŽ Average word length īŽ īŽ Word frequency distribution īŽ īŽ Certain words (“the”, “a”, â€Ļ) appear more often than others N-gram frequency distribution īŽ īŽ The mean word length for English prose is about 5 characters; but longer for some forms of keyword stuffing Some words are more likely to occur next to each other than others Grammatical well-formedness īŽ Natural-language parsing is expensive
  • 18. Really good synthetic content “Nigritude Ultramarine”: An SEO competition Links to keep crawlers going Grammatically well-formed but meaningless sentences
  • 19. Content “repurposing” īŽ Content repurposing: The practice of incorporating all or portions of other (unaffiliated) web pages īŽ īŽ īŽ A “convenient” way to machine generate pages that contain human-authored content Not even necessarily illegal â€Ļ Two flavors: īŽ īŽ Incorporate large portions of a single page Incorporate snippets of multiple pages
  • 20. Example of page-level content “repurposing”
  • 21. Example of phrase-level content “repurposing” How is monetization taking place?
  • 22. Techniques for detecting content repurposing īŽ Single-page flavor: Cluster pages into equivalence classes of very similar pages īŽ īŽ īŽ If most pages on a site a very similar to pages on other sites, raise a red flag (There are legitimate replicated sites; e.g. mirrors of Linux man pages) Many-snippets flavor: Test if page consists mostly of phrases that also occur somewhere else īŽ īŽ Computationally hard problem Probabilistic technique that makes it tractable (Fetterly et al SIGIR 2005 paper)
  • 23. Cloaking īŽ īŽ Serve fake content to search engine spider DNS cloaking: Switch IP address. Impersonate Y SPAM Is this a Search Engine spider? Cloaking N Real Doc
  • 24. More spam techniques īŽ Doorway pages īŽ īŽ Link spamming īŽ īŽ īŽ īŽ Pages optimized for a single keyword that re-direct to the real target page links between pages that are present for reasons other than merit Mutual admiration societies, hidden links, awards – more on these later Domain flooding: numerous domains that point or re-direct to a target page Robots īŽ Fake query stream – rank checking programs īŽ īŽ “Curve-fit” ranking programs of search engines Millions of submissions via Add-Url
  • 25. The war against spam īŽ Quality signals - Prefer authoritative pages based on: īŽ Votes from authors (linkage signals) Votes from users (usage signals) īŽ īŽ īŽ īŽ īŽ īŽ īŽ Anti robot test , i.e. not submitted by a robot Limits on meta-keywords Robust link analysis īŽ īŽ Ignore statistically implausible linkage (or text) Use link analysis to detect spammers (guilt by association) īŽ īŽ Training set based on known spam Family friendly filters īŽ Policing of URL submissions īŽ Spam recognition by machine learning Linguistic analysis, general classification techniques, etc. For images: flesh tone detectors, source text analysis, etc. Editorial intervention īŽ īŽ īŽ īŽ Blacklists Top queries audited Complaints addressed Suspect pattern detection
  • 26. More on spam īŽ Web search engines have policies on SEO practices they tolerate/block īŽ īŽ īŽ īŽ http://help.yahoo.com/help/us/ysearch/index.html http://www.google.com/intl/en/webmasters/ Adversarial IR: the unending (technical) battle between SEO’s and web search engines Research http://airweb.cse.lehigh.edu/
  • 27. Size of the web
  • 28. What is the size of the web ? īŽ Issues īŽ The web is really infinite īŽ īŽ īŽ īŽ īŽ Dynamic content, e.g., calendar Soft 404: www.yahoo.com/<anything> is a valid page Static web contains syntactic duplication, mostly due to mirroring (~30%) Some servers are seldom connected Who cares? īŽ īŽ īŽ Media, and consequently the user Engine design Engine crawl policy. Impact on recall.
  • 29. What can we attempt to measure? The relative sizes of search engines īŽ īŽ īŽ The notion of a page being indexed is still reasonably well defined. Already there are problems īŽ īŽ Document extension: e.g. engines index pages not yet crawled, by indexing anchortext. Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.) The coverage of a search engine relative to another particular crawling process. īŽ
  • 30. Graph Structure in the Web http://www9.org/w9cdrom/160/160.html
  • 31. Zipf’s Law on the Web īŽ Number of in-links/out-links to/from a page has a Zipfian distribution. īŽ īŽ Length of web pages has a Zipfian distribution. īŽ īŽ A large number of docs have the same number of in-links/out-links to/from a page Certain page lengths are most common; i.e. a huge number of docs with the same page length. Number of hits to a web page has a Zipfian distribution. īŽ A large number of pages will generate the same number of hits; dramatic drop-off in number of pages as the above frequency reduces
  • 32. New definition? īŽ (IQ is whatever the IQ tests measure.) īŽ īŽ Different engines have different preferences īŽ īŽ The statically indexable web is whatever search engines index. max url depth, max count/host, anti-spam rules, priority rules, etc. Different engines index different things under the same URL: īŽ frames, meta-keywords, document restrictions, document extensions, ...
  • 33. Relative Size from Overlap Given two engines A and B Sample URLs randomly from A Check if contained in B and vice versa A∊ B A∊ B = A∊ B = (1/2) * Size A (1/6) * Size B (1/2)*Size A = (1/6)*Size B ∴ Size A / Size B = (1/6)/(1/2) = 1/3 Each test involves: (i) Sampling (ii) Checking
  • 34. Sampling URLs īŽ Ideal strategy: Generate a random URL and check for containment in each index. īŽ Problem: Random URLs are hard to find! Enough to generate a random URL contained in a given Engine. Approach 1: Generate a random URL contained in a given engine īŽ īŽ īŽ īŽ Pick a random query from search log; pick random page returned as search result Suffices for the estimation of relative size Approach 2: Random walks / IP addresses īŽ In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)
  • 35. Statistical methods īŽ Approach 1 īŽ īŽ īŽ *** Random queries: preferred method Random searches Approach 2 īŽ īŽ Random IP addresses Random walks
  • 36. Random URLs from random queries īŽ Generate random query: how? īŽ īŽ Not an English Lexicon: 400,000+ words from a web crawl dictionary Conjunctive Queries: w1 and w2 e.g., vocalists AND rsi īŽ īŽ īŽ īŽ Get 100 result URLs from engine A Choose a random URL as the candidate to check for presence in engine B (how do you do this?) This distribution induces a probability weight W(p) for each page. Conjecture: W(SEA) / W(SEB) ~ |SEA| / |SEB|
  • 37. Query Based Checking īŽ Strong Query to check whether an engine B has a document D (corresponding to random URL): īŽ īŽ īŽ īŽ Download D. Get list of words. Use 8 low frequency words as AND query to B Check if D is present in result set. Problems: īŽ īŽ īŽ īŽ īŽ Near duplicates (not D, but something identical to D) Frames Redirects Engine time-outs Is 8-word query good enough?
  • 38. Advantages & disadvantages īŽ īŽ Statistically sound under the induced weight. Biases induced by random query īŽ Query Bias: Favors content-rich pages in the language(s) of the lexicon īŽ īŽ īŽ Ranking Bias: Solution: Use conjunctive queries & fetch all Checking Bias: Duplicates, impoverished pages omitted Document or query restriction bias: engine might not deal properly with 8 words conjunctive query īŽ īŽ Malicious Bias: Sabotage by engine Operational Problems: Time-outs, failures, engine inconsistencies, index modification.
  • 39. Alt 1: Random searches īŽ īŽ These are real queries based on real users, rather than “random queries” Choose random searches extracted from a local log [Lawrence & Giles 97] or build “random searches” [Notess] īŽ īŽ īŽ Use only queries with small results sets. Count normalized URLs in result sets. Use ratio statistics
  • 40. Advantages & disadvantages īŽ Advantage īŽ īŽ Might be a better reflection of the human perception of coverage Issues īŽ īŽ īŽ Samples are correlated with source of log Duplicates Technical statistical problems (must have nonzero results, ratio average not statistically sound)
  • 41. Random searches īŽ īŽ īŽ 575 & 1050 queries from the NEC RI employee logs 6 Engines in 1998, 11 in 1999 Implementation: īŽ Restricted to queries with < 600 results in total īŽ Counted URLs from each engine after verifying query match īŽ Computed size ratio & overlap for individual queries īŽ Estimated index size ratio & overlap by averaging over all queries
  • 42. Queries from Lawrence and Giles study īŽ īŽ īŽ īŽ īŽ īŽ īŽ īŽ īŽ īŽ adaptive access control neighborhood preservation topographic hamiltonian structures right linear grammar pulse width modulation neural unbalanced prior probabilities ranked assignment method internet explorer favourites importing karvel thornber zili liu īŽ īŽ īŽ īŽ īŽ īŽ īŽ īŽ īŽ īŽ softmax activation function bose multidimensional system theory gamma mlp dvi2pdf john oliensis rieke spikes exploring neural video watermarking counterpropagation network fat shattering dimension abelson amorphous computing
  • 43. Alt 2: Random IP addresses īŽ īŽ Generate random IP addresses Find a web server at the given address īŽ īŽ If there’s one Collect all pages from server īŽ From this, choose a page at random
  • 44. Random IP addresses īŽ HTTP requests to random IP addresses īŽ īŽ īŽ Ignored: empty or authorization required or excluded [Lawr99] Estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers. OCLC using IP sampling found 8.7 M hosts in 2001 īŽ īŽ Netcraft [Netc02] accessed 37.2 million hosts in July 2002 [Lawr99] exhaustively crawled 2500 servers and extrapolated īŽ īŽ Estimated size of the web to be 800 million Estimated use of metadata descriptors: īŽ Meta tags (keywords, description) in 34% of home pages, Dublin core metadata in 0.3%
  • 45. Advantages & disadvantages īŽ Advantages īŽ īŽ īŽ Clean statistics Independent of crawling strategies Disadvantages īŽ īŽ īŽ Doesn’t deal with duplication Many hosts might share one IP, or not accept requests No guarantee all pages are linked to root page. īŽ īŽ Power law for # pages/hosts generates bias towards sites with few pages. īŽ īŽ Eg: employee pages But bias can be accurately quantified IF underlying distribution understood Potentially influenced by spamming (multiple IP’s for same server to avoid IP block)
  • 46. Alt 3: Random walks īŽ īŽ View the Web as a directed graph Build a random walk on this graph īŽ Includes various “jump” rules back to visited sites īŽ īŽ īŽ Converges to a stationary distribution īŽ īŽ īŽ īŽ īŽ Does not get stuck in spider traps! Can follow all links! Must assume graph is finite and independent of the walk. Conditions are not satisfied (cookie crumbs, flooding) Time to convergence not really known Sample from stationary distribution of walk Use the “strong query” method to check coverage by SE
  • 47. Advantages & disadvantages īŽ Advantages īŽ īŽ īŽ “Statistically clean” method at least in theory! Could work even for infinite web (assuming convergence) under certain metrics. Disadvantages īŽ īŽ īŽ List of seeds is a problem. Practical approximation might not be valid. Non-uniform distribution īŽ Subject to link spamming
  • 48. Conclusions īŽ īŽ īŽ īŽ No sampling solution is perfect. Lots of new ideas ... ....but the problem is getting harder Quantitative studies are fascinating and a good research problem
  • 50. Duplicate documents īŽ īŽ The web is full of duplicated content Strict duplicate detection = exact match īŽ īŽ Not as common But many, many cases of near duplicates īŽ E.g., Last modified date the only difference between two copies of a page
  • 51. Duplicate/Near-Duplicate Detection īŽ īŽ Duplication: Exact match can be detected with fingerprints (64-bit), if fingerprints match, check further Near-Duplication: Approximate match īŽ Overview īŽ īŽ Compute syntactic similarity with an edit-distance measure Use similarity threshold to detect near-duplicates īŽ īŽ E.g., Similarity > 80% => Documents are “near duplicates” Not transitive though sometimes used transitively
  • 52. Computing Similarity īŽ īŽ Features: īŽ Segments of a document (natural or artificial breakpoints) īŽ Shingles (Word N-Grams) īŽ a rose is a rose is a rose → a_rose_is_a rose_is_a_rose is_a_rose_is a_rose_is_a Similarity Measure between two docs (= sets of shingles) īŽ Set intersection īŽ Specifically (Size_of_Intersection / Size_of_Union)
  • 53. Shingles + Set Intersection Computing exact set intersection of shingles between all pairs of documents is expensive/intractable īŽ īŽ Approximate using a cleverly chosen subset of shingles from each (a sketch) Estimate (size_of_intersection / size_of_union) based on a short sketch īŽ Doc Doc A A Shingle set A Doc Doc B B Shingle set A Sketch A Jaccard Sketch A
  • 54. Sketch of a document īŽ Create a “sketch vector” (of size ~200) for each document īŽ īŽ Documents that share â‰Ĩ t (say 80%) corresponding vector elements are near duplicates For doc D, sketchD[ i ] is as follows: īŽ īŽ īŽ Let f map all shingles in the universe to 0..2m (e.g., f = fingerprinting) Let Ī€i be a random permutation on 0..2m Pick MIN {Ī€i(f(s))} over all shingles s in D
  • 55. Computing Sketch[i] for Doc1 Document 1 264 Start with 64-bit f(shingles) 264 Permute on the number line 2 64 264 with Ī€i Pick the min value
  • 56. Test if Doc1.Sketch[i] = Doc2.Sketch[i] Document 2 Document 1 264 264 264 264 A 264 264 2 64 B Are these equal? Test for 200 random permutations: Ī€1, Ī€2,â€Ļ Ī€200 264
  • 57. Howeverâ€Ļ Document 2 Document 1 264 264 264 264 A 264 264 B 264 264 A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection) Why? Claim: This happens with probability Size_of_intersection / Size_of_union
  • 58. Set Similarity of sets Ci , Cj Jaccard(Ci , C j ) = īŽ īŽ Ci ī‰ C j Ci ī• C j View sets as columns of a matrix A; one row for each element in the universe. aij = 1 indicates presence of item i in set j C1 C2 Example 0 1 1 0 1 1 0 1 0 1 Jaccard(C 1 ,C 2 ) = 2/5 = 0.4
  • 59. Key Observation īŽ For columns Ci, Cj, four types of rows Ci A B C D īŽ īŽ 1 1 0 0 Cj 1 0 1 0 Overload notation: A = # of rows of type A Claim A Jaccard(Ci , C j ) = A+B+C
  • 60. “Min” Hashing īŽ īŽ Randomly permute rows Hash h(Ci) = index of first row with 1 in column Ci īŽ Surprising Property īŽ Why? P [ h(Ci ) = h(C j ) ] = Jaccard ( Ci , C j ) īŽ īŽ īŽ Both are A/(A+B+C) Min: Look down columns Ci, Cj until first non-Type-D row h(Ci) = h(Cj) īƒŸīƒ  type A row īŽ Because we chose a random permutation, prob of this occurring is A/(A+B+C)
  • 61. Min-Hash sketches īŽ īŽ Pick P random row permutations MinHash sketch SketchD = list of P (usually, 200) indexes (i.e. integers) of first rows with 1 in column C īŽ Similarity of signatures īŽ īŽ Let sim[sketch(Ci),sketch(Cj)] = fraction of permutations where MinHash values agree Observe E[sim(sig(Ci),sig(Cj))] = Jaccard(Ci,Cj)
  • 62. Example R1 R2 R3 R4 R5 C1 1 0 1 1 0 C2 0 1 0 0 1 C3 1 1 0 1 0 Signatures (composed of min) S1 S2 S3 Perm 1 = (12345) 1 2 1 Perm 2 = (54321) 4 5 4 Perm 3 = (34512) 3 5 4 Similarities 1-2 1-3 2-3 Col-Col 0.00 0.50 0.25 Sig-Sig 0.00 0.67 0.00 Good estimate
  • 63. Implementation Trick īŽ īŽ Permuting universe even once is prohibitive Row Hashing īŽ īŽ īŽ Pick P hash functions hk: {1,â€Ļ,n}īƒ {1,â€Ļ,O(n)} Ordering under hk gives random permutation of rows One-pass Implementation īŽ For each Ci and hk, keep “slot” for min-hash value īŽ Initialize all slot(Ci,hk) to infinity īŽ Scan rows in arbitrary order looking for 1’s īŽ Suppose row Rj has 1 in column Ci īŽ For each hk,
  • 64. Example R1 R2 R3 R4 R5 C1 1 0 1 1 0 C2 0 1 1 0 1 h(x) = x mod 5 g(x) = 2x+1 mod 5 h(1) = 1 g(1) = 3 C1 slots C2 slots 1 3 - h(2) = 2 g(2) = 0 1 3 2 0 h(3) = 3 g(3) = 2 1 2 2 0 h(4) = 4 g(4) = 4 h(5) = 0 g(5) = 1 1 2 1 2 2 0 0 0
  • 65. Comparing Signatures īŽ īŽ Signature Matrix S īŽ Rows = Hash Functions īŽ Columns = Columns īŽ Entries = Signatures Can compute – Pair-wise similarity of any pair of signature columns
  • 66. All signature pairs īŽ īŽ Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of documents. But we still have to estimate N2 coefficients where N is the number of web pages. īŽ īŽ īŽ Still slow One solution: locality sensitive hashing (LSH) Another solution: sorting (Henzinger 2006)

Hinweis der Redaktion

  1. Wikipedia has a good description of the Nigritude Ultramarine contest, which ran from 5/7/2004 to 7/7/2004 and drew 200 contestants.
  2. Unfortunately, I can’t show you just how the first replicating site monetizes on the Paris Hilton sentences, but you might be able to figure it out yourselves. The second site is even more amazing: Not only does it try to sell vacations to Paris by invoking Paris Hilton’s name, but it also manages to put Miss Hilton and Condoleezza Rice into the same sentence!
  3. Union = 7-2 = 5. The number of unique set items in both sets