Tutorial:
Web Information Retrieval

          Monika Henzinger
  http://www.henzinger.com/~monika
What is this talk about?
l Topic:
   – Algorithms for retrieving information on the
         Web
l Non-Topics:
  – Algorithmic issues in classic information retrieval
    (IR), e.g. stemming
  – String algorithms, e.g. approximate matching
  – Other algorithmic issues related to the Web:
     • networking & routing
      • caching
     • security
     • e-commerce
 M. Henzinger            Web Information Retrieval        2
Preliminaries

l Each Web page has a unique URL, e.g.

   http://theory.stanford.edu/~focs98/tutorials.html

      "http://"               = access protocol
      "theory.stanford.edu"   = host name = domain name
      "/~focs98/"             = path
      "tutorials.html"        = page name

l (Hyper) link = pointer from one page to another, loads
  second page if clicked on
l In this talk: document = (Web) page

M. Henzinger            Web Information Retrieval               3
Example of a query
                          princess diana
  Engine 1                    Engine 2                   Engine 3




          Relevant and             Relevant but            Not relevant
           high quality             low quality          index pollution




M. Henzinger                 Web Information Retrieval                     4
Outline

l Classic IR vs. Web IR

l Some IR tools specific to the Web
      For each type:
        – Examples
        – Algorithmic issues
      Details on:
        – Ranking
        – Duplicate elimination
        – Search-by-example
        – Measuring search engine index quality

l Conclusions
        – Open problems


M. Henzinger                  Web Information Retrieval                         5
Classic IR

l Input: Document collection
l Goal: Retrieve documents or text with information
     content that is relevant to user’s information need
l Two aspects:
       1. Processing the collection
       2. Processing queries (searching)
l Reference Texts: SW’97, BR’99



M. Henzinger               Web Information Retrieval       6
Determining query results
“model” = strategy for determining which documents to return
l Logical model: String matches plus AND, OR, NOT
l Vector model (Salton et al.):
      – Documents and query represented as vector of terms
      – Vector entry i = weight of term i = function of
        frequencies within document and within collection
      – Similarity of document & query = cosine of angle of their
        vectors
      – Query result: documents ordered by similarity
l Other models used in IR but not discussed here:
      – Probabilistic model, cognitive model, …

M. Henzinger               Web Information Retrieval            7
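
A minimal Python sketch of the vector-model ranking just described, using raw term frequencies as weights (a real engine would use a weighting function that also depends on collection frequencies, as the slide notes); the toy documents and names are illustrative:

```python
import math
from collections import Counter

def vectorize(text):
    # Raw term-frequency vector; real weights also use collection statistics.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine of the angle between two sparse term vectors.
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = ["mouse trap software downloads", "how to trap mice alive", "jaguar car dealers"]
query = vectorize("mouse trap")
ranked = sorted(docs, key=lambda d: cosine(query, vectorize(d)), reverse=True)
print(ranked)   # query result: documents ordered by similarity
```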
IR on the Web

l Input: The publicly accessible Web
l Goal: Retrieve high quality pages that are relevant to
  user’s need
   – Static (files: text, audio, … )
   – Dynamically generated on request: mostly data base
     access
l Two aspects:
   1. Processing and representing the collection
      • Gathering the static pages
      • “Learning” about the dynamic pages
   2. Processing queries (searching)

M. Henzinger           Web Information Retrieval       8
What’s different about the Web?
(1) Pages:
l Bulk …………………… >1B (12/99) [GL’99]
l Lack of stability……….. Estimates: 23%/day, 38%/week [CG’99]
l Heterogeneity
   – Type of documents .. Text, pictures, audio, scripts,…
   – Quality ……………… From dreck to ICDE papers …
   – Language ………….. 100+
l Duplication
   – Syntactic……………. 30% (near) duplicates
   – Semantic……………. ??
l Non-running text……… many home pages, bookmarks, ...
l High linkage…………… ≥ 8 links/page on average

M. Henzinger           Web Information Retrieval          9
Typical home page: non-running
               text




M. Henzinger           Web Information Retrieval   10
Typical home page:
               Non-running text




M. Henzinger           Web Information Retrieval   11
The big challenge




                Meet the user needs
                       given
           the heterogeneity of Web pages




M. Henzinger           Web Information Retrieval   12
What’s different about the Web?
(2) Users:
l Make poor queries                  l Specific behavior
  – Short (2.35 terms avg)              – 85% look over one result
  – Imprecise terms                       screen only
  – Sub-optimal syntax                  – 78% of queries are not
    (80% queries without                  modified
    operator)                           – Follow links
  – Low effort                          – See various user studies
l Wide variance in                        in CHI, Hypertext, SIGIR,
                                          etc.
  – Needs
  – Knowledge
  – Bandwidth
M. Henzinger           Web Information Retrieval                13
The bigger challenge



                Meet the user needs
                        given
           the heterogeneity of Web pages
                         and
              the poorly made queries.



M. Henzinger           Web Information Retrieval   14
Why don’t the users get what
                 they want?
                                          Example

                 User need                I need to get rid of mice in
                                          the basement
 Translation
  problems       User request             What's the best way to
                 (verbalized)             trap mice alive?

                 Query to                 mouse trap
 Polysemy        IR system
 Synonymy
                 Results                  Software, toy cars,
                                          inventive products, etc.
  M. Henzinger            Web Information Retrieval                        15
Google output: mouse trap

        [screenshot of Google results for "mouse trap"; one result marked
         as relevant]

M. Henzinger           Web Information Retrieval   16
Google output: trap mice
        [screenshot of Google results for "trap mice"; four results marked
         as relevant]

M. Henzinger           Web Information Retrieval       17
The bright side:
                Web advantages vs. classic IR

User                                        Collection/tools
l Many tools available                      l Redundancy
l Personalization                           l Hyperlinks
l Interactivity (refine the                 l Statistics
  query if needed)                             – Easy to gather
                                               – Large sample sizes
                                            l Interactivity (make the
                                              users explain what they
                                              want)



 M. Henzinger             Web Information Retrieval                18
Quantifying the quality of results

l How to evaluate different strategies?

l How to compare different search engines?




M. Henzinger            Web Information Retrieval   19
Classic evaluation of IR systems
  We start from a human made relevance judgement
  for each (query, page) pair and compute:

l Precision: % of returned pages
  that are relevant.                                   Relevant    Returned
                                                       pages       pages
l Recall: % of relevant pages
  that are returned.
l Precision at (rank) 10: % of
  top 10 pages that are relevant
l Relative Recall: % of relevant
  pages found by some means
  that are returned
                                                       All pages
  M. Henzinger             Web Information Retrieval                     20
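
A tiny worked example of these measures with made-up page identifiers and judgments:

```python
returned = {"p1", "p2", "p3", "p4"}            # pages returned for the query
relevant = {"p2", "p4", "p7", "p9", "p11"}     # pages judged relevant by a human

precision = len(returned & relevant) / len(returned)   # 2/4 = 0.50
recall    = len(returned & relevant) / len(relevant)   # 2/5 = 0.40
print(precision, recall)
```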
Evaluation in the Web context

l Quality of pages varies widely

l We need both relevance and high quality = value of
     page.
⇒ Precision at 10: % of top 10 pages that are valuable
  …




M. Henzinger           Web Information Retrieval         21
Web IR tools

l General-purpose search engines:
       – direct: AltaVista, Excite, Google, Infoseek, Lycos, ….
       – Indirect (Meta-search): MetaCrawler, DogPile,
         AskJeeves, InvisibleWeb, …
l Hierarchical directories: Yahoo!, all portals.
l Specialized search engines:
       – Home page finder: Ahoy                          Database
                                                         mostly built
       – Shopping robots: Jango, Junglee,…                by hand
       – Applet finders

M. Henzinger               Web Information Retrieval            22
Web IR tools (cont...)

l Search-by-example: Alexa’s “What’s related”,
  Excite’s “More like this”, Google’s “Googlescout”,
  etc.
l Collaborative filtering: Firefly, GAB, …
l …

l Meta-information:
  – Search Engine Comparisons
  – Query log statistics
  –…

M. Henzinger            Web Information Retrieval      23
General purpose search engines

l Search engines’ components:

       – Spider = Crawler -- collects the documents

       – Indexer -- process and represents the data

       – Search interface -- answers queries




M. Henzinger               Web Information Retrieval   24
Algorithmic issues related to
                 search engines
 l Collecting documents:
      – Priority
      – Load balancing
         • Internal
         • External
      – Trap avoidance
      – ...

 l Processing and representing the data:
      – Query-independent ranking
      – Graph representation
      – Index building
      – Duplicate elimination
      – Categorization
      – ...

 l Processing queries:
      – Query-dependent ranking
      – Duplicate elimination
      – Query refinement
      – Clustering
      – ...


  M. Henzinger            Web Information Retrieval                  25
Ranking

l Goal: order the answers to a query in decreasing order
  of value
   – Query-independent: assign an intrinsic value to a
     document, regardless of the actual query
   – Query-dependent: value is determined only wrt a
     particular query.
   – Mixed: combination of both valuations.
l Examples
   – Query-independent: length, vocabulary, publication
     data, number of citations (indegree), etc.
   – Query-dependent: cosine measure


M. Henzinger             Web Information Retrieval         26
Some ranking criteria

l Content-based techniques (variant of term vector model
  or probabilistic model) – mostly query-dependent
l Ad-hoc factors (anti-porn heuristics, publication/location
  data, ...) – mostly query-independent
l Human annotations
l Connectivity-based techniques
   – Query-independent: PageRank [PBMW’98, BP’98],
     indegree [CK’97], …
   – Query-dependent: HITS [K’98], …




M. Henzinger            Web Information Retrieval          27
Connectivity analysis

l Idea: Mine hyperlink information of the Web
l Assumptions:
       – Links often connect related pages
       – A link between pages is a recommendation
l Classic IR work (citations = links) a.k.a. “Bibliometrics”
  [K’63, G’72, S’73,…]
l Socio-metrics [K’53, MMSM’86,…]
l Many Web related papers build on this idea [PPR’96,
  AMM’97, S’97, CK’97, K’98, BP’98,…]

M. Henzinger              Web Information Retrieval
Graph representation for the Web

l A node for each page u

l A directed edge (u,v) if page u contains a hyperlink to
     page v.




M. Henzinger            Web Information Retrieval           29
Query-independent ranking:
               Motivation for PageRank

l Assumption: A link from page A to page B is a
  recommendation of page B by the author of A
  (we say B is successor of A)
⇒ Quality of a page is related to its in-degree


l Recursion: Quality of a page is related to
       – its in-degree, and to
       – the quality of pages linking to it
⇒ PageRank [BP '98]


M. Henzinger                 Web Information Retrieval   30
Definition of PageRank [BP’98]

l Consider the following infinite random walk (surf):
       – Initially the surfer is at a random page
       – At each step, the surfer proceeds
               • to a randomly chosen web page with probability d
               • to a randomly chosen successor of the current
                 page with probability 1-d
l The PageRank of a page p is the fraction of steps
  the surfer spends at p in the limit.


M. Henzinger                    Web Information Retrieval           31
PageRank (cont.)

Said differently:
l Transition probability matrix is

                    d × U + (1 − d ) × A
     where U is the uniform distribution and A is adjacency
     matrix (normalized)
l PageRank = stationary probability for this Markov chain,
  i.e.
   PageRank(u) = d/n + (1 − d) · Σ_{(v,u) ∈ E} PageRank(v) / outdegree(v)

     where n is the total number of nodes in the graph
l Used as one of the ranking criteria in Google
M. Henzinger              Web Information Retrieval           32
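
A minimal power-iteration sketch of this random-surfer definition on a small dictionary-based graph; the function name, damping value and iteration count are illustrative, not Google's implementation:

```python
def pagerank(out_links, d=0.15, iterations=50):
    # out_links: dict mapping every page to the list of pages it links to
    # (every page, even one with no out-links, must appear as a key).
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: d / n for p in pages}              # random jump with probability d
        for p, succs in out_links.items():
            if succs:                                # follow a random out-link with prob. 1-d
                share = (1 - d) * rank[p] / len(succs)
                for q in succs:
                    new[q] += share
            else:                                    # dangling page: treat as a uniform jump
                for q in pages:
                    new[q] += (1 - d) * rank[p] / n
        rank = new
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```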
Output from Google:
               princess diana

        [screenshot of Google results for "princess diana"; four results
         marked as relevant]

M. Henzinger           Web Information Retrieval       33
Query-dependent ranking:
                 the neighborhood graph
l Subgraph associated to each query
                             Query Results
               Back Set       = Start Set               Forward Set

                 b1             Result1                     f1
                 b2                                         f2
                                Result2
                                                             ...
                 …                    ...
                 bm                                          fs
                                Resultn

    An edge for each hyperlink, but no edges within the same host


M. Henzinger                Web Information Retrieval                 34
HITS [K’98]

l Goal: Given a query find:



       – Good sources of content (authorities)



       – Good sources of links (hubs)




M. Henzinger               Web Information Retrieval   35
Intuition
l Authority comes from in-edges.
  Being a good hub comes from out-edges.




l Better authority comes from in-edges from good hubs.
  Being a better hub comes from out-edges to good
  authorities.




M. Henzinger               Web Information Retrieval     36
HITS details

    Repeat until HUB and AUTH converge:
               Normalize HUB and AUTH
               HUB[v] := Σ AUTH[ui] for all ui with Edge(v, ui)
               AUTH[v] := Σ HUB[wi] for all wi with Edge(wi, v)

                                       v
                   w1                                      u1
                   w2            A H                        u2
                   ...                                     ...
                   wk                                       uk

M. Henzinger                   Web Information Retrieval          37
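
A minimal sketch of the HITS iteration above on an edge-list representation of the neighborhood graph; the normalization and fixed iteration count are illustrative:

```python
import math

def hits(edges, iterations=50):
    # edges: iterable of (v, u) pairs meaning "page v links to page u"
    edges = list(edges)
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # AUTH[v] := sum of HUB[w] over all w with an edge (w, v)
        auth_new = {n: 0.0 for n in nodes}
        for v, u in edges:
            auth_new[u] += hub[v]
        # HUB[v] := sum of AUTH[u] over all u with an edge (v, u)
        hub_new = {n: 0.0 for n in nodes}
        for v, u in edges:
            hub_new[v] += auth_new[u]
        # normalize so the scores stay bounded
        na = math.sqrt(sum(x * x for x in auth_new.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in hub_new.values())) or 1.0
        auth = {n: x / na for n, x in auth_new.items()}
        hub = {n: x / nh for n, x in hub_new.items()}
    return hub, auth

hub, auth = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a3")])
print(auth)   # a1, pointed to by both hubs, gets the highest authority score
```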
Output from HITS:
                jobs

1. www.ajb.dni.uk              - British career website   ☺
2. www.britnet.co.uk/jobs.htm
3. www.monster.com             - US career website        ☺
4. www.careermosaic.com        - US career website        ☺
5. plasma-gate.weizmann.ac.il/Job…
6. www.jobtrak.com             - US career website        ☺
7. www.occ.com                 - US career website        ☺
8. www.jobserve.com            - US career website        ☺
9. www.allny.com/jobs.html     - jobs in NYC
10. www.commarts.com/bin/…     - US career website        ☺

 M. Henzinger            Web Information Retrieval           38
Output from HITS:
               +jaguar +car

1. www.toyota.com
2. www.chryslercars.com
3. www.vw.com
4. www.jaguarvehicles.com    ☺
5. www.dodge.com
6. www.usa.mercedes-benz.com
7. www.buick.com
8. www.acura.com
9. www.bmw.com
10. www.honda.com

M. Henzinger           Web Information Retrieval   39
Problems & solutions

l Some edges are “wrong” -- not a
  recommendation:                                     Author A 1/3 Author B
   – multiple edges from same author
   – automatically generated                                  1/3

   – spam, etc.
  Solution: Weight edges to limit influence                         1/3


l Topic drift
   – Query: +jaguar +car
     Result: pages about cars in general
  Solution: Analyze content and assign
  topic scores to nodes

  M. Henzinger            Web Information Retrieval                       40
Modified HITS algorithms

    Repeat until HUB and AUTH converge:
           Normalize HUB and AUTH
           HUB[v] := Σ AUTH[ui] TopicScore[ui] weight[v,ui]
                        for all ui with Edge(v, ui)
           AUTH[v] := Σ HUB[wi] TopicScore[wi] weight[wi,v]
                        for all wi with Edge(wi, v)


       [CDRRGK’98, BH’98, CDGKRRT’98]


M. Henzinger                 Web Information Retrieval        41
Output from modified HITS:
                 +jaguar +car

1. www.jaguarcars.com/         - official website of Jaguar cars        ☺
2. www.collection.co.uk/       - official Jaguar accessories            ☺
3. home.sn.no/.../jaguar.html  - the Jaguar Enthusiast Place            ☺
4. www.terrysjag.com/          - Jaguar Parts                           ☺
5. www.jaguarvehicles.com/     - official website of Jaguar cars        ☺
6. www.jagweb.com/             - for companies specializing in Jags     ☺
7. jagweb.com/jdht/jdht.html   - articles about Jaguars and Daimler
8. www.jags.org/               - Oldest Jaguar Club                     ☺
9. connection.se/jagsport/     - Sports car version of Jaguar MK II
10. users.aol.com/.../jane.htm - Jaguar Association of New England Ltd.



  M. Henzinger                Web Information Retrieval              42
User study [BH’98]

        [bar chart: valuable pages within the 10 top answers, averaged over
         28 topics, shown separately for Authorities and Hubs; bars compare
         the Original algorithm, Edge Weighting, and EW + Content Analysis]

M. Henzinger                 Web Information Retrieval               43
PageRank vs. HITS

l Computation:                         l Computation:
   – Expensive                            – Expensive
   – Once for all documents               – Requires computation
     and queries (offline)                  for each query

l Query-independent –                  l Query-dependent
  requires combination with
  query-dependent criteria
l Hard to spam                         l Relatively easy to spam
                                       l Quality depends on quality
                                         of start set
                                       l Gives hubs as well as
                                         authorities
M. Henzinger            Web Information Retrieval                     44
Open problems

l Compare performance of query-dependent and query-
     independent connectivity analysis

l Exploit order of links on the page (see e.g.
     [CDGKRRT’98],[DH’99])

l Both Google and HITS compute principal eigenvector.
     What about non-principal eigenvector? ([K’98])

l Derive other graphs from the hyperlink structure …



M. Henzinger             Web Information Retrieval      45
Algorithmic issues related to
                 search engines
 l Collecting documents:
      – Priority
      – Load balancing
         • Internal
         • External
      – Trap avoidance
      – ...

 l Processing and representing the data:
      ✓ Query-independent ranking
      – Graph representation
      – Index building
      – Duplicate elimination
      – Categorization
      – ...

 l Processing queries:
      ✓ Query-dependent ranking
      – Duplicate elimination
      – Query refinement
      – Clustering
      – ...


  M. Henzinger            Web Information Retrieval                  46
More on graph representation

l Graphs derived from the hyperlink structure of the Web:
   – Node =page
   – Edge (u,v) iff pages u and v are related in a specific
     way (directed or not)
l Examples of edges:
   – iff u has hyperlink to v
   – iff there exists a page w pointing to both u and v
   – iff u is often retrieved within x seconds after v
   –…



M. Henzinger            Web Information Retrieval         47
Graph representation usage

l Ranking algorithms                  l Visualization/Navigation
   – PageRank                            – Mapuccino
   – HITS                                  [MJSUZB’97]
   –…                                    – WebCutter [MS’97]
l Search-by-example                      –…
  [DH’99]                             l Structured Web query
l Categorization of Web                 tools
  pages                                  – WebSQL [AMM’97]
   – [CDI’98]                            – WebQuery [CK’97]
                                         –…



M. Henzinger           Web Information Retrieval                   48
Example: SRC Connectivity
               Server [BBHKV’98]

Directed edges = Hyperlinks
l Goal: Support two basic operations for all URLs
   collected by AltaVista
    – InEdges(URL u, int k)
       • Return k URLs pointing to u
    – OutEdges(URL u, int k)
       • Return k URLs that u points to
l Difficulties:
    – Memory usage (~180 M nodes, 1B edges)
    – Preprocessing time (days …)
    – Query time (~ 0.0001s/result URL)


M. Henzinger           Web Information Retrieval    49
URL database

Sorted list of URLs is 8.7 GB (≈ 48 bytes/URL)
  Delta encoding reduces it to 3.8 GB (≈ 21 bytes/URL)


               Original text                          Delta encoding
      www.foobar.com/                          0  www.foobar.com/       1
      www.foobar.com/gandalf.htm               15 gandalf.htm           26
      www.foograb.com/                         7  grab.com/             41

      Each encoded entry stores the size of the prefix shared with the
      previous URL, the remaining suffix, and the node ID
      (e.g. "15 gandalf.htm 26").
M. Henzinger                        Web Information Retrieval
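
A minimal sketch of this delta (front) encoding; the field layout is simplified (the real server also stores node IDs and checkpoints for random access into the sorted list):

```python
def delta_encode(sorted_urls):
    # Front-code a sorted list of URLs as (shared_prefix_length, suffix) pairs.
    encoded, prev = [], ""
    for url in sorted_urls:
        shared = 0
        while shared < min(len(prev), len(url)) and prev[shared] == url[shared]:
            shared += 1
        encoded.append((shared, url[shared:]))
        prev = url
    return encoded

def delta_decode(encoded):
    urls, prev = [], ""
    for shared, suffix in encoded:
        url = prev[:shared] + suffix
        urls.append(url)
        prev = url
    return urls

urls = ["www.foobar.com/", "www.foobar.com/gandalf.htm", "www.foograb.com/"]
print(delta_encode(urls))                       # [(0, 'www.foobar.com/'), (15, 'gandalf.htm'), (7, 'grab.com/')]
assert delta_decode(delta_encode(urls)) == urls
```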
Graph data structure

                                        Node Table
                     ptr to URL           ptr to inlist table ptr to outlist table




                                                   ...



                                                                   ...
URL Database            ...
                      Inlist Table                        Outlist Table




                                                                ...
                         ...




  M. Henzinger                Web Information Retrieval                              51
Web graph factoid

l >1B nodes (12/99), mean indegree is ~ 8
l Zipfian Degree Distributions [KRRT’99]:
   – Fin(i) = fraction of pages with indegree i

                        Fin(i) ~ 1 / i^2.1

        – Fout(i) = fraction of pages with outdegree i

                        Fout(i) ~ 1 / i^2.38

M. Henzinger                   Web Information Retrieval   52
Open problems

l Graph compression: How much compression possible
  without significant run-time penalty?
       – Efficient algorithms to find frequently repeated small
         structures (e.g. wheels, K2,2)
l External memory graph algorithms: How to assign the
  graph representation to pages so as to reduce paging?
  (see [NGV’96, AAMVV’98])
l Stringology: Less space for URL database? Faster
  algorithms for URL to node translation?
l Dynamic data structures: How to make updates efficient
  at the same space cost?

M. Henzinger                Web Information Retrieval             53
Algorithmic issues related to
                 search engines
 l Collecting documents:
      – Priority
      – Load balancing
         • Internal
         • External
      – Trap avoidance
      – ...

 l Processing and representing the data:
      ✓ Query-independent ranking
      ✓ Graph representation
      – Index building
      – Duplicate elimination
      – Categorization
      – ...

 l Processing queries:
      ✓ Query-dependent ranking
      – Duplicate elimination
      – Query refinement
      – Clustering
      – ...


  M. Henzinger            Web Information Retrieval                  54
Index building
l Inverted index data structure: Consider all documents
  concatenated into one huge document
   – For each word keep an ordered array of all positions
     in document, potentially compressed

               Word 1   1st position                   …       last position




                                                                    ...
                ...


                           ...




l Allows efficient implementation of AND, OR, and AND
  NOT operations

M. Henzinger                       Web Information Retrieval                   55
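
A minimal sketch of such an inverted index with positional postings, plus an AND query over it; the postings are left uncompressed here, and the helper names are illustrative:

```python
from collections import defaultdict

def build_index(docs):
    # Conceptually concatenate all documents; keep an ordered position list per
    # word and remember which position range belongs to which document.
    index, bounds, position = defaultdict(list), [], 0
    for doc in docs:
        start = position
        for word in doc.lower().split():
            index[word].append(position)
            position += 1
        bounds.append((start, position))
    return index, bounds

def boolean_and(index, bounds, *words):
    # Return ids of documents that contain all query words.
    result = []
    for doc_id, (start, end) in enumerate(bounds):
        if all(any(start <= p < end for p in index.get(w, ())) for w in words):
            result.append(doc_id)
    return result

docs = ["princess diana news", "mouse trap software", "diana memorial fund"]
index, bounds = build_index(docs)
print(boolean_and(index, bounds, "princess", "diana"))   # -> [0]
```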
Algorithmic issues related to
                 search engines
 l Collecting documents:
      – Priority
      – Load balancing
         • Internal
         • External
      – Trap avoidance
      – ...

 l Processing and representing the data:
      ✓ Query-independent ranking
      ✓ Graph representation
      ✓ Index building
      – Duplicate elimination
      – Categorization
      – ...

 l Processing queries:
      ✓ Query-dependent ranking
      – Duplicate elimination
      – Query refinement
      – Clustering
      – ...


  M. Henzinger            Web Information Retrieval                  56
Reasons for duplicates

l Proliferation of almost equal documents on the Web:
       – Legitimate: Mirrors, local copies, updates, etc.
       – Malicious: Spammers, spider traps, dynamic URLs
       – Mistaken: Spider errors
l Approximately 30% of the pages on the Web are (near)
     duplicates. [BGMZ’97,SG’98]




M. Henzinger                Web Information Retrieval       57
Uses of duplicate information

        l Smarter crawlers
        l Smarter web proxies
               – Better caching
               – Handling broken links
        l Smarter search engines
               – no duplicate answers
               – smarter connectivity analysis
               – less RAM and disk



M. Henzinger                      Web Information Retrieval   58
2 Types of duplicate filtering


l Fine-grain: Finding near-duplicate documents


l Coarse-grain: Finding near-duplicate hosts (mirrors)




M. Henzinger            Web Information Retrieval        59
Fine-grain: Basic mechanism

l Must filter both duplicate and near-duplicate documents

l Computing pair-wise edit distance would take forever

l Preferable: store only a short sketch for each document.




M. Henzinger            Web Information Retrieval
The basics of a solution


[B’97],[BGMZ’97]

1. Reduce the problem to a set intersection problem

2. Estimate intersections by sampling minima.




M. Henzinger            Web Information Retrieval
Shingling

l Shingle = Fixed size sequence of w contiguous words
    a rose is a rose is a rose
    a rose is a
       rose is a rose
              is a rose is
                  a rose is a
                     rose is a rose

   D                                                                    Set of
                              Set of
               Shingling                               Fingerprint      64 bit
                             shingles
                                                                     fingerprints


M. Henzinger               Web Information Retrieval                         62
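
A minimal sketch of the shingling step, with an ordinary 64-bit hash standing in for the fingerprint function:

```python
import hashlib

def shingles(text, w=4):
    # Set of all w-word contiguous sequences in the document.
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def fingerprints(shingle_set):
    # 64-bit fingerprints of the shingles (first 8 bytes of SHA-1 as a stand-in).
    return {int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big") for s in shingle_set}

doc = "a rose is a rose is a rose"
print(sorted(shingles(doc)))   # 3 distinct shingles (the slide lists all 5 sliding windows)
print(fingerprints(shingles(doc)))
```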
Defining resemblance

           D1                                                    D2



                        S1                               S2


                 resemblance(D1, D2) = |S1 ∩ S2| / |S1 ∪ S2|
M. Henzinger                 Web Information Retrieval                63
Sampling minima

      l Apply a random permutation σ to the set [0..2^64]
      l Crucial fact:
            Let α = min(σ(S1)) and β = min(σ(S2)). Then

                  Pr(α = β) = |S1 ∩ S2| / |S1 ∪ S2|
M. Henzinger                  Web Information Retrieval         64
Implementation

   l Choose a set of t random permutations of U
   l For each document keep a sketch S(D) consisting of t
     minima = samples
   l Estimate resemblance of A and B by counting common
     samples
   l The permutations should be from a min-wise
     independent family of permutations. See [BCFM’97] for
     the theory of mwi permutations.




M. Henzinger            Web Information Retrieval
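
A minimal sketch of sampling minima to estimate resemblance; XOR with a fixed random salt stands in for a min-wise independent permutation (illustrative only, not the [BCFM'97] construction), and the inputs are assumed to be sets of integer shingle fingerprints:

```python
import random

def make_sketch(elements, t=84, seed=0):
    # One "permutation" per sample position; keep the minimum under each one.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(t)]
    return [min(e ^ salt for e in elements) for salt in salts]

def estimated_resemblance(sketch_a, sketch_b):
    # Fraction of positions where the two minima agree estimates |A ∩ B| / |A ∪ B|.
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)

a = {101, 102, 103, 104}
b = {102, 103, 104, 105}
print(estimated_resemblance(make_sketch(a, t=1000), make_sketch(b, t=1000)))  # ≈ 3/5
```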
If we need only high resemblance

  Sketch 1:
  Sketch 2:


  l Divide sketch into k groups of s samples (t = k * s)
  l Fingerprint each group ⇒ feature
  l Two documents are fungible if they have at least
         r common features.
  l Want
    Fungibility ⇔ Resemblance above fixed threshold ρ


M. Henzinger              Web Information Retrieval
Real implementation

l     ρ = 90%. In a 1000 word page with shingle length = 8
     this corresponds to
               • Delete a paragraph of about 50-60 words.
               • Change 5-6 random words.
l Sketch size t = 84, divide into k = 6 groups of s = 14
  samples
l 8 bytes fingerprints → we store only 6 x 8 =
                         48 bytes/document
l Threshold r = 2


M. Henzinger                    Web Information Retrieval
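
A minimal sketch of the grouping step with the parameters from this slide (t = 84 samples split into k = 6 groups of s = 14, threshold r = 2); the group fingerprint is an ordinary hash, used only for illustration:

```python
import hashlib

def features(sketch, k=6, s=14):
    # Fingerprint each group of s consecutive samples into a single feature.
    assert len(sketch) == k * s
    groups = [tuple(sketch[i * s:(i + 1) * s]) for i in range(k)]
    return [hashlib.sha1(repr(g).encode()).digest()[:8] for g in groups]

def fungible(feat_a, feat_b, r=2):
    # Two documents are deemed near-duplicates if at least r features coincide.
    return sum(a == b for a, b in zip(feat_a, feat_b)) >= r
```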
Probability that two documents
               are deemed fungible

Two documents with resemblance ρ
l Using the full sketch
                 P = Σ_{i = r·s .. k·s}  C(k·s, i) · ρ^i · (1 − ρ)^(k·s − i)

 l Using features

                 P = Σ_{i = r .. k}  C(k, i) · ρ^(s·i) · (1 − ρ^s)^(k − i)



M. Henzinger                 Web Information Retrieval   68
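
Both expressions can be evaluated directly; a small sketch that prints the two probabilities for the parameters of the previous slide and a few resemblance values ρ:

```python
from math import comb

def p_full_sketch(rho, k=6, s=14, r=2):
    # P(at least r*s of the k*s individual samples agree)
    t = k * s
    return sum(comb(t, i) * rho**i * (1 - rho)**(t - i) for i in range(r * s, t + 1))

def p_features(rho, k=6, s=14, r=2):
    # P(at least r of the k grouped features agree); a feature agrees only if
    # all s samples in its group agree, i.e. with probability rho**s.
    return sum(comb(k, i) * (rho**s)**i * (1 - rho**s)**(k - i) for i in range(r, k + 1))

for rho in (0.80, 0.90, 0.95):
    print(rho, round(p_full_sketch(rho), 4), round(p_features(rho), 4))
```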
Features vs. full sketch
           Probability that two pages are deemed fungible




  Prob

                     Using full sketch
                     Using features




                            Resemblance
M. Henzinger                 Web Information Retrieval
Fine-grain duplicate elimination:
               open problems and related work

l Best way of grouping samples for a given threshold
  and/or for multiple thresholds?
l Efficient ways to find in a data base pairs of records that
  share many attributes. Best approach?
l Min-wise independent permutations -- lots of open
  questions.
l Other applications possible (images, sounds, ...) -- need
  translation into set intersection problem.


l Related work: M’94, BDG’95, SG’95, H’96, FSGMU’98

M. Henzinger            Web Information Retrieval           70
2 Types of duplicate filtering


     Fine-grain: Finding near-duplicate documents


l Coarse-grain: Finding near-duplicate hosts (mirrors)




M. Henzinger             Web Information Retrieval       71
Input: set of URLs

       l Input:
               – Subset of URLs on various hosts, collected e.g. by
                 search engine crawl or web proxy
               – No content of pages pointed to by URLs
                 except each page is labeled with its out-links


       l Goal: Find pairs of hosts that mirror content




M. Henzinger                    Web Information Retrieval         72
Example
                   www.synthesis.org                                   synthesis.stanford.edu


                     a              b                                          a            b
                                                d                                                      d
                             c                                                       c

www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html             synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-classroom.html
www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html     synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced-circ-anal.html
www.synthesis.org/Docs/annual.report96.final.html                synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-mechatron.html
www.synthesis.org/Docs/cicee-berlin-paper.html                   synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-studies.html
www.synthesis.org/Docs/myr5                                      synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html
www.synthesis.org/Docs/myr5/cicee/bridge-gap.html                synthesis.stanford.edu/Docs/annual.report96.final.html
www.synthesis.org/Docs/myr5/cs/cs-meta.html                      synthesis.stanford.edu/Docs/annual.report96.final_fn.html
www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html       synthesis.stanford.edu/Docs/myr5/assessment
www.synthesis.org/Docs/myr5/mech/mech-take-home.html             synthesis.stanford.edu/Docs/myr5/assessment/assessment-main.html
www.synthesis.org/Docs/myr5/synsys/experiential-learning.html    synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-A6-E25.html
www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html           synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
www.synthesis.org/Docs/yr5ar                                     synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
www.synthesis.org/Docs/yr5ar/assess                              synthesis.stanford.edu/Docs/myr5/cicee
www.synthesis.org/Docs/yr5ar/cicee                               synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html               synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html      synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html

          M. Henzinger                                   Web Information Retrieval                                        73
Coarse-grain: Basic mechanism

l Must filter both duplicate and near-duplicate mirrors

l Pair-wise testing would take forever

l Both high precision (not outputting wrong mirrors) and
     high recall (finding almost all mirrors) are important




M. Henzinger               Web Information Retrieval
A definition of mirroring

Host1 and Host2 are mirrors iff
       For all paths p such that
               http://Host1/p
       is a web page,
               http://Host2/p
       exists with duplicate (or near-duplicate) content,
       and vice versa.




M. Henzinger                 Web Information Retrieval      75
The basics of a solution


[BBDH’99]

1. Pre-filter to create a small set of pairs of potential
   mirrors (pre-filtering step)

2. Test each pair of potential mirrors (testing step)

3. Use different pre-filtering algorithms to improve recall



M. Henzinger              Web Information Retrieval
Testing step

l Test root pages + x URLs from each host sample
l If one test returns “not near-duplicate”
  then hosts are not mirrors
l If the root pages and more than a fixed fraction of the x sampled
  URLs from each host are near-identical
  then the hosts are mirrors,
  else they are not mirrors




M. Henzinger             Web Information Retrieval      77
Pre-filtering step

l Goal: Output quickly list of pairs of potential mirrors
  containing
       – many true mirror pairs (high recall)
       – not many non-mirror pairs (high precision)
l Note: 2-sided error is allowed
       – Type-1: true mirror pairs might be missing in output
       – Type-2: non-mirror pair might be output
l Testing of host pairs will eliminate type-2 errors, but not
  type-1 errors

M. Henzinger                Web Information Retrieval           78
Different pre-filtering techniques

l IP-based
l URL-string based
l URL-string and hyperlink based
l Hostname and hyperlink based




M. Henzinger            Web Information Retrieval   79
Problem with IP addresses

                 eliza-iii.ibex.co.nz




203.29.170.23



      pixel.ibex.co.nz




  M. Henzinger                    Web Information Retrieval   80
Number of hosts with same IP address
               vs mirror probability

        [bar chart: probability of mirroring (y-axis, 0 to 0.9) as a function
         of the number of hosts sharing the same IP address (x-axis, 2 to >9)]

M. Henzinger                                          Web Information Retrieval                    81
IP based pre-filtering algorithms

l IP4: Cluster hosts based on IP address
   – Enumerate pairs from clusters in increasing
     cluster size (max 200 pairs)

l IP3: Cluster hosts based on first 3 octets of
  their IP address
   – Enumerate pairs from clusters in increasing
     cluster size (max 5 pairs)




M. Henzinger            Web Information Retrieval   82
URL string based pre-filtering
               algorithms
Information extracted from URL strings:
l Similar hostnames: might belong to same organization
l Similar paths: might have replicated directories
⇒ extract "features" for host from URL strings and test
  similarity
Similarity Testing Approach:
l Feature vector for each host similar to term vector for
  document:
      – Host corresponds to document
      – Feature corresponds to term
l Similarity of hosts = Cosine of angle of feature vectors
M. Henzinger             Web Information Retrieval           83
URL string based algorithms
               (cont.)

     – paths: Features are paths: e.g.,
        /staff/homepages/~dilbert/foo
     – prefixes: Features are prefixes: e.g.,
         /staff
         /staff/homepages
         /staff/homepages/~dilbert
         /staff/homepages/~dilbert/foo
     – Other variants: hosts and shingles




M. Henzinger               Web Information Retrieval   84
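
A minimal sketch of the feature-vector similarity test from the previous slide, shown for the plain "paths" variant (the "prefixes" or shingle variants only change which strings become features); the two hypothetical hosts reuse paths from the earlier synthesis.org example:

```python
import math
from collections import Counter

def host_vector(paths):
    # One dimension per distinct URL path seen on the host.
    return Counter(paths)

def cosine(u, v):
    dot = sum(w * v.get(f, 0) for f, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

a = host_vector(["/Docs/ProjAbs/synsys/synalysis.html", "/Docs/annual.report96.final.html"])
b = host_vector(["/Docs/annual.report96.final.html", "/Docs/myr5/assessment"])
print(cosine(a, b))   # hosts with similar path sets become candidate mirror pairs
```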
Paths + connectivity (conn)

l Take output from paths and filter thus:
       – Consider 10 common paths in sample with highest
         outdegree
       – Paths are equivalent if 90% of their combined out-
         edges are common to both
       – Keep host-pair if 75% of the paths are equivalent




M. Henzinger               Web Information Retrieval          85
Hostname connectivity

 l Idea: Mirrors point to similar set of other hosts
 ⇒ Feature vector approach to test similarity:
        – features are hosts that are pointed to
        – 2 different ways of feature weighting:
               • hconn1
               • hconn2




M. Henzinger                Web Information Retrieval   86
Experiments

       l Input: 140 million URLs on 233,035 hosts + out-
         edges
               – Original 179 million URLs reduced by considering
                 only hosts with at least 100 URLs in set
       l For each of the above pre-filtering algorithms:
               – Compute list of 25,000 (100,000) ranked pairs of
                 potential mirrors
               – Test each pair of potential mirrors (testing step)
                 and output list of mirrors
                 Determine precision and relative recall


M. Henzinger                    Web Information Retrieval             87
Precision up to rank 25,000




M. Henzinger            Web Information Retrieval   88
Relative recall up to rank 25,000
        [plot: relative recall (y-axis) vs. rank (x-axis)]
M. Henzinger                      Web Information Retrieval   89
Relative recall at 25,000 for
                     combined output

               hosts IP3 IP4 conn hconn1 hconn2 paths               prefix shingles
hosts          17%
IP3            39% 30%
IP4            61% 58% 54%
conn           58% 66% 80% 47%
hconn1 40% 51% 69% 59%                 26%
hconn2 41% 52% 70% 60%                 29%              27%
paths          48% 59% 78% 55%         51%              52%   36%
prefix         54% 61% 75% 65%         58%              58%   57%   44%
shingles 53% 61% 75% 64%               57%              58%   57%   48%     44%

      M. Henzinger              Web Information Retrieval                    90
Combined approach (combined)

l Combines top 100,000 results from hosts, IP4, paths,
     prefix, and hconn1.
l Sort host pairs by:
       – Number of algorithms that return the host pair
       – Use best rank for any algorithm to break ties between
          host pairs
l At rank 100,000: relative recall of 86%, precision of 57%



M. Henzinger               Web Information Retrieval         91
Precision vs relative recall




M. Henzinger            Web Information Retrieval   92
Web host graph

l A node for each host h
l An undirected edge (h,h’) if h and h’ are output as
     mirrors


⇒ Each (connected) component gives a set of mirrors




M. Henzinger            Web Information Retrieval       93
Example of a component
                     Protein Data
                        Bank




M. Henzinger           Web Information Retrieval   94
Component size distribution

   43,491 mirrored hosts (of 233,035 hosts considered)

   Component size:          2      3     4    5    6   7   8   9  10  11  12  13  14  15
   Number of components: 15650  2204   553  185   95  49  26  18  13  10   6   6   4   3

M. Henzinger                                            Web Information Retrieval                             95
Coarse-grain duplicate filtering:
               Summary and open problems
l Mirroring is common (43,491 mirrored hosts out of
  233,035 considered hosts)
       – Load balancing, franchises/branding, virtual hosting, spam
l Mirror detection based on non-content attributes is
  feasible.
l [CSG’00] use page content similarity based approach.
  Open Problem: Compare and combine content and non-
  content techniques.
l Open Problem: Assume you can choose which URLs to
  visit at a host. Determine best technique.

M. Henzinger                 Web Information Retrieval            96
Algorithmic issues related to
                 search engines
 l Collecting documents:
      – Priority
      – Load balancing
         • Internal
         • External
      – Trap avoidance
      – ...

 l Processing and representing the data:
      ✓ Query-independent ranking
      ✓ Graph representation
      ✓ Index building
      ✓ Duplicate elimination
      – Categorization
      – ...

 l Processing queries:
      ✓ Query-dependent ranking
      ✓ Duplicate elimination
      – Query refinement
      – Clustering
      – ...


  M. Henzinger            Web Information Retrieval                  97
Adding pages to the index
Crawling process

        Get link at
       top of queue                                     Expired pages
                                                         from index
                              Queue of
        Fetch page            links to
                              explore

        Index page
            and
         parse links

                                                    Add URL
       Add to queue

 M. Henzinger           Web Information Retrieval                 98
Queuing discipline

l Standard graph exploration:
   – Random
   – BFS
   – DFS (+ depth limits)
l Goal: get “best” pages for a given index size
   – Priority based on query-independent ranking:
      • highest indegree [M’95]
      • highest potential PageRank [CGP’98]
l Goal: keep index fresh
   – Priority based on rate of change [CLW’97]


M. Henzinger           Web Information Retrieval    99
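
A minimal sketch of a priority-driven crawl queue using the in-degree observed so far as the query-independent priority; fetching and parsing are stubbed out, priorities are not updated after a URL is queued, and politeness, trap avoidance and load balancing are omitted:

```python
import heapq

def crawl(seed_urls, fetch_links, max_pages=1000):
    # fetch_links(url) -> list of out-link URLs (stub for "fetch page, parse links").
    indegree = {u: 0 for u in seed_urls}
    heap = [(0, u) for u in seed_urls]               # (-indegree, url); seeds start equal
    seen = set(seed_urls)
    indexed = []
    while heap and len(indexed) < max_pages:
        _, url = heapq.heappop(heap)                 # get highest-priority link
        indexed.append(url)                          # "index page"
        for link in fetch_links(url):                # "parse links"
            indegree[link] = indegree.get(link, 0) + 1
            if link not in seen:                     # "add URL to queue"
                seen.add(link)
                heapq.heappush(heap, (-indegree[link], link))
    return indexed

links = {"a": ["b", "c"], "b": ["c"], "c": []}
print(crawl(["a"], lambda u: links.get(u, [])))
```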
Load balancing

l Internal -- can not handle too much retrieved data
  simultaneously, but
   – Response time is unpredictable
   – Size of answers is unpredictable
   – There are additional system constraints (# threads, #
     open connections, etc.)
l External
   – Should not overload any server or connection
   – A well-connected crawler can saturate the entire
     outside bandwidth of some small countries
   – Any queuing discipline must be acceptable to the
     community


M. Henzinger           Web Information Retrieval         100
Web IR Tools

     General-purpose search engines
l    Hierarchical directories
l    Specialized search engines
     (dealing with heterogeneous data sources)
l    Search-by-example
l    Collaborative filtering
l    Meta-information




M. Henzinger             Web Information Retrieval   101
Hierarchical directories

Building of hierarchical directories:
l Manual: Yahoo!, LookSmart, Open Directory
l Automatic:
       – Populating of hierarchy [CDRRGK’98]: For each node
         in the hierarchy formulate fine-tuned query and run
         modified HITS algorithm
       – Categorization: For each document find “best”
         placement in the hierarchy. Techniques are connectivity
         and/or text based [CDI’98, …]


M. Henzinger               Web Information Retrieval           102
Web IR Tools

     General-purpose search engines
     Hierarchical directories
l    Specialized search engines
     (dealing with heterogeneous data sources)
      – Shopping robots
      – Home page finder [SLE’97]
      – Applet finders
      –…
l    Search-by-example
l    Collaborative filtering
l    Meta-information

M. Henzinger             Web Information Retrieval   103
Dealing with heterogeneous
                 sources
l Modern life problem:

               Given information sources with various capabilities,
                    query all of them and combine the output.

l Examples
   – Inter-business e-commerce e.g. www.industry.net
   – Meta search engines
   – Shopping robots
l Issues
   – Determining relevant sources -- the “identification”
     problem
   – Merging the results -- the “fusion” problem

M. Henzinger                     Web Information Retrieval            104
Example: a shopping robot
l Input: A product description in some form
  Find: Merchants for that product on the Web
l Jango [DEW’97]
       – preprocessing: Store vendor URLs in database;
         learn for each vendor:
               • the URL of the search form
               • how to fill in the search form and
               • how the answer is returned
       – request processing: fill out form at every vendor and
         test whether the result is a success
       – range of products is predetermined


M. Henzinger                      Web Information Retrieval      105
Jango input example




M. Henzinger           Web Information Retrieval   106
Jango output




M. Henzinger           Web Information Retrieval   107
Direct query to K&L Wine




M. Henzinger           Web Information Retrieval   108
What price decadence?




M. Henzinger           Web Information Retrieval   109
Open problems

l Going beyond the lowest common capability
l Learning problem: automatic understanding of new
     interfaces
l Efficient ways to determine which DB is relevant to a
     particular query
l “Partial knowledge indexing”: indexer has limited access to
     the full DB




M. Henzinger            Web Information Retrieval          110
Web IR Tools

     General-purpose search engines
     Hierarchical directories
     Specialized search engines
     (dealing with heterogeneous data sources)
l Search-by-example
l Collaborative filtering
l Meta-information




M. Henzinger                Web Information Retrieval   111
Search-by-example

l Given: set S of URLs
l Find: URLs of similar pages
l Various Approaches:
       – Connectivity-based
       – Usage based: related pages are pages visited
         frequently after S
       – Query refinement
l Related Work:G’72, G’79, K’98, PP’97,S’97, CDI’98


M. Henzinger                Web Information Retrieval   112
Output from Google:
               related:www.ebay.com




M. Henzinger           Web Information Retrieval   113
Output from Alexa:
               www.ebay.com




M. Henzinger           Web Information Retrieval   114
Connectivity based solutions

[DH’99]
l Algorithm Companion

l Algorithm Co-citation




M. Henzinger              Web Information Retrieval   115
Algorithm Companion

  l Build modified neighborhood graph N.

  l Run modified HITS algorithm on N.

  Major Question: How to form neighborhood graph
        s.t. top returned pages are useful related pages




M. Henzinger               Web Information Retrieval       116
Building neighborhood graph N
l Node set: From URL u go ‘back’, ‘forward’, ‘back-
  forward’, and ‘forward-back’
l Edge set: Directed edge if there is a hyperlink
  between 2 nodes
l Apply refinements to N




                                    u




M. Henzinger           Web Information Retrieval      117
Refinement 1: Limit out-degree

           l Motivation: Some nodes have high out-degree =>
             graph would become too large
           l Limit out-degree when going “forward”
               – Going forward from u : choose first 50 out-links
                 on page
               – Going forward from other nodes: choose 8 out-
                 links surrounding the in-link traversed to reach
                 u
      u

M. Henzinger                 Web Information Retrieval        118
Co-citation algorithm

l Determine 2000 arbitrary back-nodes of u.
l Add to set S of siblings of u:
    For each back node 8 forward-nodes
    surrounding the link to u
                                                          u
l If there is enough co-citation with u then
      – return nodes in S in decreasing order of co-
        citations
    else
      – restart algorithm with one path element removed
        (http://…/X/Y/ -> http://…/X/)

M. Henzinger             Web Information Retrieval        119
Alexa’s “What’s Related”

l Uses:
       – Document Content
       – Usage Data
       – Connectivity


l Removes path elements if no answer for u is found




M. Henzinger            Web Information Retrieval     120
User study
        [bar chart: valuable pages within the 10 top answers, averaged over
         59 (= ALL) or 37 (= INTERSECTION) queries; bars compare Alexa,
         Cocitation, and Companion for all queries and for the intersection
         queries]

                       M. Henzinger                                        Web Information Retrieval                             121
Web IR Tools

  General-purpose search engines:
  Hierarchical directories
  Specialized search engines:
  Search-by-example
l Collaborative filtering
l Meta-information




M. Henzinger           Web Information Retrieval   122
Collaborative filtering

                User input  --(explicit preferences)-->  Collected input

                prediction phase                         analysis phase

                Suggestions                              Model


M. Henzinger                 Web Information Retrieval               123
Lots of projects

l Collaborative filtering was seldom used in classic IR, but it has seen a big
  revival on the Web. Projects:
       – PHOAKS -- AT&T Labs → Web page recommendations based on Usenet postings
       – GAB -- Bellcore → Web browsing
       – Grouplens -- U. Minnesota → Usenet newsgroups
       – EachToEach -- Compaq SRC → rating movies
       – …
See http://sims.berkeley.edu/resources/collab/



M. Henzinger                Web Information Retrieval      124
Why do we care?

l The ranking schemes that we discussed are also a form of collaborative
  ranking!
       – Connectivity = people vote with their links
       – Usage = people vote with their clicks
l These schemes are used only for global model building.
  Can they be combined with per-user data? Ideas:
       – Consider the graph induced by the user’s bookmarks.
       – Profile the user -- see www.globalbrain.net
       – Deal with privacy concerns!


M. Henzinger                Web Information Retrieval     125
Web IR Tools

  General-purpose search engines:
  Hierarchical directories
  Specialized search engines:
  Search-by-example
  Collaborative filtering
l Meta-information
   – Comparison of
     search engines
   – Log statistics



M. Henzinger           Web Information Retrieval   126
Comparison of search engines


                – Ideal measure: User satisfaction
                – Number of user requests
                – Quality of search engine index
                – Size of search engine index
           (these measures differ in how hard they are to measure
            independently and in how useful they are for comparison)


M. Henzinger             Web Information Retrieval       127
Comparing Search Engine Sizes

l Naïve Approaches
   – Get a list of URLs from each search engine and
     compare
      • Not practical or reliable.
   – Result Set Size Comparison
      • Reported sizes are approximate.
   –…
l Better Approach
   – Statistical Sampling



M. Henzinger           Web Information Retrieval      128
URL sampling
l Ideal strategy: Generate a random URL and check for
  containment in each index.
l Random URLs are hard to generate:
        – Random walk methods
               • Graph is directed
               • Stationary distribution is non-uniform
               • Must prove rapid mixing.
       – Pages in cache, query logs [LG’98a], etc.
               • Correlated to the interests of a particular group of
                 users.
l A simple way: collect all pages on the Web and pick one
  at random.
M. Henzinger                      Web Information Retrieval             129
Sampling via queries [BB’98]

l Search engines have the best crawlers -- why not
     exploit them?
l Method:
       – Sample from each engine in turn
       – Estimate the relative sizes of two search engines
       – Compute absolute sizes from a reference point




M. Henzinger               Web Information Retrieval         130
Estimate relative sizes

                                 Select pages randomly from
                                 A (resp. B)
                                 Check if page contained in
                                 B (resp. A)
                                |A ∩ B| ≈ (1/2) * |A|
                                |A ∩ B| ≈ (1/6) * |B|
                                ∴            |B| ≈ 3 * |A|




   Two steps: (i) Selecting (ii) Checking
M. Henzinger            Web Information Retrieval             131
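As a worked sketch of the arithmetic on this slide (assuming hypothetical sample_from and contains helpers that stand in for the selecting and checking steps described on the next two slides):

    def relative_size_ratio(sample_from, contains, engine_a, engine_b, n=1000):
        """Estimate |B| / |A| from overlap fractions (illustrative sketch).

        If a fraction f_a of pages sampled from A is also in B, and a fraction
        f_b of pages sampled from B is also in A, then
            f_a * |A|  ~  |A ∩ B|  ~  f_b * |B|,   so   |B| / |A|  ~  f_a / f_b.
        """
        f_a = sum(contains(engine_b, p) for p in sample_from(engine_a, n)) / n
        f_b = sum(contains(engine_a, p) for p in sample_from(engine_b, n)) / n
        return f_a / f_b

    # With the fractions on the slide, f_a = 1/2 and f_b = 1/6, giving |B| ≈ 3 * |A|.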
Selecting a random page
l Generate random query
       – Build a lexicon of words that occur on the Web
       – Combine random words from lexicon to form queries
l Get the first 100 query results from engine A
l Select a random page out of the set
l Distribution is biased -- the conjecture is that

         ∑_{D ∈ A∩B} p(D)          |A ∩ B|
        -------------------   ≈   ---------
          ∑_{D ∈ A} p(D)             |A|

     where p(D) is the probability that D is picked by this scheme

M. Henzinger                  Web Information Retrieval       132
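A minimal sketch of this selection step, assuming a lexicon built from a Web crawl and a hypothetical engine_query(terms, k) helper that returns the first k result URLs; the number of words per query is an arbitrary illustrative choice, not a detail from [BB'98].

    import random

    def select_random_page(engine_query, lexicon, words_per_query=4, max_results=100):
        """Pick a (biased) near-random page from an engine's index (sketch)."""
        while True:
            terms = random.sample(lexicon, words_per_query)   # random conjunctive query
            results = engine_query(terms, max_results)        # first 100 results from A
            if results:
                # The resulting distribution p(D) is biased toward pages that match
                # many random queries; see the conjecture on the slide above.
                return random.choice(results)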
Checking if an engine has a page
l Create a “unique query” for the page:
   – Use 8 rare words.
   – E.g., for the Digital Systems Research Center Home Page:




  M. Henzinger            Web Information Retrieval        133
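A sketch of the checking step, assuming a hypothetical rarest_words helper (e.g. picking words by document frequency in a crawl) and the same engine_query interface as above; the URL normalization is an assumption about how an exact match versus a host match could be decided.

    def engine_has_page(engine_query, page_url, page_words, rarest_words,
                        num_terms=8, max_results=100, host_match=False):
        """Check whether an engine indexes page_url via a 'unique query' (sketch)."""
        query = rarest_words(page_words, num_terms)       # 8 rare words from the page
        results = engine_query(query, max_results)
        target = _key(page_url, host_match)
        return any(_key(r, host_match) == target for r in results)

    def _key(url, host_match):
        url = url.lower().rstrip('/')
        if host_match:                                    # compare hosts only
            return url.split('/')[2] if '//' in url else url.split('/')[0]
        return url                                        # exact-match comparison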
Results of the BB’98 study


     Estimated sizes in millions of pages (July 1998):

        Static Web         350
        Union of engines   240
        AltaVista          125
        HotBot             100
        Excite              45
        Infoseek            37
        Northern Light      75
        Lycos               21

     Status as of July '98:
        Web size: 350 million pages        Growth: 25M pages/month
        Six largest engines cover: 2/3     Small overlap: 3M pages

     (The original chart also plotted sizes for Jun-97, Nov-97, and Mar-98.)

      M. Henzinger                            Web Information Retrieval                                   134
Crawling strategies are different!

     Exclusive listings in millions of pages (Jul-98):

        Lycos             5
        Northern Light   20
        AltaVista        50
        HotBot           45
        Excite           13
        Infoseek          8



M. Henzinger                                     Web Information Retrieval                          135
Comparison of search engines


                – Ideal measure: User satisfaction
                – Number of user requests
                – Quality of search engine index
                – Size of search engine index
           (these measures differ in how hard they are to measure
            independently and in how useful they are for comparison)


M. Henzinger             Web Information Retrieval       136
Quality: A general definition
               [HHMN’99]

l Assign each page p a weight w(p) such that  ∑_{all p} w(p) = 1

⇒ w can be thought of as a probability distribution on pages

l Quality of a search engine index S is  w(S) = ∑_{p ∈ S} w(p)

l Example:
   – If w is the same for all pages, then w(S) is proportional to the
     total size of S (in pages).
l Average page quality in index S is w(S)/|S|.
l We use: the weight w(p) of a page p is its PageRank
M. Henzinger            Web Information Retrieval             137
Estimating quality by sampling

Suppose we can choose random pages according to w (so
  that page p appears with probability w(p))
l Choose a sample of pages p1,p2,p3… pn
l Check if the pages are in search engine index S
l Estimate for quality of index S is the percentage of
  sampled pages that are in S, i.e.
                   w(S) = (1/n) · ∑_{j=1..n} I[p_j ∈ S]

      where I[p_j ∈ S] = 1 if p_j is in S and 0 otherwise


M. Henzinger                Web Information Retrieval     138
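The estimator on this slide (for the quality definition two slides back) is a one-liner; a sketch assuming a hypothetical sample_by_weight(n) that draws pages with probability w(p), e.g. via the random walk on the following slides, and an in_index containment test in the style of [BB'98]:

    def estimated_quality(sample_by_weight, in_index, n=1000):
        """Estimate w(S) = sum of w(p) over pages p in index S (sketch)."""
        pages = sample_by_weight(n)                  # p_j drawn with probability w(p_j)
        return sum(in_index(p) for p in pages) / n   # fraction of sampled pages in S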
Missing pieces


l Sample pages according to the PageRank distribution.


l Test whether page p is in search engine index S.
       → same methodology as [BB’98]




M. Henzinger            Web Information Retrieval        139
Sampling pages (almost)
               according to PageRank

l Perform a random walk and select n random pages from it.

l Problems:

       – Starting state bias: finite walk only approximates
          PageRank.

       – Can’t jump to a random page; instead, jump to a random
          page on a random host seen so far.

⇒ Samples pages according to a distribution that behaves similarly to
   PageRank, but is not identical to PageRank


M. Henzinger                 Web Information Retrieval         140
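A minimal sketch of such a walk, assuming a hypothetical out_links(u) helper that downloads page u and extracts its hyperlinks; the jump probability d = 1/7 matches the experiments on the next slide, while the host bookkeeping and the sampling at the end are a simplified reading of the two bullets above.

    import random

    def approximate_pagerank_sample(start_url, out_links, steps, n_samples, d=1/7):
        """Sample pages from a PageRank-like random walk (illustrative sketch)."""
        pages_by_host = {}                 # host -> pages seen on that host so far
        visited = []
        u = start_url
        for _ in range(steps):
            visited.append(u)
            pages_by_host.setdefault(_host(u), []).append(u)
            links = out_links(u)
            if random.random() < d or not links:
                # Can't jump to a uniformly random Web page; instead jump to a
                # random page on a random host seen so far (as described above).
                u = random.choice(pages_by_host[random.choice(list(pages_by_host))])
            else:
                u = random.choice(links)   # follow a random out-link
        return random.sample(visited, min(n_samples, len(visited)))

    def _host(url):
        return url.split('/')[2] if '//' in url else url.split('/')[0]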
Experiments

l Performed two long random walks with d=1/7 starting at
  www.yahoo.com

                                          Walk 1        Walk 2
   length                                 18 hours      54 hours
   attempted downloads                    2,867,466     6,219,704
   HTML pages successfully downloaded     1,393,265     2,940,794
   unique HTML pages                      509,279       1,002,745
   sampled pages                          1,025         1,100

M. Henzinger              Web Information Retrieval              141
Random walk effectiveness

l Pages (or hosts) that are “highly-reachable” are visited
  often by the random walks
l Initial bias for www.yahoo.com is reduced in longer walk
l Results are consistent over the 2 walks
l The average indegree of pages with indegree <= 1000
  is high:
       – 53 in walk 1
       – 60 in walk 2


M. Henzinger            Web Information Retrieval            142
Most frequently visited pages

   Page                                              Freq.    Freq.    Rank
                                                     Walk 2   Walk 1   Walk 1
   www.microsoft.com/                                 3172     1600      1
   www.microsoft.com/windows/ie/default.htm           2064     1045      3
   www.netscape.com/                                  1991      876      6
   www.microsoft.com/ie/                              1982     1017      4
   www.microsoft.com/windows/ie/download/             1915      943      5
   www.microsoft.com/windows/ie/download/all.htm      1696      830      7
   www.adobe.com/prodindex/acrobat/readstep.htm       1634      780      8
   home.netscape.com/                                 1581      695     10
   www.linkexchange.com/                              1574      763      9
   www.yahoo.com/                                     1527     1132      2

    M. Henzinger                        Web Information Retrieval                       143
Most frequently visited hosts
                  Site       Frequency                   Frequency    Rank
                               Walk 2                      Walk 1     Walk 1

www.microsoft.com                          32452              16917              1
home.netscape.com                          23329              11084              2
www.adobe.com                              10884               5539              3
www.amazon.com                             10146               5182              4
www.netscape.com                            4862               2307             10
excite.netscape.com                         4714               2372              9
www.real.com                                4494               2777              5
www.lycos.com                               4448               2645              6
www.zdnet.com                               4038               2562              8
www.linkexchange.com                        3738               1940             12
www.yahoo.com                               3461               2595              7

   M. Henzinger              Web Information Retrieval                    144
Results for index quality

   [Bar chart: Index quality (y-axis, roughly 0 to 0.35) estimated from
   Walk 1 and Walk 2, for exact URL match and for host match, for AltaVista,
   HotBot, Excite, Infoseek, Lycos, and Google (beta); approximate index size
   in millions of pages shown in parentheses after each engine name.]
                M. Henzinger                                   Web Information Retrieval                  145
Results for index quality/page

   [Bar chart: Index quality per page (y-axis, roughly 0 to 0.8) estimated
   from Walk 1 and Walk 2, for exact URL match and for host match, for
   AltaVista, HotBot, Excite, Infoseek, Lycos, and Google (beta); approximate
   index size in millions of pages shown in parentheses after each engine
   name.]
      M. Henzinger                               Web Information Retrieval            146
Insights from the data

l Our approach appears consistent over repeated tests

⇒ Random walks are a useful tool

l Quality is different from size for search engine indices

⇒ Some search engines are apparently trying to index high quality pages




M. Henzinger              Web Information Retrieval          147
Open problems

l Random page generation via random walks
l Cryptography based approach: want random pages
  from each engine but no cheating! (page should be
  chosen u.a.r. from the actual index)
   – Each search engine can commit to the set of pages it
     has without revealing it
   – Need to ensure that this set is the same as the set
     actually indexed
   – Need efficient oblivious protocol to obtain random
     page from search engine
   – See [NP‘98] for possible solution


M. Henzinger           Web Information Retrieval       148
Web IR Tools

  General-purpose search engines:
  Hierarchical directories
  Specialized search engines:
  Search-by-example
  Collaborative filtering
l Meta-information
     Comparison of search
     engines
   – Log statistics



M. Henzinger           Web Information Retrieval   149
How often do people view a
               page?
l Problems:
   – Web caches interfere with click counting
   – cheating pays (advertisers pay by the click)
l Solutions:
   – naïve: forces caches to re-fetch for every click.
      • Lots of traffic, annoyed Web users
   – extend HTML with counters [ML’97]
      • requires compliance, down caches falsify results.
   – use sampling [P’97]
      • force refetches on random days
      • force refetches for random users and IP addresses
   – cryptographic audit bureaus [NP’98a]
l Commercial providers: 100hot, Media Metrix, Relevant Knowledge, …
M. Henzinger              Web Information Retrieval         150
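To make the sampling idea above concrete, here is a hedged sketch of a server-side decision rule: force a cache re-fetch (and count the click) only on randomly chosen days or for randomly chosen users, then scale the observed counts up by the sampling rate. The hash-based selection and the rates are illustrative assumptions, not details taken from [P'97].

    import hashlib
    from datetime import date

    def _frac(s):
        """Deterministic pseudo-random value in [0, 1) derived from a string."""
        return int(hashlib.sha1(s.encode()).hexdigest()[:8], 16) / 16**8

    def force_refetch(page_url, client_ip, day_rate=0.1, user_rate=0.05):
        """Decide whether this request should bypass caches and be counted (sketch)."""
        sampled_day = _frac(page_url + '|' + date.today().isoformat()) < day_rate
        sampled_user = _frac(page_url + '|' + client_ip) < user_rate
        return sampled_day or sampled_user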
Query log statistics [SHMM’98]
request = new query or new result screen of old query
session = a series of requests by one user close together in time
l analyzed ~1B AltaVista requests consisting of:
   – ~840 M non-empty requests
   – ~575 M non-empty queries
            ➪ 1.5 requests per query on average
    – ~153 M unique non-empty queries
            ➪ each query is repeated 3.8 times on average, but
              64% of queries occur only once
    – ~285 M user sessions
            ➪ 2.9 requests and 2.0 queries per session on average

M. Henzinger            Web Information Retrieval           151
Lots of things we didn’t even
               touch...

l Clustering = group similar items (documents or queries)
  together ↔ unsupervised learning
l Categorization = assign items to predefined categories ↔
  supervised learning
l Classic IR issues that are not substantially different in the
  Web context:
   – Latent semantic indexing -- associate “concepts” to
     queries and documents and match on concepts
   – Summarization: abstract the most important parts of text
     content. (See [TS’98] for the Web context)
l Many others …




M. Henzinger              Web Information Retrieval               152
Final conclusions
l We talked mostly about IR methods and tools that
   – take advantage of the Web particularities
   – mitigate some of the difficulties
l Web IR offers plenty of interesting problems…
       … but not on a silver platter
l Almost every area of algorithms research is relevant

l Great need for good algorithm engineering!




M. Henzinger             Web Information Retrieval       153
Acknowledgements

l An earlier version of this talk was created in collaboration with Andrei
  Broder and was presented at the 39th IEEE Symposium on Foundations of
  Computer Science, 1998.
l Thanks for useful comments to Krishna Bharat, Moses Charikar, Edith Cohen,
  Jeff Dean, Uri Feige, Puneet Kumar, Hannes Marais, Michael Mitzenmacher,
  Prabhakar Raghavan, and Lyle Ramshaw.




M. Henzinger              Web Information Retrieval              154

Weitere ähnliche Inhalte

Ähnlich wie wEb infomation retrieval

랭킹 최적화를 넘어 인간적인 검색으로 - 서울대 융합기술원 발표
랭킹 최적화를 넘어 인간적인 검색으로  - 서울대 융합기술원 발표랭킹 최적화를 넘어 인간적인 검색으로  - 서울대 융합기술원 발표
랭킹 최적화를 넘어 인간적인 검색으로 - 서울대 융합기술원 발표Jin Young Kim
 
Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석datasciencekorea
 
Birds Bears and Bs:Optimal SEO for Today's Search Engines
Birds Bears and Bs:Optimal SEO for Today's Search EnginesBirds Bears and Bs:Optimal SEO for Today's Search Engines
Birds Bears and Bs:Optimal SEO for Today's Search EnginesMarianne Sweeny
 
Optimal SEO (Marianne Sweeny)
Optimal SEO (Marianne Sweeny) Optimal SEO (Marianne Sweeny)
Optimal SEO (Marianne Sweeny) uxpa-dc
 
Defensa.V11
Defensa.V11Defensa.V11
Defensa.V11promanas
 
Information Quality Assessment in the WIQ-EI EU Project
Information Quality Assessment in the WIQ-EI EU ProjectInformation Quality Assessment in the WIQ-EI EU Project
Information Quality Assessment in the WIQ-EI EU Projectwiqei
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273Abutest
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273Abutest
 
Information Quality Assessment in the WIQ-EI EU Project
Information Quality Assessment in the WIQ-EI EU ProjectInformation Quality Assessment in the WIQ-EI EU Project
Information Quality Assessment in the WIQ-EI EU ProjectElisabeth Lex
 
Link analysis for web search
Link analysis for web searchLink analysis for web search
Link analysis for web searchEmrullah Delibas
 
Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...Frank van Harmelen
 
Session 0.0 poster minutes madness
Session 0.0   poster minutes madnessSession 0.0   poster minutes madness
Session 0.0 poster minutes madnesssemanticsconference
 
Internet 信息检索中的数学
Internet 信息检索中的数学Internet 信息检索中的数学
Internet 信息检索中的数学Xu jiakon
 
Semantics And Multimedia
Semantics And MultimediaSemantics And Multimedia
Semantics And MultimediaPeter Berger
 

Ähnlich wie wEb infomation retrieval (20)

How do we search? Themes and challenges
How do we search? Themes and challengesHow do we search? Themes and challenges
How do we search? Themes and challenges
 
랭킹 최적화를 넘어 인간적인 검색으로 - 서울대 융합기술원 발표
랭킹 최적화를 넘어 인간적인 검색으로  - 서울대 융합기술원 발표랭킹 최적화를 넘어 인간적인 검색으로  - 서울대 융합기술원 발표
랭킹 최적화를 넘어 인간적인 검색으로 - 서울대 융합기술원 발표
 
Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Birds Bears and Bs:Optimal SEO for Today's Search Engines
Birds Bears and Bs:Optimal SEO for Today's Search EnginesBirds Bears and Bs:Optimal SEO for Today's Search Engines
Birds Bears and Bs:Optimal SEO for Today's Search Engines
 
Optimal SEO (Marianne Sweeny)
Optimal SEO (Marianne Sweeny) Optimal SEO (Marianne Sweeny)
Optimal SEO (Marianne Sweeny)
 
Defensa.V11
Defensa.V11Defensa.V11
Defensa.V11
 
Information Quality Assessment in the WIQ-EI EU Project
Information Quality Assessment in the WIQ-EI EU ProjectInformation Quality Assessment in the WIQ-EI EU Project
Information Quality Assessment in the WIQ-EI EU Project
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Information Quality Assessment in the WIQ-EI EU Project
Information Quality Assessment in the WIQ-EI EU ProjectInformation Quality Assessment in the WIQ-EI EU Project
Information Quality Assessment in the WIQ-EI EU Project
 
Computing for Human Experience [v3, Aug-Oct 2010]
Computing for Human Experience [v3, Aug-Oct 2010]Computing for Human Experience [v3, Aug-Oct 2010]
Computing for Human Experience [v3, Aug-Oct 2010]
 
Link analysis for web search
Link analysis for web searchLink analysis for web search
Link analysis for web search
 
Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...
 
Session 0.0 poster minutes madness
Session 0.0   poster minutes madnessSession 0.0   poster minutes madness
Session 0.0 poster minutes madness
 
Mazhiming
MazhimingMazhiming
Mazhiming
 
Internet 信息检索中的数学
Internet 信息检索中的数学Internet 信息检索中的数学
Internet 信息检索中的数学
 
Web mining
Web miningWeb mining
Web mining
 
WISE2019 presentation
WISE2019 presentationWISE2019 presentation
WISE2019 presentation
 
Semantics And Multimedia
Semantics And MultimediaSemantics And Multimedia
Semantics And Multimedia
 

Mehr von George Ang

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...George Ang
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarizationGeorge Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman codingGeorge Ang
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textGeorge Ang
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿George Ang
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势George Ang
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程George Ang
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qqGeorge Ang
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagementGeorge Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享George Ang
 

Mehr von George Ang (20)

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qq
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
 

Kürzlich hochgeladen

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Kürzlich hochgeladen (20)

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

wEb infomation retrieval

  • 1. Tutorial: Web Information Retrieval Monika Henzinger http://www.henzinger.com/~monika
  • 2. What is this talk about? l Topic: – Algorithms for retrieving information on the Web l Non-Topics: – Algorithmic issues in classic information retrieval (IR), e.g. stemming – String algorithms, e.g. approximate matching – Other algorithmic issues related to the Web: • networking & routing • cacheing • security • e-commerce M. Henzinger Web Information Retrieval 2
  • 3. Preliminaries l Each Web page has a unique URL Page name http://theory.stanford.edu/~focs98/tutorials.html Access Host name = protocol Domain name Path l (Hyper) link = pointer from one page to another, loads second page if clicked on l In this talk: document = (Web) page M. Henzinger Web Information Retrieval 3
  • 4. Example of a query princess diana Engine 1 Engine 2 Engine 3 Relevant and Relevant but Not relevant high quality low quality index pollution M. Henzinger Web Information Retrieval 4
  • 5. Outline l Classic IR vs. Web IR l Some IR tools specific to the Web Details on - Ranking For each type - Duplicate elimination - Search-by-example – Examples - Measuring search engine index quality – Algorithmic issues l Conclusions Open problems M. Henzinger Web Information Retrieval 5
  • 6. Classic IR l Input: Document collection l Goal: Retrieve documents or text with information content that is relevant to user’s information need l Two aspects: 1. Processing the collection 2. Processing queries (searching) l Reference Texts: SW’97, BR’99 M. Henzinger Web Information Retrieval 6
  • 7. Determining query results “model” = strategy for determining which documents to return l Logical model: String matches plus AND, OR, NOT l Vector model (Salton et al.): – Documents and query represented as vector of terms – Vector entry i = weight of term i = function of frequencies within document and within collection – Similarity of document & query = cosine of angle of their vectors – Query result: documents ordered by similarity l Other models used in IR but not discussed here: – Probabilistic model, cognitive model, … M. Henzinger Web Information Retrieval 7
  • 8. IR on the Web l Input:The publicly accessible Web l Goal: Retrieve high quality pages that are relevant to user’s need – Static (files: text, audio, … ) – Dynamically generated on request: mostly data base access l Two aspects: 1. Processing and representing the collection • Gathering the static pages • “Learning” about the dynamic pages 2. Processing queries (searching) M. Henzinger Web Information Retrieval 8
  • 9. What’s different about the Web? (1) Pages: l Bulk …………………… >1B (12/99) [GL’99] l Lack of stability……….. Estimates: 23%/day, 38%/week [CG’99] l Heterogeneity – Type of documents .. Text, pictures, audio, scripts,… – Quality ……………… From dreck to ICDE papers … – Language ………….. 100+ l Duplication – Syntactic……………. 30% (near) duplicates – Semantic……………. ?? l Non-running text……… many home pages, bookmarks, ... l High linkage…………… ≥ 8 links/page in the average M. Henzinger Web Information Retrieval 9
  • 10. Typical home page: non-running text M. Henzinger Web Information Retrieval 10
  • 11. Typical home page: Non-running text M. Henzinger Web Information Retrieval 11
  • 12. The big challenge Meet the user needs given the heterogeneity of Web pages M. Henzinger Web Information Retrieval 12
  • 13. What’s different about the Web? (2) Users: l Make poor queries l Specific behavior – Short (2.35 terms avg) – 85% look over one result – Imprecise terms screen only – Sub-optimal syntax – 78% of queries are not (80% queries without modified operator) – Follow links – Low effort – See various user studies l Wide variance in in CHI, Hypertext, SIGIR, etc. – Needs – Knowledge – Bandwidth M. Henzinger Web Information Retrieval 13
  • 14. The bigger challenge Meet the user needs given the heterogeneity of Web pages and the poorly made queries. M. Henzinger Web Information Retrieval 14
  • 15. Why don’t the users get what they want? Example I need to get rid of mice in User need the basement Translation User request What’s the best way to problems (verbalized) trap mice alive? Query to mouse trap IR system Polysemy Synonymy Results Software, toy cars, inventive products, etc M. Henzinger Web Information Retrieval 15
  • 16. Google output: mouse trap J M. Henzinger Web Information Retrieval 16
  • 17. Google output: trap mice J J J J M. Henzinger Web Information Retrieval 17
  • 18. The bright side: Web advantages vs. classic IR User Collection/tools l Many tools available l Redundancy l Personalization l Hyperlinks l Interactivity (refine the l Statistics query if needed) – Easy to gather – Large sample sizes l Interactivity (make the users explain what they want) M. Henzinger Web Information Retrieval 18
  • 19. Quantifying the quality of results l How to evaluate different strategies? l How to compare different search engines? M. Henzinger Web Information Retrieval 19
  • 20. Classic evaluation of IR systems We start from a human made relevance judgement for each (query, page) pair and compute: l Precision: % of returned pages that are relevant. Relevant Returned pages pages l Recall: % of relevant pages that are returned. l Precision at (rank) 10: % of top 10 pages that are relevant l Relative Recall: % of relevant pages found by some means that are returned All pages M. Henzinger Web Information Retrieval 20
  • 21. Evaluation in the Web context l Quality of pages varies widely l We need both relevance and high quality = value of page. Ł Precision at 10: % of top 10 pages that are valuable … M. Henzinger Web Information Retrieval 21
  • 22. Web IR tools l General-purpose search engines: – direct: AltaVista, Excite, Google, Infoseek, Lycos, …. – Indirect (Meta-search): MetaCrawler, DogPile, AskJeeves, InvisibleWeb, … l Hierarchical directories: Yahoo!, all portals. l Specialized search engines: – Home page finder: Ahoy Database mostly built – Shopping robots: Jango, Junglee,… by hand – Applet finders M. Henzinger Web Information Retrieval 22
  • 23. Web IR tools (cont...) l Search-by-example: Alexa’s “What’s related”, Excite’s “More like this”, Google’s “Googlescout”, etc. l Collaborative filtering: Firefly, GAB, … l … l Meta-information: – Search Engine Comparisons – Query log statistics –… M. Henzinger Web Information Retrieval 23
  • 24. General purpose search engines l Search engines’ components: – Spider = Crawler -- collects the documents – Indexer -- process and represents the data – Search interface -- answers queries M. Henzinger Web Information Retrieval 24
  • 25. Algorithmic issues related to search engines l Collecting l Processing and l Processing documents representing the queries data – Priority – Query- – Query- dependent – Load independent balancing ranking ranking • Internal – Graph – Duplicate representation elimination • External – Index building – Query – Trap refinement avoidance – Duplicate elimination – Clustering – ... – Categorization – ... –… M. Henzinger Web Information Retrieval 25
  • 26. Ranking l Goal: order the answers to a query in decreasing order of value – Query-independent: assign an intrinsic value to a document, regardless of the actual query – Query-dependent: value is determined only wrt a particular query. – Mixed: combination of both valuations. l Examples – Query-independent: length, vocabulary, publication data, number of citations (indegree), etc. – Query-dependent: cosine measure M. Henzinger Web Information Retrieval 26
  • 27. Some ranking criteria l Content-based techniques (variant of term vector model or probabilistic model) – mostly query-dependent l Ad-hoc factors (anti-porn heuristics, publication/location data, ...) – mostly query-independent l Human annotations l Connectivity-based techniques – Query-independent: PageRank [PBMW’98, BP’98], indegree [CK’97], … – Query-dependent: HITS [K’98], … M. Henzinger Web Information Retrieval 27
  • 28. Connectivity analysis l Idea: Mine hyperlink information of the Web l Assumptions: – Links often connect related pages – A link between pages is a recommendation l Classic IR work (citations = links) a.k.a. “Bibliometrics” [K’63, G’72, S’73,…] l Socio-metrics [K’53, MMSM’86,…] l Many Web related papers build on this idea [PPR’96, AMM’97, S’97, CK’97, K’98, BP’98,…] M. Henzinger Web Information Retrieval
  • 29. Graph representation for the Web l A node for each page u l A directed edge (u,v) if page u contains a hyperlink to page v. M. Henzinger Web Information Retrieval 29
  • 30. Query-independent ranking: Motivation for PageRank l Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is successor of A) Ł Quality of a page is related to its in-degree l Recursion: Quality of a page is related to – its in-degree, and to – the quality of pages linking to it Ł PageRank [BP ‘98] M. Henzinger Web Information Retrieval 30
  • 31. Definition of PageRank [BP’98] l Consider the following infinite random walk (surf): – Initially the surfer is at a random page – At each step, the surfer proceeds • to a randomly chosen web page with probability d • to a randomly chosen successor of the current page with probability 1-d l The PageRank of a page p is the fraction of steps the surfer spends at p in the limit. M. Henzinger Web Information Retrieval 31
  • 32. PageRank (cont.) Said differently: l Transition probability matrix is d × U + (1 − d ) × A where U is the uniform distribution and A is adjacency matrix (normalized) l PageRank = stationary probability for this Markov chain, i.e. d PageRank (u ) = + (1 − d ) ∑ PageRank (v) / outdegree(v) n ( v ,u )∈E where n is the total number of nodes in the graph l Used as one of the ranking criteria in Google M. Henzinger Web Information Retrieval 32
  • 33. Output from Google: princess diana J J J J M. Henzinger Web Information Retrieval 33
  • 34. Query-dependent ranking: the neighborhood graph l Subgraph associated to each query Query Results Back Set = Start Set Forward Set b1 Result1 f1 b2 f2 Result2 ... … ... bm fs Resultn An edge for each hyperlink, but no edges within the same host M. Henzinger Web Information Retrieval 34
  • 35. HITS [K’98] l Goal: Given a query find: – Good sources of content (authorities) – Good sources of links (hubs) M. Henzinger Web Information Retrieval 35
  • 36. Intuition l Authority comes from in-edges. Being a good hub comes from out-edges. l Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities. M. Henzinger Web Information Retrieval 36
  • 37. HITS details Repeat until HUB and AUTH converge: Normalize HUB and AUTH HUB[v] := Σ AUTH[ui] for all ui with Edge(v, ui) AUTH[v] := Σ HUB[wi] for all wi with Edge(wi, v) v w1 u1 w2 A H u2 ... ... wk uk M. Henzinger Web Information Retrieval 37
  • 38. Output from HITS: jobs 1. www.ajb.dni.uk - British career website J 2. www.britnet.co.uk/jobs.htm 3. www.monster.com - US career website J 4. www.careermosaic.com - US career website J 5. plasma-gate.weizmann.ac.il/Job… 6. www. jobtrak.com - US career website J 7. www.occ.com - US career website J 8. www.jobserve.com - US career website J 9. www.allny.com/jobs.html - jobs in NYC 10.www.commarts.com/bin/… - US career website J M. Henzinger Web Information Retrieval 38
  • 39. Output from HITS: +jaguar +car 1. www.toyota.com 2. www.chryslercars.com 3. www.vw.com 4. www.jaguravehicles.com J 5. www.dodge.com 6. www.usa.mercedes-benz.com 7. www.buick.com 8. www.acura.com 9. www.bmw.com 10. www.honda.com M. Henzinger Web Information Retrieval 39
  • 40. Problems & solutions l Some edges are “wrong” -- not a recommendation: Author A 1/3 Author B – multiple edges from same author – automatically generated 1/3 – spam, etc. Solution: Weight edges to limit influence 1/3 l Topic drift – Query: +jaguar +car Result: pages about cars in general Solution: Analyze content and assign topic scores to nodes M. Henzinger Web Information Retrieval 40
  • 41. Modified HITS algorithms Repeat until HUB and AUTH converge: Normalize HUB and AUTH HUB[v] := Σ AUTH[ui] TopicScore[ui] weight[v,ui] for all ui with Edge(v, ui) AUTH[v] := Σ HUB[wi] TopicScore[wi] weight[wi,v] for all wi with Edge(wi, v) [CDRRGK’98, BH’98, CDGKRRT’98] M. Henzinger Web Information Retrieval 41
  • 42. Output from modified HITS: +jaguar +car 1. www.jaguarcars.com/ - official website of Jaguar cars J 2. www.collection.co.uk/ - official Jaguar accessories J 3. home.sn.no/.../jaguar.html - the Jaguar Enthusiast Place J 4. www.terrysjag.com/ - Jaguar Parts J 5. www.jaguarvehicles.com/ - official website of Jaguar cars J 6. www.jagweb.com/ - for companies specializing in Jags. J 7. jagweb.com/jdht/jdht.html - articles about Jaguars and Daimler 8. www.jags.org/ - Oldest Jaguar Club J 9. connection.se/jagsport/ - Sports car version of Jaguar MK II 10. users.aol.com/.../jane.htm -Jaguar Association of New England Ltd. M. Henzinger Web Information Retrieval 42
  • 43. User study [BH’98] Valuable pages within 10 top answers (averaged over 28 topics) 10 9 8 Original 7 6 5 Edge 4 Weighting 3 2 EW + 1 Content 0 Analysis Authorities Hubs M. Henzinger Web Information Retrieval 43
  • 44. PageRank vs. HITS l Computation: l Computation: – Expensive – Expensive – Once for all documents – Requires computation and queries (offline) for each query l Query-independent – l Query-dependent requires combination with query-dependent criteria l Hard to spam l Relatively easy to spam l Quality depends on quality of start set l Gives hubs as well as authorities M. Henzinger Web Information Retrieval 44
  • 45. Open problems l Compare performance of query-dependent and query- independent connectivity analysis l Exploit order of links on the page (see e.g. [CDGKRRT’98],[DH’99]) l Both Google and HITS compute principal eigenvector. What about non-principal eigenvector? ([K’98]) l Derive other graphs from the hyperlink structure … M. Henzinger Web Information Retrieval 45
  • 46. Algorithmic issues related to search engines l Collecting l Processing and l Processing documents representing the queries data – Priority Query- Query- dependent – Load independent balancing ranking ranking • Internal – Graph – Duplicate representation elimination • External – Index building – Query – Trap refinement avoidance – Duplicate elimination – Clustering – ... – Categorization – ... –… M. Henzinger Web Information Retrieval 46
  • 47. More on graph representation l Graphs derived from the hyperlink structure of the Web: – Node =page – Edge (u,v) iff pages u and v are related in a specific way (directed or not) l Examples of edges: – iff u has hyperlink to v – iff there exists a page w pointing to both u and v – iff u is often retrieved within x seconds after v –… M. Henzinger Web Information Retrieval 47
  • 48. Graph representation usage l Ranking algorithms l Visualization/Navigation – PageRank – Mapuccino – HITS [MJSUZB’97] –… – WebCutter [MS’97] l Search-by-example –… [DH’99] l Structured Web query l Categorization of Web tools pages – WebSQL [AMM’97] – [CDI’98] – WebQuery [CK’97] –… M. Henzinger Web Information Retrieval 48
  • 49. Example: SRC Connectivity Server [BBHKV’98] Directed edges = Hyperlinks l Goal: Support two basic operations for all URLs collected by AltaVista – InEdges(URL u, int k) • Return k URLs pointing to u – OutEdges(URL u, int k) • Return k URLs that u points to l Difficulties: – Memory usage (~180 M nodes, 1B edges) – Preprocessing time (days …) – Query time (~ 0.0001s/result URL) M. Henzinger Web Information Retrieval 49
• 50. URL database
l Sorted list of URLs is 8.7 GB (≈ 48 bytes/URL)
l Delta encoding reduces it to 3.8 GB (≈ 21 bytes/URL)

  Original text                  Delta encoding (size of shared prefix, suffix)   Node ID
  www.foobar.com/                0   www.foobar.com/                              1
  www.foobar.com/gandalf.htm     15  gandalf.htm                                  26
  www.foograb.com/               7   grab.com/                                    41

M. Henzinger Web Information Retrieval
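A small sketch of the delta (shared-prefix) encoding idea on this slide: each URL is stored as the length of the prefix it shares with the previous URL plus the remaining suffix. The field layout and byte counts of the real Connectivity Server differ; this only illustrates the encoding.

```python
def delta_encode(sorted_urls):
    encoded = []
    prev = ""
    for url in sorted_urls:
        shared = 0
        while shared < min(len(prev), len(url)) and prev[shared] == url[shared]:
            shared += 1
        encoded.append((shared, url[shared:]))   # (shared prefix length, suffix)
        prev = url
    return encoded

def delta_decode(encoded):
    urls, prev = [], ""
    for shared, suffix in encoded:
        url = prev[:shared] + suffix
        urls.append(url)
        prev = url
    return urls

urls = ["www.foobar.com/",
        "www.foobar.com/gandalf.htm",
        "www.foograb.com/"]
enc = delta_encode(urls)   # [(0, 'www.foobar.com/'), (15, 'gandalf.htm'), (7, 'grab.com/')]
assert delta_decode(enc) == urls
```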
• 51. Graph data structure
Diagram: a Node Table in which each entry holds a pointer to the node’s URL in the URL Database, a pointer into the Inlist Table, and a pointer into the Outlist Table.
M. Henzinger Web Information Retrieval 51
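An illustrative in-memory version of the structure sketched on this slide; the real Connectivity Server packs these tables into compact, delta-encoded arrays, so this is only a functional sketch of the InEdges/OutEdges interface.

```python
class LinkDatabase:
    def __init__(self):
        self.url_of = []    # node id -> URL (delta-encoded in the real system)
        self.id_of = {}     # URL -> node id
        self.inlist = []    # node id -> ids of nodes pointing to it
        self.outlist = []   # node id -> ids of nodes it points to

    def node(self, url):
        if url not in self.id_of:
            self.id_of[url] = len(self.url_of)
            self.url_of.append(url)
            self.inlist.append([])
            self.outlist.append([])
        return self.id_of[url]

    def add_link(self, src_url, dst_url):
        s, t = self.node(src_url), self.node(dst_url)
        self.outlist[s].append(t)
        self.inlist[t].append(s)

    def in_edges(self, url, k):     # InEdges(URL u, int k)
        return [self.url_of[i] for i in self.inlist[self.id_of[url]][:k]]

    def out_edges(self, url, k):    # OutEdges(URL u, int k)
        return [self.url_of[i] for i in self.outlist[self.id_of[url]][:k]]
```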
• 52. Web graph factoid
l >1B nodes (12/99), mean indegree is ~ 8
l Zipfian degree distributions [KRRT’99]:
– Fin(i) = fraction of pages with indegree i:  Fin(i) ~ 1 / i^2.1
– Fout(i) = fraction of pages with outdegree i:  Fout(i) ~ 1 / i^2.38
M. Henzinger Web Information Retrieval 52
  • 53. Open problems l Graph compression: How much compression possible without significant run-time penalty? – Efficient algorithms to find frequently repeated small structures (e.g. wheels, K2,2) l External memory graph algorithms: How to assign the graph representation to pages so as to reduce paging? (see [NGV’96, AAMVV’98]) l Stringology: Less space for URL database? Faster algorithms for URL to node translation? l Dynamic data structures: How to make updates efficient at the same space cost? M. Henzinger Web Information Retrieval 53
• 54. Algorithmic issues related to search engines
l Collecting documents:
– Priority
– Load balancing (internal, external)
– Trap avoidance
– Duplicate elimination
– …
l Processing and representing the data:
– Query-independent ranking
– Graph representation
– Index building
– Categorization
– …
l Processing queries:
– Query-dependent ranking
– Duplicate elimination
– Query refinement
– Clustering
– …
M. Henzinger Web Information Retrieval 54
• 55. Index building
l Inverted index data structure: Consider all documents concatenated into one huge document
– For each word keep an ordered array of all its positions in the document, potentially compressed
l Allows efficient implementation of AND, OR, and AND NOT operations
M. Henzinger Web Information Retrieval 55
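A minimal sketch of an inverted index in this spirit, simplified to store document ids instead of word positions so that AND / OR / AND NOT become set operations on posting lists; real engines keep positions and compress the lists.

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # word -> list of doc ids containing it

    def add(self, doc_id, text):
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def AND(self, w1, w2):
        return sorted(set(self.postings[w1]) & set(self.postings[w2]))

    def OR(self, w1, w2):
        return sorted(set(self.postings[w1]) | set(self.postings[w2]))

    def AND_NOT(self, w1, w2):
        return sorted(set(self.postings[w1]) - set(self.postings[w2]))

idx = InvertedIndex()
idx.add(1, "jaguar car dealer")
idx.add(2, "jaguar cat habitat")
print(idx.AND("jaguar", "car"))       # [1]
print(idx.AND_NOT("jaguar", "car"))   # [2]
```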
• 56. Algorithmic issues related to search engines
l Collecting documents:
– Priority
– Load balancing (internal, external)
– Trap avoidance
– Duplicate elimination
– …
l Processing and representing the data:
– Query-independent ranking
– Graph representation
– Index building
– Categorization
– …
l Processing queries:
– Query-dependent ranking
– Duplicate elimination
– Query refinement
– Clustering
– …
M. Henzinger Web Information Retrieval 56
  • 57. Reasons for duplicates l Proliferation of almost equal documents on the Web: – Legitimate: Mirrors, local copies, updates, etc. – Malicious: Spammers, spider traps, dynamic URLs – Mistaken: Spider errors l Approximately 30% of the pages on the Web are (near) duplicates. [BGMZ’97,SG’98] M. Henzinger Web Information Retrieval 57
  • 58. Uses of duplicate information l Smarter crawlers l Smarter web proxies – Better caching – Handling broken links l Smarter search engines – no duplicate answers – smarter connectivity analysis – less RAM and disk M. Henzinger Web Information Retrieval 58
  • 59. 2 Types of duplicate filtering l Fine-grain: Finding near-duplicate documents l Coarse-grain: Finding near-duplicate hosts (mirrors) M. Henzinger Web Information Retrieval 59
• 60. Fine-grain: Basic mechanism
l Must filter both duplicate and near-duplicate documents
l Computing pair-wise edit distance would take forever
l Preferable to store only a short sketch for each document
M. Henzinger Web Information Retrieval
  • 61. The basics of a solution [B’97],[BGMZ’97] 1. Reduce the problem to a set intersection problem 2. Estimate intersections by sampling minima. M. Henzinger Web Information Retrieval
• 62. Shingling
l Shingle = fixed-size sequence of w contiguous words
l Example (w = 4): “a rose is a rose is a rose” yields the shingles “a rose is a”, “rose is a rose”, “is a rose is”, “a rose is a”, “rose is a rose”
l Document D → shingling → set of shingles → fingerprinting → set of 64-bit fingerprints
M. Henzinger Web Information Retrieval 62
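A rough sketch of w-shingling: slide a window of w contiguous words over the document and fingerprint each window to a 64-bit value. The original work uses Rabin fingerprints; the hash used here is only a stand-in.

```python
import hashlib

def shingles(text, w=4):
    words = text.lower().split()
    result = set()
    for i in range(len(words) - w + 1):
        shingle = " ".join(words[i:i + w])
        # 64-bit fingerprint of the shingle (first 8 bytes of a SHA-1 digest).
        fp = int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:8], "big")
        result.add(fp)
    return result

s = shingles("a rose is a rose is a rose", w=4)
# Distinct 4-word shingles: "a rose is a", "rose is a rose", "is a rose is"
print(len(s))   # 3
```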
• 63. Defining resemblance
Documents D1 and D2 with shingle sets S1 and S2:
resemblance(D1, D2) = |S1 ∩ S2| / |S1 ∪ S2|
M. Henzinger Web Information Retrieval 63
• 64. Sampling minima
l Apply a random permutation σ to the set [0..2^64]
l Crucial fact: Let α = min(σ(S1)) and β = min(σ(S2)). Then
Pr(α = β) = |S1 ∩ S2| / |S1 ∪ S2|
M. Henzinger Web Information Retrieval 64
  • 65. Implementation l Choose a set of t random permutations of U l For each document keep a sketch S(D) consisting of t minima = samples l Estimate resemblance of A and B by counting common samples l The permutations should be from a min-wise independent family of permutations. See [BCFM’97] for the theory of mwi permutations. M. Henzinger Web Information Retrieval
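A minimal sketch of this sampling estimator. The t “random permutations” are approximated here by hashing each shingle fingerprint with t different salts, which is not a true min-wise independent family; it only illustrates how the fraction of agreeing minima estimates resemblance.

```python
import hashlib

def sketch(shingle_fps, t=84):
    """shingle_fps: set of 64-bit shingle fingerprints; returns t minima."""
    mins = []
    for i in range(t):
        salt = i.to_bytes(4, "big")
        mins.append(min(
            int.from_bytes(hashlib.sha1(salt + x.to_bytes(8, "big")).digest()[:8], "big")
            for x in shingle_fps))
    return mins

def estimated_resemblance(sk1, sk2):
    # Fraction of positions where the two sketches agree.
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)
```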
• 66. If we need only high resemblance
l Divide sketch into k groups of s samples (t = k * s)
l Fingerprint each group ⇒ feature
l Two documents are fungible if they have at least r common features.
l Want: Fungibility ⇔ Resemblance above fixed threshold ρ
M. Henzinger Web Information Retrieval
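A small sketch of this grouping step, assuming the sketches of the two documents were built with the same permutations so that group i of one document is compared against group i of the other; the group fingerprint function is illustrative.

```python
import hashlib

def features(sketch, k, s):
    """Split a sketch of t = k*s samples into k groups and fingerprint each group."""
    assert len(sketch) == k * s
    feats = []
    for g in range(k):
        group = sketch[g * s:(g + 1) * s]
        digest = hashlib.sha1(repr(group).encode()).digest()[:8]
        feats.append(int.from_bytes(digest, "big"))   # one 8-byte feature per group
    return feats

def fungible(feats1, feats2, r=2):
    # Documents are fungible if at least r corresponding features match.
    return sum(a == b for a, b in zip(feats1, feats2)) >= r
```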
  • 67. Real implementation l ρ = 90%. In a 1000 word page with shingle length = 8 this corresponds to • Delete a paragraph of about 50-60 words. • Change 5-6 random words. l Sketch size t = 84, divide into k = 6 groups of s = 14 samples l 8 bytes fingerprints → we store only 6 x 8 = 48 bytes/document l Threshold r = 2 M. Henzinger Web Information Retrieval
• 68. Probability that two documents are deemed fungible
Two documents with resemblance ρ:
l Using the full sketch (t = k·s samples, threshold r·s matching samples):
P = Σ from i = r·s to k·s of C(k·s, i) · ρ^i · (1−ρ)^(k·s−i)
l Using features (k features, threshold r matching features):
P = Σ from i = r to k of C(k, i) · ρ^(s·i) · (1−ρ^s)^(k−i)
M. Henzinger Web Information Retrieval 68
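A quick numerical check of the two formulas for the parameters on the previous slide (ρ = 0.9, k = 6 groups of s = 14 samples, threshold r = 2), assuming the reconstruction of the formulas above is correct.

```python
from math import comb

def p_full_sketch(rho, k, s, r):
    n = k * s
    return sum(comb(n, i) * rho**i * (1 - rho)**(n - i) for i in range(r * s, n + 1))

def p_features(rho, k, s, r):
    return sum(comb(k, i) * (rho**s)**i * (1 - rho**s)**(k - i) for i in range(r, k + 1))

print(p_full_sketch(0.9, 6, 14, 2))   # probability of fungibility using the full sketch
print(p_features(0.9, 6, 14, 2))      # probability of fungibility using k features
```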
• 69. Features vs. full sketch
Plot: probability that two pages are deemed fungible as a function of resemblance, using the full sketch vs. using features.
M. Henzinger Web Information Retrieval
  • 70. Fine-grain duplicate elimination: open problems and related work l Best way of grouping samples for a given threshold and/or for multiple thresholds? l Efficient ways to find in a data base pairs of records that share many attributes. Best approach? l Min-wise independent permutations -- lots of open questions. l Other applications possible (images, sounds, ...) -- need translation into set intersection problem. l Related work: M’94, BDG’95, SG’95, H’96, FSGMU’98 M. Henzinger Web Information Retrieval 70
  • 71. 2 Types of duplicate filtering Fine-grain: Finding near-duplicate documents l Coarse-grain: Finding near-duplicate hosts (mirrors) M. Henzinger Web Information Retrieval 71
  • 72. Input: set of URLs l Input: – Subset of URLs on various hosts, collected e.g. by search engine crawl or web proxy – No content of pages pointed to by URLs except each page is labeled with its out-links l Goal: Find pairs of hosts that mirror content M. Henzinger Web Information Retrieval 72
• 73. Example
Two hosts with largely mirrored content: www.synthesis.org and synthesis.stanford.edu
Sample URLs from www.synthesis.org:
www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
www.synthesis.org/Docs/annual.report96.final.html
www.synthesis.org/Docs/cicee-berlin-paper.html
www.synthesis.org/Docs/myr5
www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
www.synthesis.org/Docs/myr5/cs/cs-meta.html
www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
www.synthesis.org/Docs/myr5/mech/mech-take-home.html
www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
www.synthesis.org/Docs/yr5ar
www.synthesis.org/Docs/yr5ar/assess
www.synthesis.org/Docs/yr5ar/cicee
www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html
Sample URLs from synthesis.stanford.edu:
synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-classroom.html
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced-circ-anal.html
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-mechatron.html
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-studies.html
synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html
synthesis.stanford.edu/Docs/annual.report96.final.html
synthesis.stanford.edu/Docs/annual.report96.final_fn.html
synthesis.stanford.edu/Docs/myr5/assessment
synthesis.stanford.edu/Docs/myr5/assessment/assessment-main.html
synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-A6-E25.html
synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
synthesis.stanford.edu/Docs/myr5/cicee
synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html
M. Henzinger Web Information Retrieval 73
  • 74. Coarse-grain: Basic mechanism l Must filter both duplicate and near-duplicate mirrors l Pair-wise testing would take forever l Both high precision (not outputting wrong mirrors) and high recall (finding almost all mirrors) are important M. Henzinger Web Information Retrieval
  • 75. A definition of mirroring Host1 and Host2 are mirrors iff For all paths p such that http://Host1/p is a web page, http://Host2/p exists with duplicate (or near-duplicate) content, and vice versa. M. Henzinger Web Information Retrieval 75
  • 76. The basics of a solution [BBDH’99] 1. Pre-filter to create a small set of pairs of potential mirrors (pre-filtering step) 2. Test each pair of potential mirrors (testing step) 3. Use different pre-filtering algorithms to improve recall M. Henzinger Web Information Retrieval
• 77. Testing step
l Test root pages + x URLs from each host sample
l If one test returns “not near-duplicate” then hosts are not mirrors
l If the root pages and more than c·x URLs from each host sample are near-identical then the hosts are mirrors, else they are not mirrors
M. Henzinger Web Information Retrieval 77
  • 78. Pre-filtering step l Goal: Output quickly list of pairs of potential mirrors containing – many true mirror pairs (high recall) – not many non-mirror pairs (high precision) l Note: 2-sided error is allowed – Type-1: true mirror pairs might be missing in output – Type-2: non-mirror pair might be output l Testing of host pairs will eliminate type-2 errors, but not type-1 errors M. Henzinger Web Information Retrieval 78
  • 79. Different pre-filtering techniques l IP-based l URL-string based l URL-string and hyperlink based l Hostname and hyperlink based M. Henzinger Web Information Retrieval 79
• 80. Problem with IP addresses
Example: eliza-iii.ibex.co.nz and pixel.ibex.co.nz both resolve to the IP address 203.29.170.23
M. Henzinger Web Information Retrieval 80
• 81. Number of hosts with same IP address vs. mirror probability
Plot: probability of mirroring (y-axis, 0 to 0.9) as a function of the number of hosts sharing the same IP address (x-axis, 2 to >9).
M. Henzinger Web Information Retrieval 81
  • 82. IP based pre-filtering algorithms l IP4: Cluster hosts based on IP address – Enumerate pairs from clusters in increasing cluster size (max 200 pairs) l IP3: Cluster hosts based on first 3 octets of their IP address – Enumerate pairs from clusters in increasing cluster size (max 5 pairs) M. Henzinger Web Information Retrieval 82
• 83. URL string based pre-filtering algorithms
Information extracted from URL strings:
l Similar hostnames: might belong to same organization
l Similar paths: might have replicated directories
⇒ extract “features” for a host from its URL strings and test similarity
Similarity Testing Approach:
l Feature vector for each host, similar to term vector for a document:
– Host corresponds to document
– Feature corresponds to term
l Similarity of hosts = cosine of angle of feature vectors
M. Henzinger Web Information Retrieval 83
  • 84. URL string based algorithms (cont.) – paths: Features are paths: e.g., /staff/homepages/~dilbert/foo – prefixes: Features are prefixes: e.g., /staff /staff/homepages /staff/homepages/~dilbert /staff/homepages/~dilbert/foo – Other variants: hosts and shingles M. Henzinger Web Information Retrieval 84
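A rough sketch of the feature-vector approach from the previous two slides, using full paths as features (the “paths” variant) and cosine similarity between hosts; the raw-count weighting is a placeholder, not the weighting used in [BBDH'99].

```python
import math
from collections import Counter

def path_features(urls):
    """Features = full paths of the host's URLs."""
    return Counter(url.split("/", 1)[1] if "/" in url else "" for url in urls)

def cosine(f1, f2):
    dot = sum(f1[t] * f2.get(t, 0) for t in f1)
    n1 = math.sqrt(sum(v * v for v in f1.values()))
    n2 = math.sqrt(sum(v * v for v in f2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

host_a = ["www.synthesis.org/Docs/myr5/cicee/bridge-gap.html",
          "www.synthesis.org/Docs/annual.report96.final.html"]
host_b = ["synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html",
          "synthesis.stanford.edu/Docs/annual.report96.final.html"]
print(cosine(path_features(host_a), path_features(host_b)))   # 1.0: identical path sets
```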
  • 85. Paths + connectivity (conn) l Take output from paths and filter thus: – Consider 10 common paths in sample with highest outdegree – Paths are equivalent if 90% of their combined out- edges are common to both – Keep host-pair if 75% of the paths are equivalent M. Henzinger Web Information Retrieval 85
• 86. Hostname connectivity
l Idea: Mirrors point to similar sets of other hosts
⇒ Feature vector approach to test similarity:
– features are hosts that are pointed to
– 2 different ways of feature weighting: hconn1, hconn2
M. Henzinger Web Information Retrieval 86
  • 87. Experiments l Input: 140 million URLs on 233,035 hosts + out- edges – Original 179 million URLs reduced by considering only hosts with at least 100 URLs in set l For each of the above pre-filtering algorithms: – Compute list of 25,000 (100,000) ranked pairs of potential mirrors – Test each pair of potential mirrors (testing step) and output list of mirrors Determine precision and relative recall M. Henzinger Web Information Retrieval 87
  • 88. Precision up to rank 25,000 M. Henzinger Web Information Retrieval 88
• 89. Relative recall up to rank 25,000
Plot: relative recall (y-axis) vs. rank (x-axis).
M. Henzinger Web Information Retrieval 89
• 90. Relative recall at 25,000 for combined output

           hosts  IP3   IP4   conn  hconn1 hconn2 paths prefix shingles
  hosts    17%
  IP3      39%    30%
  IP4      61%    58%   54%
  conn     58%    66%   80%   47%
  hconn1   40%    51%   69%   59%   26%
  hconn2   41%    52%   70%   60%   29%    27%
  paths    48%    59%   78%   55%   51%    52%    36%
  prefix   54%    61%   75%   65%   58%    58%    57%   44%
  shingles 53%    61%   75%   64%   57%    58%    57%   48%    44%

M. Henzinger Web Information Retrieval 90
  • 91. Combined approach (combined) l Combines top 100,000 results from hosts, IP4, paths, prefix, and hconn1. l Sort host pairs by: – Number of algorithms that return the host pair – Use best rank for any algorithm to break ties between host pairs l At rank 100,000: relative recall of 86%, precision of 57% M. Henzinger Web Information Retrieval 91
  • 92. Precision vs relative recall M. Henzinger Web Information Retrieval 92
• 93. Web host graph
l A node for each host h
l An undirected edge (h,h’) if h and h’ are output as mirrors
⇒ Each (connected) component gives a set of mirrors
M. Henzinger Web Information Retrieval 93
  • 94. Example of a component Protein Data Bank M. Henzinger Web Information Retrieval 94
• 95. Component size distribution
Histogram (log scale) of number of components by component size: 15,650 components of size 2, 2,204 of size 3, 553 of size 4, then 185, 95, 49, 26, 18, 13, 10, 6, 6, 4, and 3 for sizes 5 through 15. In total 43,491 of the 233,035 considered hosts are mirrored.
M. Henzinger Web Information Retrieval 95
  • 96. Coarse-grain duplicate filtering: Summary and open problems l Mirroring is common (43,491 mirrored hosts out of 233,035 considered hosts) – Load balancing, franchises/branding, virtual hosting, spam l Mirror detection based on non-content attributes is feasible. l [CSG’00] use page content similarity based approach. Open Problem: Compare and combine content and non- content techniques. l Open Problem: Assume you can choose which URLs to visit at a host. Determine best technique. M. Henzinger Web Information Retrieval 96
• 97. Algorithmic issues related to search engines
l Collecting documents:
– Priority
– Load balancing (internal, external)
– Trap avoidance
– Duplicate elimination
– …
l Processing and representing the data:
– Query-independent ranking
– Graph representation
– Index building
– Categorization
– …
l Processing queries:
– Query-dependent ranking
– Duplicate elimination
– Query refinement
– Clustering
– …
M. Henzinger Web Information Retrieval 97
• 98. Adding pages to the index
Crawling process (cycle):
l Get link at top of the queue of links to explore
l Fetch page
l Index page and parse its links
l Add new URLs to the queue
l Expired pages from the index are also added back to the queue
M. Henzinger Web Information Retrieval 98
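A toy sketch of this crawl cycle; fetch(), parse_links() and index() are placeholders for real components, and the FIFO queue stands in for whatever queuing discipline the next slide discusses.

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, index, max_pages=1000):
    queue = deque(seed_urls)          # queue of links to explore
    seen = set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()         # get link at top of queue
        page = fetch(url)             # download the page
        if page is None:
            continue
        index(url, page)              # add page to the index
        fetched += 1
        for link in parse_links(page):
            if link not in seen:      # add unseen URLs to the queue
                seen.add(link)
                queue.append(link)
```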
  • 99. Queuing discipline l Standard graph exploration: – Random – BFS – DFS (+ depth limits) l Goal: get “best” pages for a given index size – Priority based on query-independent ranking: • highest indegree [M’95] • highest potential PageRank [CGP’98] l Goal: keep index fresh – Priority based on rate of change [CLW’97] M. Henzinger Web Information Retrieval 99
• 100. Load balancing
l Internal -- cannot handle too much retrieved data simultaneously, but:
– Response time is unpredictable
– Size of answers is unpredictable
– There are additional system constraints (# threads, # open connections, etc.)
l External:
– Should not overload any server or connection
– A well-connected crawler can saturate the entire outside bandwidth of some small countries
– Any queuing discipline must be acceptable to the community
M. Henzinger Web Information Retrieval 100
  • 101. Web IR Tools General-purpose search engines l Hierarchical directories l Specialized search engines (dealing with heterogeneous data sources) l Search-by-example l Collaborative filtering l Meta-information M. Henzinger Web Information Retrieval 101
  • 102. Hierarchical directories Building of hierarchical directories: l Manual: Yahoo!, LookSmart, Open Directory l Automatic: – Populating of hierarchy [CDRRGK’98]: For each node in the hierarchy formulate fine-tuned query and run modified HITS algorithm – Categorization: For each document find “best” placement in the hierarchy. Techniques are connectivity and/or text based [CDI’98, …] M. Henzinger Web Information Retrieval 102
  • 103. Web IR Tools General-purpose search engines Hierarchical directories l Specialized search engines (dealing with heterogeneous data sources) – Shopping robots – Home page finder [SLE’97] – Applet finders –… l Search-by-example l Collaborative filtering l Meta-information M. Henzinger Web Information Retrieval 103
  • 104. Dealing with heterogeneous sources l Modern life problem: Given information sources with various capabilities, query all of them and combine the output. l Examples – Inter-business e-commerce e.g. www.industry.net – Meta search engines – Shopping robots l Issues – Determining relevant sources -- the “identification” problem – Merging the results -- the “fusion” problem M. Henzinger Web Information Retrieval 104
  • 105. Example: a shopping robot l Input: A product description in some form Find: Merchants for that product on the Web l Jango [DEW’97] – preprocessing: Store vendor URLs in database; learn for each vendor: • the URL of the search form • how to fill in the search form and • how the answer is returned – request processing: fill out form at every vendor and test whether the result is a success – range of products is predetermined M. Henzinger Web Information Retrieval 105
  • 106. Jango input example M. Henzinger Web Information Retrieval 106
  • 107. Jango output M. Henzinger Web Information Retrieval 107
  • 108. Direct query to K&L Wine M. Henzinger Web Information Retrieval 108
  • 109. What price decadence? M. Henzinger Web Information Retrieval 109
  • 110. Open problems l Going beyond the lowest common capability l Learning problem: automatic understanding of new interfaces l Efficient ways to determine which DB is relevant to a particular query l “Partial knowledge indexing”: indexer has limited access to the full DB M. Henzinger Web Information Retrieval 110
  • 111. Web IR Tools General-purpose search engines Hierarchical directories Specialized search engines (dealing with heterogeneous data sources) l Search-by-example l Collaborative filtering l Meta-information M. Henzinger Web Information Retrieval 111
• 112. Search-by-example
l Given: set S of URLs
l Find: URLs of similar pages
l Various approaches:
– Connectivity-based
– Usage-based: related pages are pages visited frequently after S
– Query refinement
l Related work: G’72, G’79, K’98, PP’97, S’97, CDI’98
M. Henzinger Web Information Retrieval 112
  • 113. Output from Google: related:www.ebay.com M. Henzinger Web Information Retrieval 113
  • 114. Output from Alexa: www.ebay.com M. Henzinger Web Information Retrieval 114
  • 115. Connectivity based solutions [DH’99] l Algorithm Companion l Algorithm Co-citation M. Henzinger Web Information Retrieval 115
  • 116. Algorithm Companion l Build modified neighborhood graph N. l Run modified HITS algorithm on N. Major Question: How to form neighborhood graph s.t. top returned pages are useful related pages M. Henzinger Web Information Retrieval 116
• 117. Building neighborhood graph N
l Node set: From URL u go ‘back’, ‘forward’, ‘back-forward’, and ‘forward-back’
l Edge set: Directed edge if there is a hyperlink between 2 nodes
l Apply refinements to N
M. Henzinger Web Information Retrieval 117
• 118. Refinement 1: Limit out-degree
l Motivation: Some nodes have high out-degree => graph would become too large
l Limit out-degree when going “forward”:
– Going forward from u: choose first 50 out-links on page
– Going forward from other nodes: choose 8 out-links surrounding the in-link traversed to reach u
M. Henzinger Web Information Retrieval 118
• 119. Co-citation algorithm
l Determine 2000 arbitrary back-nodes of u.
l Add to set S of siblings of u: for each back-node, the 8 forward-nodes surrounding the link to u
l If there is enough co-citation with u then
– return nodes in S in decreasing order of co-citations
else
– restart algorithm with one path element removed (http://…/X/Y/ -> http://…/X/)
M. Henzinger Web Information Retrieval 119
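A simplified sketch of the co-citation idea on this slide: collect pages that point to u, take a few of their other out-links near the link to u as siblings of u, and rank siblings by how often they are co-cited with u. The in_links/out_links maps are assumed to come from a connectivity server, and the window handling is a guess at "surrounding", not the exact rule of [DH'99].

```python
from collections import Counter

def cocitation_related(u, in_links, out_links, max_back=2000, near=8, top=10):
    siblings = Counter()
    for back in in_links.get(u, [])[:max_back]:
        outs = out_links.get(back, [])
        if u not in outs:
            continue
        i = outs.index(u)
        lo = max(0, i - near // 2)
        for sib in outs[lo:lo + near + 1]:   # out-links surrounding the link to u
            if sib != u:
                siblings[sib] += 1           # one co-citation with u
    return [page for page, _ in siblings.most_common(top)]
```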
  • 120. Alexa’s “What’s Related” l Uses: – Document Content – Usage Data – Connectivity l Removes path elements if no answer for u is found M. Henzinger Web Information Retrieval 120
• 121. User study
Bar chart: valuable pages within the 10 top answers, averaged over 59 queries (ALL) or 37 queries (INTERSECTION), comparing Alexa, Co-citation, and Companion.
M. Henzinger Web Information Retrieval 121
  • 122. Web IR Tools General-purpose search engines: Hierarchical directories Specialized search engines: Search-by-example l Collaborative filtering l Meta-information M. Henzinger Web Information Retrieval 122
• 123. Collaborative filtering
Diagram: user input and explicit preferences are collected; an analysis phase builds a model; a prediction phase uses the model to produce suggestions.
M. Henzinger Web Information Retrieval 123
  • 124. Lots of projects l Collaborative filtering seldom used in classic IR, big revival on the Web. Projects: – PHOAKS -- ATT labs → Web pages recommendation based on Usenet postings – GAB -- Bellcore → Web browsing – Grouplens -- U. Minnesota → Usenet newsgroups – EachToEach -- Compaq SRC → rating movies –… See http://sims.berkeley.edu/resources/collab/ M. Henzinger Web Information Retrieval 124
  • 125. Why do we care? l The ranking schemes that we discussed are also a form of collaborative ranking! – Connectivity = people vote with their links – Usage = people vote with their clicks l These schemes are used only for a global model building. Can it be combined with per-user data? Ideas: – Consider the graph induced by the user’s bookmarks. – Profile the user -- see www.globalbrain.net – Deal with privacy concerns! M. Henzinger Web Information Retrieval 125
  • 126. Web IR Tools General-purpose search engines: Hierarchical directories Specialized search engines: Search-by-example Collaborative filtering l Meta-information – Comparison of search engines – Log statistics M. Henzinger Web Information Retrieval 126
• 127. Comparison of search engines
Possible measures, from most useful for comparison but hardest to measure independently, down to least useful but easiest to measure:
– Ideal measure: user satisfaction
– Number of user requests
– Quality of search engine index
– Size of search engine index
M. Henzinger Web Information Retrieval 127
  • 128. Comparing Search Engine Sizes l Naïve Approaches – Get a list of URLs from each search engine and compare • Not practical or reliable. – Result Set Size Comparison • Reported sizes are approximate. –… l Better Approach – Statistical Sampling M. Henzinger Web Information Retrieval 128
  • 129. URL sampling l Ideal strategy: Generate a random URL and check for containment in each index. l Random URLs are hard to generate: – Random walks methods • Graph is directed • Stationary distribution is non-uniform • Must prove rapid mixing. – Pages in cache, query logs [LG’98a], etc. • Correlated to the interests of a particular group of users. l A simple way: collect all pages on the Web and pick one at random. M. Henzinger Web Information Retrieval 129
  • 130. Sampling via queries [BB’98] l Search engines have the best crawlers -- why not exploit them? l Method: – Sample from each engine in turn – Estimate the relative sizes of two search engines – Compute absolute sizes from a reference point M. Henzinger Web Information Retrieval 130
  • 131. Estimate relative sizes Select pages randomly from A (resp. B) Check if page contained in B (resp. A) |A ∩ B| ≈ (1/2) * |A| |A ∩ B| ≈ (1/6) * |B| ∴ |B| ≈ 3 * |A| Two steps: (i) Selecting (ii) Checking M. Henzinger Web Information Retrieval 131
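A tiny sketch of the relative-size arithmetic on this slide, assuming the two conditional containment fractions have been estimated by the sampling and checking steps described next.

```python
def relative_size(frac_of_A_in_B, frac_of_B_in_A):
    """If |A ∩ B| ≈ frac_of_A_in_B * |A| and |A ∩ B| ≈ frac_of_B_in_A * |B|,
    then |B| / |A| ≈ frac_of_A_in_B / frac_of_B_in_A."""
    return frac_of_A_in_B / frac_of_B_in_A

print(relative_size(1/2, 1/6))   # 3.0, i.e. |B| is about 3 times |A|
```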
• 132. Selecting a random page
l Generate random query
– Build a lexicon of words that occur on the Web
– Combine random words from lexicon to form queries
l Get the first 100 query results from engine A
l Select a random page out of the set
l Distribution is biased -- the conjecture is that
Σ over D ∈ A∩B of p(D)  /  Σ over D ∈ A of p(D)   ≈   |A ∩ B| / |A|
where p(D) is the probability that D is picked by this scheme
M. Henzinger Web Information Retrieval 132
  • 133. Checking if an engine has a page l Create a “unique query” for the page: – Use 8 rare words. – E.g., for the Digital Systems Research Center Home Page: M. Henzinger Web Information Retrieval 133
• 134. Results of the BB’98 study
Status as of July ’98:
l Web size: 350 million pages; growth: 25M pages/month
l Six largest engines cover about 2/3 of the Web; small overlap: 3M pages
l Estimated sizes in millions of pages (bar chart, with earlier estimates for Jun-97, Nov-97, Mar-98): Static Web 350, Union of engines 240, AltaVista 125, HotBot 100, Northern Light 75, Excite 45, Infoseek 37, Lycos 21
M. Henzinger Web Information Retrieval 134
• 135. Crawling strategies are different!
Exclusive listings in millions of pages (Jul-98): AltaVista 50, HotBot 45, Northern Light 20, Excite 13, Infoseek 8, Lycos 5
M. Henzinger Web Information Retrieval 135
• 136. Comparison of search engines
Possible measures, from most useful for comparison but hardest to measure independently, down to least useful but easiest to measure:
– Ideal measure: user satisfaction
– Number of user requests
– Quality of search engine index
– Size of search engine index
M. Henzinger Web Information Retrieval 136
• 137. Quality: A general definition [HHMN’99]
l Assign each page p a weight w(p) such that Σ over all p of w(p) = 1
⇒ Can be thought of as a probability distribution on pages
l Quality of a search engine index S is w(S) = Σ over p ∈ S of w(p)
l Example: If w is the same for all pages, weight is proportional to total size (in pages).
l Average page quality in index S is w(S)/|S|.
l We use: weight w(p) of a page p is its PageRank
M. Henzinger Web Information Retrieval 137
• 138. Estimating quality by sampling
Suppose we can choose random pages according to w (so that page p appears with probability w(p))
l Choose a sample of pages p1, p2, p3, …, pn
l Check if the pages are in search engine index S
l Estimate for quality of index S is the percentage of sampled pages that are in S, i.e.
ŵ(S) = (1/n) Σ over j of I[pj ∈ S]
where I[pj ∈ S] = 1 if pj is in S and 0 otherwise
M. Henzinger Web Information Retrieval 138
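A minimal sketch of this estimator; sample_page() and in_index() are placeholders for the weighted sampler and the containment check (e.g. via a "unique query" as in [BB'98]).

```python
def estimate_quality(sample_page, in_index, n=1000):
    hits = 0
    for _ in range(n):
        p = sample_page()    # page drawn with probability ~ w(p)
        if in_index(p):      # check whether the engine's index contains p
            hits += 1
    return hits / n          # estimate of w(S)
```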
  • 139. Missing pieces l Sample pages according to the PageRank distribution. l Test whether page p is in search engine index S. → same methodology as [BB’98] M. Henzinger Web Information Retrieval 139
• 140. Sampling pages (almost) according to PageRank
l Perform a random walk and select n random pages from it.
l Problems:
– Starting state bias: a finite walk only approximates PageRank.
– Can’t jump to a random page; instead, jump to a random page on a random host seen so far.
⇒ Samples pages according to a distribution that behaves similarly to PageRank, but is not identical to PageRank
M. Henzinger Web Information Retrieval 140
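A rough sketch of such a walk, under the stated restriction: with probability d the walk jumps to a random page on a random host seen so far (a true PageRank walk would jump to a uniformly random page on the Web, which is not possible), otherwise it follows a random out-link. fetch_outlinks() and host() are placeholders.

```python
import random

def random_walk(start, fetch_outlinks, host, d=1/7, steps=100000):
    visited_by_host = {host(start): [start]}
    page, trace = start, []
    for _ in range(steps):
        outlinks = fetch_outlinks(page)
        if random.random() < d or not outlinks:
            h = random.choice(list(visited_by_host))     # random host seen so far
            page = random.choice(visited_by_host[h])     # random page on that host
        else:
            page = random.choice(outlinks)               # follow a random out-link
        visited_by_host.setdefault(host(page), []).append(page)
        trace.append(page)
    return trace   # sample n pages from the trace to approximate the target distribution
```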
• 141. Experiments
l Performed two long random walks with jump probability d = 1/7, starting at www.yahoo.com

                                        Walk 1      Walk 2
  length                                18 hours    54 hours
  attempted downloads                   2,867,466   6,219,704
  HTML pages successfully downloaded    1,393,265   2,940,794
  unique HTML pages                     509,279     1,002,745
  sampled pages                         1,025       1,100

M. Henzinger Web Information Retrieval 141
  • 142. Random walk effectiveness l Pages (or hosts) that are “highly-reachable” are visited often by the random walks l Initial bias for www.yahoo.com is reduced in longer walk l Results are consistent over the 2 walks l The average indegree of pages with indegree <= 1000 is high: – 53 in walk 1 – 60 in walk 2 M. Henzinger Web Information Retrieval 142
• 143. Most frequently visited pages

  Page                                             Freq. Walk 2   Freq. Walk 1   Rank Walk 1
  www.microsoft.com/                               3172           1600           1
  www.microsoft.com/windows/ie/default.htm         2064           1045           3
  www.netscape.com/                                1991           876            6
  www.microsoft.com/ie/                            1982           1017           4
  www.microsoft.com/windows/ie/download/           1915           943            5
  www.microsoft.com/windows/ie/download/all.htm    1696           830            7
  www.adobe.com/prodindex/acrobat/readstep.htm     1634           780            8
  home.netscape.com/                               1581           695            10
  www.linkexchange.com/                            1574           763            9
  www.yahoo.com/                                   1527           1132           2

M. Henzinger Web Information Retrieval 143
• 144. Most frequently visited hosts

  Site                    Frequency Walk 2   Frequency Walk 1   Rank Walk 1
  www.microsoft.com       32452               16917              1
  home.netscape.com       23329               11084              2
  www.adobe.com           10884               5539               3
  www.amazon.com          10146               5182               4
  www.netscape.com        4862                2307               10
  excite.netscape.com     4714                2372               9
  www.real.com            4494                2777               5
  www.lycos.com           4448                2645               6
  www.zdnet.com           4038                2562               8
  www.linkexchange.com    3738                1940               12
  www.yahoo.com           3461                2595               7

M. Henzinger Web Information Retrieval 144
• 145. Results for index quality
Bar chart: index quality for Google (beta), AltaVista, HotBot, Excite, Infoseek, and Lycos, shown for Walk 1 and Walk 2, each with exact-match and host-match checking.
M. Henzinger Web Information Retrieval 145
• 146. Results for index quality/page
Bar chart: index quality per page for Google (beta), AltaVista, HotBot, Excite, Infoseek, and Lycos, shown for Walk 1 and Walk 2, each with exact-match and host-match checking.
M. Henzinger Web Information Retrieval 146
• 147. Insights from the data
l Our approach appears consistent over repeated tests
⇒ Random walks are a useful tool
l Quality is different from size for search engine indices
⇒ Some search engines are apparently trying to index high quality pages
M. Henzinger Web Information Retrieval 147
  • 148. Open problems l Random page generation via random walks l Cryptography based approach: want random pages from each engine but no cheating! (page should be chosen u.a.r. from the actual index) – Each search engine can commit to the set of pages it has without revealing it – Need to ensure that this set is the same as the set actually indexed – Need efficient oblivious protocol to obtain random page from search engine – See [NP‘98] for possible solution M. Henzinger Web Information Retrieval 148
  • 149. Web IR Tools General-purpose search engines: Hierarchical directories Specialized search engines: Search-by-example Collaborative filtering l Meta-information Comparison of search engines – Log statistics M. Henzinger Web Information Retrieval 149
• 150. How often do people view a page?
l Problems:
– Web caches interfere with click counting
– cheating pays (advertisers pay by the click)
l Solutions:
– naïve: force caches to re-fetch for every click • lots of traffic, annoyed Web users
– extend HTML with counters [ML’97] • requires compliance; down caches falsify results
– use sampling [P’97] • force refetches on random days • force refetches for random users and IP addresses
– cryptographic audit bureaus [NP’98a]
l Commercial providers: 100hot, Media Metrix, Relevant Knowledge, …
M. Henzinger Web Information Retrieval 150
• 151. Query log statistics [SHMM’98]
request = new query or new result screen of an old query
session = a series of requests by one user close together in time
l analyzed ~1B AltaVista requests consisting of:
– ~840 M non-empty requests
– ~575 M non-empty queries ➪ 1.5 requests per query on average
– ~153 M unique non-empty queries ➪ a query is repeated 3.8 times on average, but 64% of queries occur only once
– ~285 M user sessions ➪ 2.9 requests and 2.0 queries per session on average
M. Henzinger Web Information Retrieval 151
  • 152. Lots of things we didn’t even touch... l Clustering = group similar items (documents or queries) together ↔ unsupervised learning l Categorization = assign items to predefined categories ↔ supervised learning l Classic IR issues that are not substantially different in the Web context: – Latent semantic indexing -- associate “concepts” to queries and documents and match on concepts – Summarization: abstract the most important parts of text content. (See [TS’98] for the Web context) l Many others … M. Henzinger Web Information Retrieval 152
  • 153. Final conclusions l We talked mostly about IR methods and tools that – take advantage of the Web particularities – mitigate some of the difficulties l Web IR offers plenty of interesting problems… … but not on a silver platter l Almost every area of algorithms research is relevant l Great need for good algorithm engineering! M. Henzinger Web Information Retrieval 153
• 154. Acknowledgements
l An earlier version of this talk was created in collaboration with Andrei Broder and was presented at the 39th IEEE Symposium on Foundations of Computer Science 1998.
l Thanks for useful comments to Krishna Bharat, Moses Charikar, Edith Cohen, Jeff Dean, Uri Feige, Puneet Kumar, Hannes Marais, Michael Mitzenmacher, Prabhakar Raghavan, and Lyle Ramshaw.
M. Henzinger Web Information Retrieval 154