Web Crawling

   Carlos Castillo
   Center for Web Research
   Computer Science Department
   University of Chile
   www.cwr.cl

Outline

   Motivation
   Behavior of a crawler
      Selection policy
      Re-visit policy
      Politeness policy
      Parallelization policy
   Scheduling
      Short-term scheduling
      Long-term scheduling
      When to stop crawling
   Architecture
      History
      Classification
      Implementation
   Practical issues
   Summary
   References

An astronomer watching the sky

The problem of abundance

   5 exabytes of new information a year [Lyman and Varian, 2003] (1 exabyte = 10^18 bytes)
   Most directories no longer encourage administrators to submit their Web sites: they have to find the pages on their own
   Adversarial information retrieval

The bandwidth is expensive

   “Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained” [Edwards et al., 2001]

   The cost of a “complete” Web crawl is estimated at $1.5 million USD [Craswell et al., 2004], considering network usage only

Combination of policies

   Selection policy
   Re-visit policy
   Politeness policy
   Parallelization policy

It is necessary to prioritize

   No search engine indexes more than 16% of the Web [Lawrence and Giles, 2000]
   Download only the “important” pages
   Restrict to only a sub-domain
   Avoid spamming

Selection based on links

   Order by PageRank [Cho et al., 1998]
   Breadth-first search [Najork and Wiener, 2001]
   Focused crawling [Chakrabarti et al., 1999], attempting to infer the similarity of pages to a topic before downloading them

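One link-based ordering, breadth-first selection, can be sketched on a small in-memory link graph. This is only an illustration: the graph `toy_links`, the seed name and the function `bfs_order` are hypothetical, and a real crawler would download and parse pages rather than read a dictionary.

```python
from collections import deque


def bfs_order(links, seed):
    """Return pages in the order a breadth-first crawler would visit them.

    `links` maps each page name to the list of pages it links to
    (an assumed, in-memory stand-in for downloading and parsing).
    """
    seen = {seed}
    frontier = deque([seed])  # FIFO queue gives breadth-first order
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never schedule a page twice
                seen.add(target)
                frontier.append(target)
    return order


# Hypothetical five-page Web used only for this example.
toy_links = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["home"],
}
crawl_order = bfs_order(toy_links, "home")
print(crawl_order)
```

Replacing the FIFO queue with a priority queue keyed on an importance estimate would turn the same loop into an importance-ordered crawl.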
Events

   Creation, which requires a link
   Update, which can be either minor or major. Most of the changes are minor, but this is not easy to exploit
   Deletion, which is more damaging to the search engine’s reputation

Cost functions

   Freshness:

      \[ F_p(t) = \begin{cases} 1 & \text{if } p \text{ is not modified at time } t \\ 0 & \text{otherwise} \end{cases} \]

   Age:

      \[ A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified at time } t \\ t - \mathrm{lastmod}(p) & \text{otherwise} \end{cases} \]

   Depending on the cost function used, the behavior can be different

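A minimal sketch of the two cost functions, assuming numeric timestamps. The parameter names are hypothetical: `last_crawl` is when the local copy was downloaded and `last_mod` plays the role of lastmod(p). A page counts as modified at time t when it changed after the last crawl and no later than t.

```python
def is_modified(last_crawl, last_mod, t):
    """True if the page changed after our copy was taken, up to time t."""
    return last_crawl < last_mod <= t


def freshness(last_crawl, last_mod, t):
    """F_p(t): 1 while the local copy is current, 0 once it is stale."""
    return 0 if is_modified(last_crawl, last_mod, t) else 1


def age(last_crawl, last_mod, t):
    """A_p(t): 0 while current, otherwise time elapsed since the change."""
    return t - last_mod if is_modified(last_crawl, last_mod, t) else 0


# Copy taken at time 10, page changed at time 14, evaluated at time 20.
stale_freshness = freshness(10, 14, 20)
stale_age = age(10, 14, 20)
# Copy taken at time 10, last change was earlier (time 8).
current_freshness = freshness(10, 8, 20)
current_age = age(10, 8, 20)
print(stale_freshness, stale_age, current_freshness, current_age)
```

Freshness only records whether a copy is stale, while age keeps growing the longer a stale copy is left alone, which is why the two can favor different re-visit orders.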
Evolution of freshness and age

Estimating freshness and age

   Page changes can be modeled as a Poisson process [Brewington et al., 2000]
   Probability of a page still being up to date at time t after it was last downloaded:

      \[ P(F_p(t) = 1) = e^{-\lambda_p t} \]

   \lambda_p can be estimated using historical data, especially if the last-modification date is provided by the server [Cho and Garcia-Molina, 2003]

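The Poisson model can be illustrated with a short sketch. The rate estimator below is a deliberately naive assumption (observed changes divided by observation time); it is not the estimator of the cited papers, and the function names are hypothetical.

```python
import math


def prob_fresh(rate, t):
    """P(F_p(t) = 1) = exp(-rate * t) under the Poisson change model."""
    return math.exp(-rate * t)


def estimate_rate(num_changes, observation_span):
    """Naive rate estimate (an assumption of this sketch):
    changes observed per unit of time."""
    return num_changes / observation_span


# Hypothetical history: 4 changes seen over 28 days.
rate_per_day = estimate_rate(4, 28)
p_after_week = prob_fresh(rate_per_day, 7)
print(rate_per_day, p_after_week)
```

With one change per week on average, the chance that a copy is still current a week after download is e^{-1}, a little over a third.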
Web robots can be a threat

   They consume network resources
   They can cause server overload
   The robot exclusion protocol should be honored [Koster, 1996]
   The re-visiting period should be reasonable (what is reasonable?)

Robot exclusion

   Server exclusions

      Disallow: /cgi-bin

   Page exclusions

      <meta name="robots" content="noindex,nofollow,nocache">

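Both kinds of exclusion can be checked with the Python standard library, as in the sketch below. The robot name `ExampleBot`, the `example.com` URLs, the `User-agent: *` line and the class `RobotsMetaParser` are assumptions added for illustration; only the `Disallow` rule and the meta tag come from the slide.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Server exclusions: parse an in-memory robots.txt (no network access).
robots_txt = """\
User-agent: *
Disallow: /cgi-bin
"""
rules = RobotFileParser()
rules.parse(robots_txt.splitlines())
allowed_index = rules.can_fetch("ExampleBot", "http://www.example.com/index.html")
allowed_cgi = rules.can_fetch("ExampleBot", "http://www.example.com/cgi-bin/search")


class RobotsMetaParser(HTMLParser):
    """Collect the directives of <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


# Page exclusions: read the directives embedded in a page.
page = '<html><head><meta name="robots" content="noindex,nofollow,nocache"></head></html>'
meta = RobotsMetaParser()
meta.feed(page)
print(allowed_index, allowed_cgi, sorted(meta.directives))
```

A crawler would skip any URL for which `can_fetch` returns `False`, and would neither index the page nor follow its links when `noindex` and `nofollow` are present.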
Objectives

   Distribute the Web crawling
   Ideally, no central control point
   Reduce overhead due to communications
   Reduce overlap, ideally to zero

Types of policies

   Static assignment: typically a hash function on site names
   Dynamic assignment: more complicated to handle, usually requires central control

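Static assignment by hashing site names might look like the sketch below. The function name `assign_crawler`, the choice of SHA-1 and the example URLs are assumptions; any hash that is stable across processes would do (Python's built-in `hash()` is not, since it is randomized per run for strings).

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index using a hash of its site name.

    Every URL of the same site lands on the same crawler, so no
    coordination is needed to avoid overlap or to stay polite per site.
    """
    site = urlsplit(url).hostname or ""  # hostname is lower-cased by urlsplit
    digest = hashlib.sha1(site.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


first = assign_crawler("http://www.example.com/a.html", 4)
second = assign_crawler("http://WWW.example.com/b/c.html", 4)
print(first, second)
```

The price of this simplicity is that changing the number of crawlers reassigns most sites, which is one reason dynamic assignment is sometimes preferred despite needing central control.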
Problem separation

   Indexing, downloading, and distributed crawling are done in batches; this can be exploited to separate the problem
   Short-term scheduling: using the network resources efficiently
   Long-term scheduling: ordering the crawling process to download important pages first

Short-term scheduling

If B is the available bandwidth, then $B_p$, the downloading speed for page p, is

   $$B_p = \frac{S_p}{T^*}$$

where $T^*$ is the optimal time to use all of the available bandwidth:

   $$T^* = \frac{\sum_p S_p}{B}$$

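As a rough numeric illustration of the two formulas above, the sketch below computes T* and the per-page speeds B_p. It assumes that S_p denotes the size of page p in bytes and that B is given in bytes per second; the function name and the sample sizes are made up for the example.

```python
def optimal_schedule(page_sizes, bandwidth):
    """Return (T*, {page: B_p}) for the ideal full-bandwidth case.

    page_sizes: dict mapping a page id to its size S_p (assumed: bytes).
    bandwidth:  total available bandwidth B (assumed: bytes per second).
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # T* = (sum over p of S_p) / B
    t_star = sum(page_sizes.values()) / bandwidth
    # B_p = S_p / T*, so every page finishes exactly at time T*
    speeds = {p: size / t_star for p, size in page_sizes.items()}
    return t_star, speeds


# Hypothetical sample: three pages and 1000 bytes/s of bandwidth
sizes = {"a": 1000, "b": 3000, "c": 6000}
t_star, speeds = optimal_schedule(sizes, bandwidth=1000)
print(t_star)   # 10.0
print(speeds)   # {'a': 100.0, 'b': 300.0, 'c': 600.0}
```

Note that the speeds add up to B, which is what "using all of the available bandwidth" means in this idealized setting.
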
Full parallelization

   [Figure: full parallelization]

Full serialization

   [Figure: full serialization]

Realistic scenario

   [Figure: realistic scenario]

Number of active crawlers

   [Figure: number of active crawlers]

Objective

   Download “important” pages first

   Download X% of the top Y% pages

   Cumulative Pagerank vs fraction of the Web: total Pagerank is 1, random strategy should give a straight line

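The evaluation curve mentioned above can be sketched as follows. This is a minimal illustration, assuming the Pagerank of every page is already known and sums to 1; the function name and the four-page example are hypothetical.

```python
def cumulative_pagerank_curve(crawl_order, pagerank):
    """Return a list of (fraction_of_pages, cumulative_pagerank) points.

    crawl_order: list of page ids in the order they were downloaded.
    pagerank:    dict page id -> Pagerank, assumed to sum to 1.
    """
    total_pages = len(pagerank)
    points = [(0.0, 0.0)]
    accumulated = 0.0
    for i, page in enumerate(crawl_order, start=1):
        accumulated += pagerank[page]
        points.append((i / total_pages, accumulated))
    return points


# Hypothetical 4-page Web whose Pagerank values sum to 1
pr = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
best_first = sorted(pr, key=pr.get, reverse=True)
for fraction, cumulative in cumulative_pagerank_curve(best_first, pr):
    print(f"{fraction:.2f} {cumulative:.2f}")
```

A strategy that downloads important pages early produces a curve above the diagonal; a random ordering stays close to it on average.
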
Strategies

   Oracle with Pagerank

   Depth-first search

   Bigger sites first

   Partial Pagerank calculations

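Of the strategies listed, "bigger sites first" is probably the easiest to sketch. The code below is one plausible reading of it, not necessarily the exact method evaluated by the author: it always takes the next page from the site that currently has the most pending pages. The function name and URLs are hypothetical, and a real crawler would also add newly discovered links to the queues as it goes.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def bigger_sites_first(pending_urls):
    """Yield pending URLs, always taking the next one from the site
    with the most pending pages (ties go to the site seen first)."""
    queues = defaultdict(deque)
    for url in pending_urls:
        queues[urlsplit(url).netloc].append(url)
    while queues:
        # max() returns the first maximal key in insertion order
        site = max(queues, key=lambda s: len(queues[s]))
        yield queues[site].popleft()
        if not queues[site]:
            del queues[site]


pending = [
    "http://big.example/1",
    "http://small.example/1",
    "http://big.example/2",
    "http://big.example/3",
]
for url in bigger_sites_first(pending):
    print(url)
```
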
Comparison of strategies

   [Figure: comparison of strategies] [Castillo et al., 2004]

Distribution of visits per level

   [Figure: distribution of visits per level] [Baeza-Yates and Castillo, 2004]

Pagerank and depth

   Cumulative Pagerank by levels in the Chilean Web

   [Figure: cumulative Pagerank by levels]

Pagerank and depth

   Correlation of Pagerank and depth is low at deeper levels

   [Figure: Pagerank and depth]

First crawlers

   RBSE spider - size of the Web: 100,000 pages

   Internet Archive crawler - www.archive.org

   WebCrawler - first search engine powered by a Web crawler

   Pages were a scarce resource

Second generation

   Mercator, SPHINX - focused crawling

   Lycos, Excite, Google - large-scale crawling

   Parallel crawlers

   Problem of abundance

Standard architecture

   [Figure: standard architecture]

Different crawlers have different focus

   Different issues

   Quality: having “good resources”

   Representation: having complete copies

   Freshness: having updated copies

   A global-scale crawler tries to balance them all

Taxonomy of Web crawlers

   [Figure: taxonomy of Web crawlers]

Key operations

   Have I seen this URL?

   Have I seen this page (or a very similar one)?

   Which pages should I download next?

   Store this page

   Download this batch of pages

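The first two operations can be sketched with in-memory sets, as below. This is a simplified illustration under assumptions that are not in the slides: the class and function names are hypothetical, the URL normalization is deliberately minimal, and the page test only catches exact duplicates. Detecting "very similar" pages would need a near-duplicate technique such as shingling, which is not shown.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, default the path to "/", drop the fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class SeenTests:
    """In-memory 'have I seen this?' tests (a large crawl would need disk-backed storage)."""

    def __init__(self):
        self.urls = set()
        self.page_digests = set()

    def seen_url(self, url):
        """Return True if the URL was seen before; remember it otherwise."""
        key = normalize_url(url)
        if key in self.urls:
            return True
        self.urls.add(key)
        return False

    def seen_page(self, content):
        """Exact-duplicate test on page bytes; does not detect near-duplicates."""
        digest = hashlib.sha1(content).digest()
        if digest in self.page_digests:
            return True
        self.page_digests.add(digest)
        return False


seen = SeenTests()
print(seen.seen_url("HTTP://Example.COM"))       # False
print(seen.seen_url("http://example.com/#top"))  # True
```
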
The architecture needs to be highly optimized

   “While it is fairly easy to build a slow crawler that downloads a few
   pages per second for a short period of time, building a
   high-performance system that can download hundreds of millions of
   pages over several weeks presents a number of challenges in system
   design, I/O and network efficiency, and robustness and manageability”
   [Shkapenyuk and Suel, 2002].

Problems arise in large crawls

   Network and protocol problems
   Page contents problems
   Server problems

Network and protocol problems

   Variable quality of service
   Misconfigured firewalls
   Crashing DNS servers
   Wrong DNS servers pointing to good hosts

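One common way to cope with variable quality of service and failing DNS lookups is to retry a download a bounded number of times with increasing delays, and to record the failure instead of aborting the crawl. The sketch below is an assumption about how this could look, not something prescribed by the slides: `fetch_with_retries` and its parameters are hypothetical, and the actual download function is passed in by the caller.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on a network error retry with exponential backoff.

    Returns (result, None) on success, or (None, last_error) once all
    attempts have failed, so one bad host never aborts the whole crawl.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url), None
        except OSError as exc:  # DNS failures, timeouts, refused connections
            last_error = exc
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, 4x, ... before the next try.
                sleep(base_delay * (2 ** attempt))
    return None, last_error
```

In practice `fetch` could wrap `urllib.request.urlopen(url, timeout=...)`, whose errors are subclasses of `OSError`; the `sleep` parameter exists only so the delays can be replaced in tests.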
Server problems

   Responses lacking headers
   Fancy “error” pages
   “Deep Web” pages which could be accessible otherwise
   Embedded session-ids in URLs

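Embedded session-ids make the same page appear under many different URLs, which defeats the "have I seen this URL?" test. A minimal sketch of one possible countermeasure is to strip well-known session parameters before comparing URLs. The function name `strip_session_id` and the parameter list `SESSION_PARAMS` are assumptions made for illustration; a real crawl would need a list tuned to the sites it visits.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of session parameter names (compared in lower case).
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_id(url):
    """Return the URL without known session-id parameters or fragment."""
    parts = urlsplit(url)
    # Some servers append ";jsessionid=..." to the path instead of the query.
    path = parts.path
    idx = path.lower().find(";jsessionid=")
    if idx != -1:
        path = path[:idx]
    # Keep every query parameter that is not a known session id.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # The fragment is dropped as well, since it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))
```

Two URLs that differ only in their session-id then map to the same string and are downloaded once.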
Page contents problems

   High prevalence of duplicates
   Browsers are very tolerant
   Malformed markup
   Physical over logical formatting

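Because browsers are very tolerant, a crawler's parser has to accept malformed markup too. As a hedged illustration, Python's standard `html.parser.HTMLParser` keeps going over unclosed tags and unquoted attributes, so it can be used to sketch tolerant link extraction. The names `LinkExtractor` and `extract_links` are hypothetical and only `<a href>` links are handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href=...> tags, ignoring bad markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already lower-cased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the links found in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, a page with unclosed `<b>` and `<a>` elements and an unquoted `href` value still yields its links, which are then resolved against the URL the page was downloaded from.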
Summary

   Web crawling is studied at multiple levels
   Long-term scheduling, page selection
   Scalability, parallelization
   Practical issues, network usage

Open problems

   Scheduling using historical information
   Exploiting the Web’s structure
   Adversarial IR: Spam detection before downloading the pages

References

Baeza-Yates, R. and Castillo, C. (2004). Crawling the infinite Web: five levels are enough. In Proceedings of the third Workshop on Web Graphs (WAW), volume 3243 of Lecture Notes in Computer Science, pages 156–167, Rome, Italy. Springer.

Brewington, B., Cybenko, G., Stata, R., Bharat, K., and Maghoul, F. (2000). How dynamic is the web? In Proceedings of the Ninth Conference on World Wide Web, pages 257–276, Amsterdam, Netherlands.

Castillo, C., Marin, M., Rodriguez, A., and Baeza-Yates, R. (2004). Scheduling algorithms for Web crawling. In Latin American Web Conference (WebMedia/LA-WEB), Ribeirão Preto, Brazil. IEEE CS Press. (To appear).

Chakrabarti, S., van den Berg, M., and Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11–16):1623–1640.

Cho, J. and García-Molina, H. (2003). Estimating frequency of change. ACM Transactions on Internet Technology, 3(3).

Cho, J., García-Molina, H., and Page, L. (1998). Efficient crawling through URL ordering. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia.

Craswell, N., Crimmins, F., Hawking, D., and Moffat, A. (2004). Performance and cost tradeoffs in web search. In Proceedings of the 15th Australasian Database Conference, pages 161–169, Dunedin, New Zealand.

Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the Tenth Conference on World Wide Web, pages 106–113, Hong Kong. Elsevier Science.

Koster, M. (1996). A standard for robot exclusion. http://www.robotstxt.org/wc/exclusion.html.

Lawrence, S. and Giles, C. L. (2000). Accessibility of information on the web. Intelligence, 11(1):32–39.

Lyman, P. and Varian, H. R. (2003). How much information. http://www.sims.berkeley.edu/how-much-info-2003.

Najork, M. and Wiener, J. L. (2001). Breadth-first crawling yields high-quality pages. In Proceedings of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong. Elsevier Science.

Shkapenyuk, V. and Suel, T. (2002). Design and implementation of a high-performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 357–368, San Jose, California. IEEE CS Press.
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Web Crawling

  • 1. Web Crawling. Carlos Castillo, Center for Web Research, Computer Science Department, University of Chile, www.cwr.cl
  • 2. Outline: Motivation; Behavior of a crawler (selection policy, re-visit policy, politeness policy, parallelization policy); Scheduling (short-term scheduling, long-term scheduling, when to stop crawling); Architecture (history, classification, implementation); Practical issues; Summary; References
  • 3. An astronomer watching the sky
  • 4. The problem of abundance: 5 exabytes of new information a year [Lyman and Varian, 2003] (1 exabyte = 10^18 bytes). Most directories no longer encourage administrators to submit their Web sites: they have to find the page on their own. Adversarial information retrieval.
  • 7. The bandwidth is expensive: “Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained” [Edwards et al., 2001]. The cost of a “complete” Web crawl is estimated at $1.5 million USD [Craswell et al., 2004], considering network usage only.
  • 9. Combination of policies: Selection policy; Re-visit policy; Politeness policy; Parallelization policy
  • 13. It is necessary to prioritize: No search engine indexes more than 16% of the Web [Lawrence and Giles, 2000]. Download only the “important” pages. Restrict to only a sub-domain. Avoid spamming.
  • 14. Selection based on links: Order by Pagerank [Cho et al., 1998]. Breadth-first search [Najork and Wiener, 2001]. Focused crawling [Chakrabarti et al., 1999], attempting to infer similarity to pages before downloading them.
  • 15. Events: Creation, which requires a link. Update, which can be either minor or major; most of the changes are minor, but this is not easy to exploit. Deletion, which is more damaging to the search engine’s reputation.
  • 16. Cost functions. Freshness: $F_p(t) = \begin{cases} 1 & \text{if } p \text{ is not modified at time } t \\ 0 & \text{otherwise} \end{cases}$. Age: $A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified} \\ t - \mathrm{lastmod}(p) & \text{otherwise} \end{cases}$. Depending on the cost function used, the behavior can be different.
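The two cost functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: time is a plain number, the local copy is taken at time 0, and `first_change` (a hypothetical parameter, not from the slides) is the time of the first modification after the copy was taken, or `None` if the page never changed.

```python
from typing import Optional


def is_modified(first_change: Optional[float], t: float) -> bool:
    """True if the page changed at or before time t (after the local copy was taken)."""
    return first_change is not None and first_change <= t


def freshness(first_change: Optional[float], t: float) -> int:
    """F_p(t): 1 while the local copy is still current, 0 once the page has changed."""
    return 0 if is_modified(first_change, t) else 1


def age(first_change: Optional[float], t: float) -> float:
    """A_p(t): 0 while the copy is current, otherwise time elapsed since the change."""
    return t - first_change if is_modified(first_change, t) else 0.0


# A page copied at time 0 that changes at time 4 (hypothetical numbers).
for t in (3.0, 10.0):
    print(t, freshness(4.0, t), age(4.0, t))
```

Freshness only records whether the copy is stale, while age keeps growing the longer a stale copy is kept, which is why optimizing for one or the other can lead to different re-visit behavior.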
  • 18. Evolution of freshness and age
  • 19. Estimating freshness and age: Page changes can be modeled as a Poisson process [Brewington et al., 2000]. Under this model, the probability that page p is still unmodified a time t after it was last downloaded is $P(F_p(t) = 1) = e^{-\lambda_p t}$, where $\lambda_p$ is the change rate of page p. $\lambda_p$ can be estimated using historical data, especially if the last-modification date is provided by the server [Cho and Garcia-Molina, 2003].
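As a rough numeric illustration of the Poisson model, the sketch below estimates a change rate and evaluates the probability that a copy is still fresh. The estimator used here (detected changes divided by observation time) is a deliberately naive assumption for illustration, not the estimator proposed in the cited work, and the numbers are made up.

```python
import math


def estimate_rate(changes_detected: int, observation_time: float) -> float:
    """Naive estimate of lambda_p: detected changes per unit of time (an assumption)."""
    return changes_detected / observation_time


def prob_unmodified(rate: float, t: float) -> float:
    """P(F_p(t) = 1) = exp(-lambda_p * t) under the Poisson model."""
    return math.exp(-rate * t)


# Hypothetical history: 3 changes seen over 30 days, so lambda_p = 0.1 per day.
rate = estimate_rate(3, 30.0)
p_week = prob_unmodified(rate, 7.0)
print(f"lambda = {rate:.2f}/day, P(fresh after 7 days) = {p_week:.3f}")
```

The naive estimate undercounts changes when a page changes more than once between visits, which is one reason a server-provided last-modification date helps.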
  • 20. Web robots can be a threat: They consume network resources. They can cause server overload. The robot exclusion protocol should be honored [Koster, 1996]. The re-visiting period should be reasonable (what is reasonable?).
  • 21. Robot exclusion: Server exclusions, e.g. Disallow: /cgi-bin; page exclusions, e.g. <meta name="robots" content="noindex,nofollow,nocache">
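Both kinds of exclusion can be checked with the Python standard library. The sketch below is illustrative only: the robots.txt body is inlined instead of being fetched, and the crawler name `example-crawler`, the host `www.example.com`, and the class `RobotsMetaParser` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Server exclusion: a robots.txt body, inlined here instead of being downloaded.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin
"""
rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

allowed = rules.can_fetch("example-crawler", "http://www.example.com/index.html")
blocked = rules.can_fetch("example-crawler", "http://www.example.com/cgi-bin/search")


class RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",") if d.strip()}


# Page exclusion: directives embedded in the page itself.
meta = RobotsMetaParser()
meta.feed('<meta name="robots" content="noindex,nofollow,nocache">')

print(allowed, blocked, sorted(meta.directives))
```

The server-level check can be made before a page is requested, whereas the page-level directives are only known after the page has been downloaded and parsed.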
  • 23. Objectives: Distribute the Web crawling. Ideally, no central control point. Reduce overhead due to communications. Reduce overlap, ideally to zero.
  • 24. Types of policies: Static assignment: typically a hash function on site names. Dynamic assignment: more complicated to handle, usually requires central control.
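A static assignment can be sketched as a stable hash of the site name taken modulo the number of crawling processes. The function name `assign_crawler` and the choice of SHA-1 are assumptions made for this illustration; any hash that is stable across machines would serve.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index using a stable hash of its site (host) name."""
    site = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(site.encode("utf-8")).digest()
    # Python's built-in hash() is randomized per process, so a fixed digest is used.
    return int.from_bytes(digest[:8], "big") % num_crawlers


urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/about.html",
    "http://www.example.org/",
]
for url in urls:
    print(assign_crawler(url, 4), url)
```

Because the hash depends only on the site name, every process can compute the owner of any URL locally, so no central control point is needed and all pages of one site stay with one crawler.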
  • 25. Problem separation: Indexing, downloading, and distributed crawling are done in batches, which can be exploited to separate the problem. Short-term scheduling: using the network resources efficiently. Long-term scheduling: ordering the crawling process to download important pages first.
  • 27. Short-term scheduling: If $B$ is the bandwidth available, then $B_p$, the downloading speed for page $p$, is $B_p = \frac{S_p}{T^*}$, where $T^*$ is the optimal time to use all of the available bandwidth: $T^* = \frac{\sum_p S_p}{B}$.
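A small worked example of these two formulas, reading $S_p$ as the size of page $p$ (an assumption; the slide does not define it) and using made-up sizes and bandwidth:

```python
def optimal_time(sizes, bandwidth):
    """T* = (sum of S_p) / B: time needed if the whole bandwidth is used throughout."""
    return sum(sizes.values()) / bandwidth


def page_speeds(sizes, bandwidth):
    """B_p = S_p / T*: per-page speed such that every download ends exactly at T*."""
    t_star = optimal_time(sizes, bandwidth)
    return {page: size / t_star for page, size in sizes.items()}


# Hypothetical batch: page sizes in KB, bandwidth in KB/s.
sizes = {"p1": 10.0, "p2": 30.0, "p3": 60.0}
bandwidth = 50.0

t_star = optimal_time(sizes, bandwidth)   # 100 KB / 50 KB/s = 2 s
speeds = page_speeds(sizes, bandwidth)    # 5, 15 and 30 KB/s
print(t_star, speeds, sum(speeds.values()))
```

The per-page speeds add up to the available bandwidth, so under this idealized allocation all pages finish at the same moment and no capacity is left idle.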
  • 28. Full parallelization
  • 29. Full serialization
  • 30. Realistic scenario
  • 31. Number of active crawlers
  • 32. Objective: Download “important” pages first. Download X% of the top Y% pages. Cumulative Pagerank vs. fraction of the Web: total Pagerank is 1, so a random strategy should give a straight line.
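The cumulative-Pagerank curve used to compare strategies can be computed as below. This is a toy sketch: the four pages and their Pagerank values are invented, and the download order shown is the ideal one that assumes Pagerank is known in advance.

```python
from itertools import accumulate


def cumulative_pagerank(pagerank, download_order):
    """Return (fraction of pages downloaded, cumulative Pagerank) after each download."""
    total_pages = len(pagerank)
    running = accumulate(pagerank[url] for url in download_order)
    return [((i + 1) / total_pages, score) for i, score in enumerate(running)]


# Hypothetical Pagerank values that sum to 1.
pagerank = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}

# Ideal ordering: highest Pagerank first.
oracle_order = sorted(pagerank, key=pagerank.get, reverse=True)
curve = cumulative_pagerank(pagerank, oracle_order)

for fraction, score in curve:
    print(f"{fraction:.2f} of pages -> {score:.2f} of Pagerank")
```

A strategy is better the further its curve lies above the diagonal that a random ordering would give in expectation.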
  • 34. Strategies: Oracle with Pagerank; Depth-first search; Bigger sites first; Partial Pagerank calculations
  • 38. Comparison of strategies [Castillo et al., 2004]
  • 39. Distribution of visits per level [Baeza-Yates and Castillo, 2004]
  • 40. Pagerank and depth: Cumulative Pagerank by levels in the Chilean Web
  • 41. Pagerank and depth: Correlation of Pagerank and depth is low at deeper levels
  • 42. First crawlers: RBSE spider (size of the Web: 100,000 pages). Internet Archive crawler (www.archive.org). WebCrawler, the first search engine powered by a Web crawler. Pages were a scarce resource.
  • 44. Second generation: Mercator, SPHINX (focused crawling). Lycos, Excite, Google (large-scale crawling). Parallel crawlers. Problem of abundance.
  • 46. Standard architecture
  • 47. Different crawlers have different focus. Different issues: Quality: having “good resources”. Representation: having complete copies. Freshness: having updated copies. A global-scale crawler tries to balance them all.
  • 48. Taxonomy of Web crawlers
  • 49. Key operations: Have I seen this URL? Have I seen this page (or a very similar one)? Which pages should I download next? Store this page. Download this batch of pages.
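These operations map naturally onto a few in-memory data structures. The class below is a toy sketch under strong assumptions: everything fits in memory, "very similar" pages are reduced to byte-identical pages (near-duplicate detection would need a different fingerprint), and the class and method names are hypothetical.

```python
import hashlib
import heapq


class TinyCrawlerState:
    """In-memory stand-in for the structures behind a crawler's key operations."""

    def __init__(self):
        self.seen_urls = set()          # "Have I seen this URL?"
        self.seen_fingerprints = set()  # "Have I seen this page?" (exact copies only)
        self.frontier = []              # heap of (-priority, url)
        self.store = {}                 # url -> content

    def add_url(self, url, priority):
        """Queue a URL unless it was seen before; return True if it was queued."""
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        heapq.heappush(self.frontier, (-priority, url))
        return True

    def next_batch(self, size):
        """'Which pages should I download next?': the highest-priority URLs."""
        batch = []
        while self.frontier and len(batch) < size:
            batch.append(heapq.heappop(self.frontier)[1])
        return batch

    def store_page(self, url, content):
        """Store a page unless identical content is already stored; return True if stored."""
        fingerprint = hashlib.sha256(content).hexdigest()
        if fingerprint in self.seen_fingerprints:
            return False
        self.seen_fingerprints.add(fingerprint)
        self.store[url] = content
        return True


state = TinyCrawlerState()
state.add_url("http://a.example/", 0.9)
state.add_url("http://b.example/", 0.5)
state.add_url("http://c.example/", 0.7)
print(state.next_batch(2))
```

At the scale of hundreds of millions of pages none of these sets fits in memory, which is where the need for a highly optimized architecture comes from.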
  • 54. The architecture needs to be highly optimized: “While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability” [Shkapenyuk and Suel, 2002].
  • 55. Problems arise in large crawls: Network and protocol problems; Page contents problems; Server problems
  • 56. Network and protocol problems: Variable quality of service; Misconfigured firewalls; Crashing DNS servers; Wrong DNS servers pointing to good hosts
  • 57. Server problems: Responses lacking headers; Fancy “error” pages; “Deep Web” pages which could be accessible otherwise; Embedded session-ids in URLs
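One common way to cope with session ids embedded in URLs is to normalize each URL before the "seen" check. The sketch below handles only session ids passed as query parameters, and the parameter names in `SESSION_PARAMS` are an assumed, illustrative list rather than anything taken from the slides.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of session parameters; real sites use many variants.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_ids(url: str) -> str:
    """Drop query parameters that look like session ids, keeping everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_session_ids("http://www.example.com/list?page=2&PHPSESSID=abc123"))
```

Without such a step, the same page reached under two different session ids looks like two different URLs and is downloaded twice.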
  • 58. Page contents problems Web Crawling Carlos Castillo Outline Motivation Behavior of a crawler Selection policy Re-visit policy Politeness policy Parallelization policy High prevalence of duplicates Scheduling Short-term scheduling Browsers are very tolerant Long-term scheduling When to stop crawling Malformed markup Architecture History Classification Physical over logical formatting Implementation Practical issues Summary References
  • 59. Summary: Web crawling is studied at multiple levels: long-term scheduling, page selection; scalability, parallelization; practical issues, network usage.
  • 63. Open problems: scheduling using historical information; exploiting the Web’s structure; adversarial IR: spam detection before downloading the pages.
  • 66. References:
      Baeza-Yates, R. and Castillo, C. (2004). Crawling the infinite Web: five levels are enough. In Proceedings of the third Workshop on Web Graphs (WAW), volume 3243 of Lecture Notes in Computer Science, pages 156–167, Rome, Italy. Springer.
      Brewington, B., Cybenko, G., Stata, R., Bharat, K., and Maghoul, F. (2000). How dynamic is the web? In Proceedings of the Ninth Conference on World Wide Web, pages 257–276, Amsterdam, Netherlands.
      Castillo, C., Marin, M., Rodriguez, A., and Baeza-Yates, R. (2004). Scheduling algorithms for Web crawling. In Latin American Web Conference (WebMedia/LA-WEB), Ribeirão Preto, Brazil. IEEE CS Press. (To appear).
  • 67. References:
      Chakrabarti, S., van den Berg, M., and Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11–16):1623–1640.
      Cho, J. and Garcia-Molina, H. (2003). Estimating frequency of change. ACM Transactions on Internet Technology, 3(3).
      Cho, J., García-Molina, H., and Page, L. (1998). Efficient crawling through URL ordering. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia.
  • 68. References:
      Craswell, N., Crimmins, F., Hawking, D., and Moffat, A. (2004). Performance and cost tradeoffs in web search. In Proceedings of the 15th Australasian Database Conference, pages 161–169, Dunedin, New Zealand.
      Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the Tenth Conference on World Wide Web, pages 106–113, Hong Kong. Elsevier Science.
      Koster, M. (1996). A standard for robot exclusion. http://www.robotstxt.org/wc/exclusion.html
      Lawrence, S. and Giles, C. L. (2000). Accessibility of information on the web. Intelligence, 11(1):32–39.
  • 69. References:
      Lyman, P. and Varian, H. R. (2003). How much information. http://www.sims.berkeley.edu/how-much-info-2003
      Najork, M. and Wiener, J. L. (2001). Breadth-first crawling yields high-quality pages. In Proceedings of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong. Elsevier Science.
      Shkapenyuk, V. and Suel, T. (2002). Design and implementation of a high-performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 357–368, San Jose, California. IEEE CS Press.