Outline            Motivation             Algorithms       Experiments      Summary              References




                   Scheduling Algorithms for Web Crawling

               C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates

                                             Center for Web Research
                                                   www.cwr.cl


                                                LA-WEB 2004



C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates                   Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling




      Motivation


      Algorithms


      Experiments


      Summary


      References








Motivation



              Web search generates more than 13% of the traffic to Web
              sites [StatMarket, 2003].
              No search engine indexes more than one third of the publicly
              available Web [Lawrence and Giles, 1998].
              If we cannot download all of the pages, we should at least
              download the most “important” ones.








The problem of Web crawling


      We must download pages with sizes given by $P_i$, over a connection
      of bandwidth $B$. Trivial solution: we download all the pages
      simultaneously at a speed proportional to the size of each page:

          $$B_i = \frac{P_i}{T^*}$$

      $T^*$ is the optimal time to use all the available bandwidth:

          $$T^* = \frac{\sum_i P_i}{B}$$
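
      A small worked example of the trivial solution above. The page sizes
      and the bandwidth are hypothetical values chosen for easy arithmetic,
      not figures from the talk.

```python
# Illustrative page sizes in bytes and crawler bandwidth in bytes/second
# (hypothetical values, not from the talk).
page_sizes = [40_000, 10_000, 50_000]
bandwidth = 20_000


def optimal_time(sizes, total_bandwidth):
    """T* = (sum of P_i) / B: time to fetch everything at full bandwidth."""
    return sum(sizes) / total_bandwidth


def per_page_speeds(sizes, total_bandwidth):
    """B_i = P_i / T*: every page finishes exactly at time T*."""
    t_star = optimal_time(sizes, total_bandwidth)
    return [size / t_star for size in sizes]


t_star = optimal_time(page_sizes, bandwidth)
speeds = per_page_speeds(page_sizes, bandwidth)
print(t_star)  # 5.0
print(speeds)  # [8000.0, 2000.0, 10000.0]
```

      The per-page speeds add up to the full bandwidth, which is why no
      schedule can finish earlier than this time.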








Optimal scenario








Restrictions



              Robot exclusion protocol [Koster, 1995]
              Waiting time ≈ 10–30 seconds
              Web site bandwidth $B_i^{MAX}$ lower than the crawler bandwidth $B$
              Distribution of Web site sizes is very skewed
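
      A minimal sketch of how a crawler might honour the first two
      restrictions, using Python's standard urllib.robotparser. The
      robots.txt content, the agent name "example-crawler" and the 15-second
      default are assumptions for illustration. Crawl-delay is a
      non-standard directive that only some sites provide.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 15
"""

DEFAULT_WAIT = 15.0  # seconds; assumed value inside the 10-30 s range

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="example-crawler"):
    """True if the robot exclusion rules permit fetching this URL."""
    return parser.can_fetch(agent, url)


def wait_time(agent="example-crawler"):
    """Seconds to wait between accesses to the site."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else DEFAULT_WAIT


print(allowed("http://example.com/index.html"))         # True
print(allowed("http://example.com/private/data.html"))  # False
print(wait_time())                                      # 15.0
```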








Distribution of site sizes








Realistic scenario








Number of active robots in a batch








Goal




      If each page has a certain score, capture most of the total value of
      this score by downloading just a fraction of the pages.
      We will use the total Pagerank of the downloaded set vs. the
      fraction of downloaded pages as a measure of quality.
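
      The quality measure can be sketched as a curve of cumulative score
      against the fraction of pages downloaded. The scores and crawl orders
      below are hypothetical; the function name is illustrative only.

```python
def cumulative_score_curve(crawl_order, score):
    """Return (fraction of pages, fraction of total score) after each download."""
    total = sum(score.values())
    n = len(score)
    curve = []
    acc = 0.0
    for i, page in enumerate(crawl_order, start=1):
        acc += score[page]
        curve.append((i / n, acc / total))
    return curve


# Hypothetical scores (e.g. Pagerank values) for four pages.
score = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
good = cumulative_score_curve(["a", "b", "c", "d"], score)
bad = cumulative_score_curve(["d", "c", "b", "a"], score)
print(good[1])  # (0.5, 0.75)
print(bad[1])   # (0.5, 0.25)
```

      After half of the pages, the first ordering has captured 75% of the
      total score and the second only 25%; a better strategy gives a curve
      that rises faster.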








Algorithms




      Algorithms are based on a scheduler with two levels of queues:
              Queue of Web sites
              Queue of Web pages in each Web site
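
      A minimal sketch of a two-level scheduler, assuming each site must
      wait a fixed time between accesses and that the site-level choice
      follows the best pending page. The class, its methods and the site
      names are hypothetical, not the implementation used in the talk.

```python
import heapq
from collections import defaultdict


class TwoLevelScheduler:
    """Queue of Web sites, each holding its own queue of pages."""

    def __init__(self, wait_seconds):
        self.wait = wait_seconds
        self.pages = defaultdict(list)  # site -> heap of (-priority, url)
        self.next_ok = {}               # site -> earliest next access time

    def add_page(self, site, url, priority):
        heapq.heappush(self.pages[site], (-priority, url))
        self.next_ok.setdefault(site, 0.0)

    def next_page(self, now):
        """Pick the best page among sites whose waiting time has elapsed."""
        ready = [s for s, q in self.pages.items() if q and self.next_ok[s] <= now]
        if not ready:
            return None
        # Site-level choice: the site whose best pending page ranks highest.
        site = min(ready, key=lambda s: self.pages[s][0])
        _, url = heapq.heappop(self.pages[site])
        self.next_ok[site] = now + self.wait
        return site, url


sched = TwoLevelScheduler(wait_seconds=15)
sched.add_page("a.cl", "http://a.cl/", 0.9)
sched.add_page("a.cl", "http://a.cl/x", 0.8)
sched.add_page("b.cl", "http://b.cl/", 0.5)
first = sched.next_page(now=0)    # ('a.cl', 'http://a.cl/')
second = sched.next_page(now=1)   # ('b.cl', 'http://b.cl/'): a.cl is still waiting
third = sched.next_page(now=2)    # None: both sites are waiting
fourth = sched.next_page(now=15)  # ('a.cl', 'http://a.cl/x')
```

      With few sites in the queue the scheduler often has nothing it is
      allowed to download, even though pages are pending.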








Queues used for the scheduling








Algorithms based on Pagerank


              Optimal/Oracle: crawler asks for the Pagerank value of each
              page in the frontier using an “Oracle”. This is not available in
              a real crawl as we do not have the entire graph
              The average relative error for estimating the Pagerank four
              months ahead is about 78% [Cho and Adams, 2004], so
              historical information from previous crawls is not too useful
              Batch-Pagerank: Pagerank calculations are executed over
              the subset of known pages [Cho et al., 1998]
              Partial-Pagerank: a “temporary” Pagerank value is assigned
              to the pages in between batch-Pagerank calculations
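
      A sketch of a batch-Pagerank step over the known subgraph, using
      plain power iteration. The damping factor 0.85, the iteration count
      and the uniform treatment of pages without known out-links are
      assumptions, as is the toy graph; the talk does not specify them.

```python
def batch_pagerank(links, damping=0.85, iterations=50):
    """Power-iteration Pagerank restricted to the pages known so far.

    links maps each downloaded page to the pages it links to. Link targets
    that have not been downloaded yet (the frontier) are included as nodes
    without out-links.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no known out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p, targets in links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical known graph: "archive" has been seen but not downloaded.
known = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "archive"],
}
rank = batch_pagerank(known)
best = max(rank, key=rank.get)
print(best)  # home
```

      Frontier pages such as "archive" receive a score from the pages that
      link to them, so the frontier can be ordered before any of it is
      downloaded.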







Algorithms not based on Pagerank



              Depth: pages are given a priority based on their depth. This
              is graph traversal in breadth-first ordering
              [Najork and Wiener, 2001]
              Length: pages from the Web sites which seem to be bigger
              are crawled first. We do not know which Web sites are really
              the bigger ones until the end of the crawl, so we use partial
              information
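
      Both strategies can be sketched in a few lines. The graph and site
      names are hypothetical, and measuring how big a site "seems" by the
      number of pages discovered so far is an assumption about the Length
      strategy, not a detail stated on the slide.

```python
from collections import deque


def page_depths(links, home):
    """Breadth-first traversal from the home page; depth = links followed."""
    depth = {home: 0}
    queue = deque([home])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth, order


def largest_known_site(pending_pages):
    """Length sketch: pick the site with the most pages discovered so far."""
    return max(pending_pages, key=lambda site: len(pending_pages[site]))


# Hypothetical link graph of one site and pending pages of two sites.
links = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/a/1", "/b/1"]}
depth, order = page_depths(links, "/")
print(order)  # ['/', '/a', '/b', '/a/1', '/b/1']

pending = {"a.cl": ["/1", "/2", "/3"], "b.cl": ["/1"]}
print(largest_known_site(pending))  # a.cl
```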








Experiments



              Download a sample of pages using the WIRE crawler
              [Baeza-Yates and Castillo, 2002]
              3.5 million pages from over 50,000 Web sites in .CL
              At most 25,000 pages from each Web site
              Strategies are simulated on a graph built using actual data
              Simulation includes: bandwidth saturation, network speed of
              different Web sites, page sizes, waiting time, latency, etc.








Simulation parameters



              Algorithm
              Waiting time between pages from the same Web site: w
              Number of pages downloaded per connection when re-using
              the HTTP connection: k
              Number of robots: r
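
      A rough sketch of how w and k interact, assuming one wait of w seconds
      between consecutive connections to the same site and ignoring transfer
      time. The dataclass, the function name and the 15-second wait are
      hypothetical; 25,000 is the per-site page cap used in the experiments.

```python
import math
from dataclasses import dataclass


@dataclass
class SimulationParams:
    algorithm: str  # scheduling strategy name, e.g. "length"
    w: float        # waiting time in seconds between accesses to the same site
    k: int          # pages downloaded per HTTP connection
    r: int          # number of robots


def min_waiting_time(site_pages, params):
    """Lower bound on the total waiting time needed to drain one site.

    Assumes one wait of w seconds between consecutive connections and
    ignores the time spent transferring pages.
    """
    connections = math.ceil(site_pages / params.k)
    return max(connections - 1, 0) * params.w


one_per_conn = SimulationParams("length", w=15.0, k=1, r=1)
hundred_per_conn = SimulationParams("length", w=15.0, k=100, r=1)
print(min_waiting_time(25_000, one_per_conn))      # 374985.0 seconds, over 4 days
print(min_waiting_time(25_000, hundred_per_conn))  # 3735.0 seconds, about 1 hour
```

      Under these assumptions the largest sites bound the total crawl time,
      no matter how many robots are running.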








Results with one robot








Results with many robots








Speed-ups with the “Length” strategy








Crawling the real Web using the “Length” strategy








Pagerank vs day of crawl








Depth is not correlated with Pagerank
      When depth is ≥ 2 links from the home page








Summary



              The restrictions, especially waiting time, create a difficult
              problem for scheduling
              A strategy with an “oracle” was too greedy
              We try to keep Web sites in the frontier for as long as
              possible, so we always have several Web sites to choose from
              Simulation ensures the same conditions, which is critical
              because the Web is very dynamic








Open problems




              Scheduling using historical information
              Exploiting the Web’s structure
              Adversarial IR: Spam detection before downloading the pages




References




             Baeza-Yates, R. and Castillo, C. (2002).
             Balancing volume, quality and freshness in web crawling.
             In Soft Computing Systems - Design, Management and
             Applications, pages 565–572, Santiago, Chile. IOS Press
             Amsterdam.
             Cho, J. and Adams, R. (2004).
             Page quality: In search of an unbiased Web ranking.
             Technical report, UCLA Computer Science.
             Cho, J., García-Molina, H., and Page, L. (1998).
             Efficient crawling through URL ordering.
             In Proceedings of the seventh conference on World Wide Web,
             Brisbane, Australia.
             Koster, M. (1995).
             Robots in the web: threat or treat?
             ConneXions, 9(4).
             Lawrence, S. and Giles, C. L. (1998).
             Searching the World Wide Web.
             Science, 280(5360):98–100.
             Najork, M. and Wiener, J. L. (2001).
             Breadth-first crawling yields high-quality pages.
             In Proceedings of the Tenth Conference on World Wide Web,
             pages 114–118, Hong Kong. Elsevier Science.
             StatMarket (2003).
             Search engine referrals nearly double worldwide.
             http://websidestory.com/pressroom/pressreleases.html?id=181.




C. Castillo, M. Marin, A. Rodr´
                              ıguez and R. Baeza-Yates                 Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline            Motivation             Algorithms     Experiments      Summary              References




C. Castillo, M. Marin, A. Rodr´
                              ıguez and R. Baeza-Yates                 Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline            Motivation             Algorithms     Experiments      Summary              References




C. Castillo, M. Marin, A. Rodr´
                              ıguez and R. Baeza-Yates                 Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling

Weitere ähnliche Inhalte

Was ist angesagt?

Spoofing
SpoofingSpoofing
Spoofing
Sanjeev
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
Thanveen
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
silambu111
 

Was ist angesagt? (20)

11 Computer Privacy
11 Computer Privacy11 Computer Privacy
11 Computer Privacy
 
Biometric
BiometricBiometric
Biometric
 
Intrusion detection
Intrusion detectionIntrusion detection
Intrusion detection
 
The impact of web on ir
The impact of web on irThe impact of web on ir
The impact of web on ir
 
Hacking ppt
Hacking pptHacking ppt
Hacking ppt
 
Spoofing
SpoofingSpoofing
Spoofing
 
Ethical hacking
Ethical hackingEthical hacking
Ethical hacking
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
Cyber Hygiene
Cyber HygieneCyber Hygiene
Cyber Hygiene
 
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I PPT IN PDF
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I  PPT  IN PDFCS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I  PPT  IN PDF
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I PPT IN PDF
 
Introduction to foot printing
Introduction to foot printingIntroduction to foot printing
Introduction to foot printing
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
 
Inetsecurity.in Ethical Hacking presentation
Inetsecurity.in Ethical Hacking presentationInetsecurity.in Ethical Hacking presentation
Inetsecurity.in Ethical Hacking presentation
 
12 security policies
12 security policies12 security policies
12 security policies
 
User authentication
User authenticationUser authentication
User authentication
 
Digital Forensic: Brief Intro & Research Challenge
Digital Forensic: Brief Intro & Research ChallengeDigital Forensic: Brief Intro & Research Challenge
Digital Forensic: Brief Intro & Research Challenge
 
Usable Security: When Security Meets Usability
Usable Security: When Security Meets UsabilityUsable Security: When Security Meets Usability
Usable Security: When Security Meets Usability
 
Corporate threat vector and landscape
Corporate threat vector and landscapeCorporate threat vector and landscape
Corporate threat vector and landscape
 
Cyber Space
Cyber SpaceCyber Space
Cyber Space
 
Internet anonymity and privacy
Internet anonymity and privacyInternet anonymity and privacy
Internet anonymity and privacy
 

Scheduling Algorithms for Web Crawling

  • 1. Scheduling Algorithms for Web Crawling. C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates. Center for Web Research, www.cwr.cl. LA-WEB 2004.
  • 2. Outline: Motivation, Algorithms, Experiments, Summary, References
  • 3. Motivation
      • Web search generates more than 13% of the traffic to Web sites [StatMarket, 2003].
      • No search engine indexes more than one third of the publicly available Web [Lawrence and Giles, 1998].
      • If we cannot download all of the pages, we should at least download the most “important” ones.
  • 6. The problem of Web crawling
      • We must download pages with sizes given by $P_i$, over a connection of bandwidth $B$.
      • Trivial solution: we download all the pages simultaneously, each at a speed proportional to its size: $B_i = P_i / T^*$
      • $T^*$ is the optimal time to use all the available bandwidth: $T^* = \sum_i P_i / B$
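As a quick check of the two formulas on slide 6, the following minimal sketch computes $T^*$ and the per-page speeds $B_i$ for a handful of pages. The function name `optimal_schedule` and the sample sizes are assumptions chosen for illustration; they do not come from the slides.

```python
def optimal_schedule(page_sizes, bandwidth):
    """Return (T*, [B_i]) for the trivial 'download everything at once' solution.

    T*  = sum(P_i) / B   (time needed when the bandwidth is fully used)
    B_i = P_i / T*       (speed assigned to page i, proportional to its size)
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    t_star = sum(page_sizes) / bandwidth
    if t_star == 0:
        # Nothing to download: every page gets speed zero.
        return 0.0, [0.0 for _ in page_sizes]
    speeds = [size / t_star for size in page_sizes]
    return t_star, speeds


# Hypothetical sizes (in KB) and bandwidth (in KB/s).
t_star, speeds = optimal_schedule([100, 300, 600], bandwidth=100)
```

With these numbers the speeds add up to the full bandwidth, so every page finishes at the same moment $T^*$, which is what makes this scenario optimal.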
  • 7. Figure: Optimal scenario
  • 8. Restrictions
      • Robot exclusion protocol [Koster, 1995]
      • Waiting time ≈ 10 to 30 seconds
      • Web site bandwidth $B_i^{MAX}$ is lower than the crawler bandwidth $B$
      • The distribution of Web site sizes is very skewed
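The first two restrictions can be illustrated with a short sketch. It uses Python's standard `urllib.robotparser` to evaluate an inline robots.txt, and a helper that gives a lower bound on the time spent on one site when only the waits are counted. The robots.txt content, the agent name `example-bot`, the URLs, and the helper `min_site_time` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally (no network access).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def min_site_time(num_pages, wait_seconds):
    """Lower bound (seconds) to fetch num_pages from one site, counting only waits."""
    return max(num_pages - 1, 0) * wait_seconds


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("example-bot", "http://www.example.com/index.html")
blocked = parser.can_fetch("example-bot", "http://www.example.com/private/a.html")
```

Under these assumptions, a site with 25,000 pages and a 15-second wait needs at least 374,985 seconds (a little over four days) regardless of bandwidth, which is why the waiting time dominates the scheduling problem for large sites.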
  • 12. Figure: Distribution of site sizes
  • 13. Figure: Realistic scenario
  • 14. Figure: Number of active robots in a batch
  • 15. Goal
      • If each page has a certain score, capture most of the total value of this score by downloading just a fraction of the pages.
      • We will use the total Pagerank of the downloaded set vs. the fraction of downloaded pages as a measure of quality.
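The quality measure on slide 15 can be sketched as a curve of cumulative Pagerank against the fraction of pages downloaded. The function name and the toy Pagerank values below are assumptions for illustration only.

```python
def cumulative_pagerank_curve(download_order, pagerank):
    """Return [(fraction of pages downloaded, fraction of total Pagerank)].

    download_order: pages in the order a strategy downloads them.
    pagerank: dict mapping every page of the collection to its Pagerank.
    """
    total = sum(pagerank.values())
    n = len(pagerank)
    curve = []
    accumulated = 0.0
    for i, page in enumerate(download_order, start=1):
        accumulated += pagerank[page]
        curve.append((i / n, accumulated / total))
    return curve


# Toy collection: one strategy fetches important pages first, the other last.
pagerank = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
good = cumulative_pagerank_curve(["a", "b", "c", "d"], pagerank)
bad = cumulative_pagerank_curve(["d", "c", "b", "a"], pagerank)
```

After half of the pages, the first ordering has captured 75% of the total Pagerank and the second only 25%; a better strategy is one whose curve rises faster.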
  • 16. Algorithms
      • Algorithms are based on a scheduler with two levels of queues:
      • Queue of Web sites
      • Queue of Web pages in each Web site
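A two-level scheduler of this kind might look like the sketch below. It is not the authors' implementation: the class name, the choice to order sites by the time they may next be contacted, and the choice to order pages by a numeric priority are all assumptions made to keep the example small.

```python
import heapq
import itertools


class TwoLevelScheduler:
    """Queue of Web sites, plus one queue of pages per Web site (a sketch)."""

    def __init__(self, wait_seconds):
        self.wait = wait_seconds
        self.site_queue = []    # heap of (time site may be contacted, tie, site)
        self.page_queues = {}   # site -> heap of (-priority, tie, url)
        self.ready_at = {}      # site -> earliest time of the next request
        self._tie = itertools.count()

    def add_page(self, site, url, priority):
        """Queue a page; a site enters the site queue when it gets its first page."""
        if site not in self.page_queues:
            self.page_queues[site] = []
            ready = self.ready_at.get(site, 0.0)
            heapq.heappush(self.site_queue, (ready, next(self._tie), site))
        heapq.heappush(self.page_queues[site], (-priority, next(self._tie), url))

    def next_page(self, now):
        """Return (site, url) for a site that may be contacted at `now`, else None."""
        if not self.site_queue or self.site_queue[0][0] > now:
            return None
        _, _, site = heapq.heappop(self.site_queue)
        _, _, url = heapq.heappop(self.page_queues[site])
        self.ready_at[site] = now + self.wait
        if self.page_queues[site]:
            # The site still has pages: it re-enters the queue after the wait.
            heapq.heappush(self.site_queue,
                           (self.ready_at[site], next(self._tie), site))
        else:
            del self.page_queues[site]
        return site, url


scheduler = TwoLevelScheduler(wait_seconds=15)
scheduler.add_page("a.cl", "http://a.cl/", priority=1.0)
scheduler.add_page("a.cl", "http://a.cl/x", priority=0.5)
scheduler.add_page("b.cl", "http://b.cl/", priority=0.8)
first = scheduler.next_page(now=0.0)
second = scheduler.next_page(now=0.0)
third = scheduler.next_page(now=1.0)    # every remaining site is still waiting
fourth = scheduler.next_page(now=15.0)
```

The separation matters because the waiting time is enforced per site, while the ordering strategy (Pagerank-based, Depth, Length) only decides priorities inside and across the queues.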
  • 19. Figure: Queues used for the scheduling
  • 20. Algorithms based on Pagerank
      • Optimal/Oracle: the crawler asks an “oracle” for the Pagerank value of each page in the frontier. This is not available in a real crawl, as we do not have the entire graph.
      • The average relative error for estimating the Pagerank four months ahead is about 78% [Cho and Adams, 2004], so historical information from previous crawls is not very useful.
      • Batch-Pagerank: Pagerank calculations are executed over the subset of known pages [Cho et al., 1998].
      • Partial-Pagerank: a “temporary” Pagerank value is assigned to the pages in between batch-Pagerank calculations.
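The Batch-Pagerank idea, running Pagerank over only the pages known so far, can be sketched with a plain power iteration. The damping factor of 0.85, the uniform redistribution of rank from pages without known out-links, and the toy graph are assumptions; the slides do not specify these details.

```python
def batch_pagerank(links, damping=0.85, iterations=50):
    """Pagerank by power iteration over the known subgraph only.

    links maps each downloaded page to the pages it links to. Link targets
    that are not keys of `links` are frontier pages: known, not yet downloaded.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no known out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: "c" has been seen in a link but not downloaded yet.
links = {"home": ["a", "b"], "a": ["home"], "b": ["home", "c"]}
rank = batch_pagerank(links)
frontier = [page for page in rank if page not in links]
```

A crawler following this strategy would recompute the ranks after each batch of downloads and pick the frontier pages with the highest current value.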
  • 23. Algorithms not based on Pagerank
      • Depth: pages are given a priority based on their depths. This is graph traversal in breadth-first ordering [Najork and Wiener, 2001].
      • Length: pages from the Web sites which seem to be bigger are crawled first. We do not know which are really the bigger Web sites until the end of the crawl, so we use partial information.
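Both orderings are simple enough to sketch directly. In the example below, `depth_order` is a standard breadth-first traversal, and `length_order` ranks sites by the number of pages known so far, which is one possible reading of "partial information". The function names and toy data are assumptions.

```python
from collections import deque


def depth_order(links, start):
    """Breadth-first order of the pages reachable from `start` (Depth strategy)."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def length_order(known_pages_by_site):
    """Sites sorted by the number of pages known so far, largest first (Length strategy)."""
    return sorted(known_pages_by_site,
                  key=lambda site: len(known_pages_by_site[site]),
                  reverse=True)


links = {"home": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
by_depth = depth_order(links, "home")
by_length = length_order({"small.cl": ["/"], "big.cl": ["/", "/x", "/y"]})
```

Neither ordering needs a Pagerank computation, which makes them much cheaper to maintain during the crawl.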
  • 25. Experiments
      • Download a sample of pages using the WIRE crawler [Baeza-Yates and Castillo, 2002]
      • 3.5 million pages from over 50,000 Web sites in .CL
      • At most 25,000 pages from each Web site
      • Strategies are simulated on a graph built using actual data
      • Simulation includes: bandwidth saturation, network speed of different Web sites, page sizes, waiting time, latency, etc.
  • 30. Simulation parameters
      • Algorithm
      • Waiting time between pages from the same Web site: $w$
      • Number of pages downloaded per connection when re-using the HTTP connection: $k$
      • Number of robots: $r$
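One way to organize runs over these four parameters is a small configuration record and a grid of combinations, as in the sketch below. The class name, the helper, and the sample values are hypothetical; the slides do not state which values were simulated.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SimulationParams:
    algorithm: str  # scheduling strategy, e.g. "Depth" or "Length"
    w: float        # waiting time (seconds) between pages from the same Web site
    k: int          # pages downloaded per connection when re-using the HTTP connection
    r: int          # number of robots


def parameter_grid(algorithms, waits, pages_per_connection, robots):
    """Return every combination of the four simulation parameters."""
    return [
        SimulationParams(a, w, k, r)
        for a, w, k, r in product(algorithms, waits, pages_per_connection, robots)
    ]


# Hypothetical values, chosen only to show the shape of the grid.
grid = parameter_grid(["Depth", "Length"], [15.0], [1, 100], [1, 64])
```

Because the strategies are simulated on a fixed graph, each configuration in such a grid can be compared under identical conditions.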
  • 34. Figure: Results with one robot
  • 35. Figure: Results with many robots
  • 36. Figure: Speed-ups with the “Length” strategy
  • 37. Figure: Crawling the real Web using the “Length” strategy
  • 38. Figure: Pagerank vs. day of crawl
  • 39. Figure: Depth is not correlated with Pagerank when the depth is ≥ 2 links from the home page
  • 40. Summary
      • The restrictions, especially the waiting time, create a difficult scheduling problem.
      • A strategy with an “oracle” was too greedy.
      • We try to keep Web sites in the frontier for as long as possible, so we always have several Web sites to choose from.
      • Simulation ensures the same conditions for every strategy, which is critical because the Web is very dynamic.
  • 44. Open problems
      • Scheduling using historical information
      • Exploiting the Web’s structure
      • Adversarial IR: spam detection before downloading the pages
  • 47. References
      • Baeza-Yates, R. and Castillo, C. (2002). Balancing volume, quality and freshness in web crawling. In Soft Computing Systems - Design, Management and Applications, pages 565–572, Santiago, Chile. IOS Press Amsterdam.
      • Cho, J. and Adams, R. (2004). Page quality: In search of an unbiased Web ranking. Technical report, UCLA Computer Science.
      • Cho, J., García-Molina, H., and Page, L. (1998). Efficient crawling through URL ordering. In Proceedings of the Seventh Conference on World Wide Web, Brisbane, Australia.
      • Koster, M. (1995). Robots in the web: threat or treat? ConneXions, 9(4).
      • Lawrence, S. and Giles, C. L. (1998). Searching the World Wide Web. Science, 280(5360):98–100.
      • Najork, M. and Wiener, J. L. (2001). Breadth-first crawling yields high-quality pages. In Proceedings of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong. Elsevier Science.
      • StatMarket (2003). Search engine referrals nearly double worldwide. http://websidestory.com/pressroom/pressreleases.html?id=181.