Design and Implementation of a High-Performance Distributed Web Crawler


    Vladislav Shkapenyuk* Torsten Suel
                CIS Department
              Polytechnic University
               Brooklyn, NY 11201



* Currently at AT&T Research Labs, Florham Park
Overview:
 1. Introduction
 2. PolyBot System Architecture
 3. Data Structures and Performance
 4. Experimental Results
 5. Discussion and Open Problems
1. Introduction
Web Crawler: (also called spider or robot)


     tool for data acquisition in search engines
     large engines need high-performance
     crawlers
     need to parallelize crawling task
     PolyBot: a parallel/distributed web crawler
     cluster vs. wide-area distributed
Basic structure of a search engine:

[diagram: Crawler writes pages to disks; an indexing step builds the Index; Search.com answers the query “computer” by index lookup]
Crawler

[diagram: crawler fetching pages from the web onto disks]

   fetches pages from the web
   starts at set of “seed pages”
   parses fetched pages for hyperlinks
   then follows those links (e.g., BFS)
   variations:
- recrawling
- focused crawling
- random walks
Breadth-First Crawl:
  Basic idea:
- start at a set of known URLs
- explore in “concentric circles” around these URLs

[diagram: start pages, then distance-one pages, then distance-two pages]

  used by broad web search engines
  balances load between servers
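
A minimal single-machine sketch of breadth-first crawling, in Python (the language of PolyBot's downloader). This is an illustration only, not PolyBot code; all names are made up, and a real crawler adds robot exclusion, courtesy delays, and disk-based structures:

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def bfs_crawl(seed_urls, max_pages=1000):
        seen = set(seed_urls)             # in-memory URL-seen set (toy scale)
        frontier = deque(seed_urls)       # FIFO queue gives distance ordering
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue                  # odd server behavior: skip the page
            fetched += 1
            # crude link extraction; a real crawler uses a proper parser
            for link in re.findall(r'href="([^"#]+)"', html):
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)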
Crawling Strategy and Download Rate:
   crawling strategy: “What page to download next?”
   download rate: “How many pages per second?”
   different scenarios require different strategies
   lots of recent work on crawling strategy
   little published work on optimizing download rate
(main exception: Mercator from DEC/Compaq/HP?)
   somewhat separate issues
   building a slow crawler is (fairly) easy ...
Basic System Architecture

[diagram: crawling application, crawl manager, downloaders, and DNS resolvers]

    Application determines crawling strategy
System Requirements:
   flexibility (different crawling strategies)
   scalability (sustainable high performance at low cost)
   robustness
(odd server content/behavior, crashes)
   crawling etiquette and speed control
(robot exclusion, 30 second intervals, domain level
throttling, speed control for other users)
   manageable and reconfigurable
(interface for statistics and control, system setup)
Details: (lots of ‘em)
       robot exclusion
   - robots.txt file and meta tags
   - robot exclusion adds overhead
       handling filetypes
    (exclude some extensions, and use MIME types)
      URL extensions and CGI scripts
   (to strip or not to strip? Ignore?)
      frames, imagemaps
      black holes (robot traps)
   (limit maximum depth of a site)
      different names for same site?
   (could check IP address, but no perfect solution)
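
For the robot-exclusion detail above, a hedged sketch using Python's standard-library robots.txt parser (the function name is hypothetical; a real crawler caches one parser per host, since fetching robots.txt is exactly the overhead noted above):

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def allowed(url, agent="PolyBot"):
        # Fetch and parse the host's robots.txt, then test this URL's path.
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = RobotFileParser()
        rp.set_url(root + "/robots.txt")
        try:
            rp.read()
        except OSError:
            return True        # policy choice: unreachable robots.txt = allow
        return rp.can_fetch(agent, url)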
Crawling courtesy
    minimize load on crawled servers
    no more than one outstanding request per site
    better: wait 30 seconds between accesses to a site
 (this number is not fixed)
    problems:
 - one server may have many sites
 - one site may be on many servers
 - at one page every 30 seconds, ~3 years to crawl a 3-million-page site!
    give contact info for large crawls
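
A toy sketch of the courtesy rule (illustrative only; PolyBot keeps this kind of state on disk, and the class name is invented): one FIFO of URLs per site plus a heap that releases a site only 30 seconds after its previous access.

    import heapq, time
    from collections import defaultdict, deque

    INTERVAL = 30.0                      # seconds between accesses to a site

    class CourtesyScheduler:
        def __init__(self):
            self.queues = defaultdict(deque)   # per-site pending URLs
            self.ready = []                    # heap of (next_allowed, site)

        def add(self, site, url):
            if not self.queues[site]:          # newly active site
                heapq.heappush(self.ready, (time.time(), site))
            self.queues[site].append(url)

        def next_url(self):
            if not self.ready or self.ready[0][0] > time.time():
                return None                    # nothing polite to fetch yet
            _, site = heapq.heappop(self.ready)
            url = self.queues[site].popleft()
            if self.queues[site]:              # re-arm the site 30 seconds out
                heapq.heappush(self.ready, (time.time() + INTERVAL, site))
            return url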
Contributions:
    distributed architecture based on collection of services
- separation of concerns
- efficient interfaces
    I/O efficient techniques for URL handling
- lazy URL-seen structure
- manager data structures
    scheduling policies
- manager scheduling and shuffling
    resulting system limited by network and parsing
performance
    detailed description and how-to (limited experiments)
2. PolyBot System Architecture
Structure:
  separation of crawling strategy and basic system
  collection of scalable distributed services
(DNS, downloading, scheduling, strategy)
  for clusters and wide-area distributed crawling
  optimized per-node performance
  no random disk accesses (no per-page access)
Basic Architecture (ctd):
     application issues requests to manager
     manager does DNS and robot exclusion
     manager schedules URL on downloader
     downloader gets file and puts it on disk
     application is notified of new files
     application parses new files for hyperlinks
     application sends data to storage component
(indexing done later)
System components:
   downloader: optimized HTTP client written in Python
(everything else in C++)
   DNS resolver: uses asynchronous DNS library
   manager uses Berkeley DB and STL for external and
internal data structures
   manager does robot exclusion by generating requests
to downloaders and parsing files
   application does parsing and handling of URLs
(has this page already been downloaded?)
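
The original downloader predates today's async libraries, so purely as a loose modern sketch of an optimized HTTP client that keeps many connections in flight, here assuming the third-party aiohttp package (not part of PolyBot):

    import asyncio
    import aiohttp

    async def fetch_all(urls, concurrency=50):
        # Keep up to `concurrency` requests in flight; return (url, body) pairs.
        sem = asyncio.Semaphore(concurrency)
        timeout = aiohttp.ClientTimeout(total=30)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async def fetch(url):
                async with sem:
                    try:
                        async with session.get(url) as resp:
                            return url, await resp.read()
                    except (aiohttp.ClientError, asyncio.TimeoutError):
                        return url, None       # errors handled by skipping
            return await asyncio.gather(*(fetch(u) for u in urls))

    # results = asyncio.run(fetch_all(["http://example.com/"]))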
Scaling the system:
   small system on previous pages:
3-5 workstations and 250-400 pages/sec peak
   can scale up by adding downloaders and DNS resolvers
   at 400-600 pages/sec, application becomes bottleneck
   at 8 downloaders, manager becomes bottleneck
   need to replicate application and manager
   hash-based technique (Internet Archive crawler)
partitions URLs and hosts among application parts
   data transfer batched and via file system (NFS)
Scaling up:
    20 machines
    1500 pages/s?
    depends on crawl strategy
    hash to nodes based on site (b/c robot exclusion); see the sketch below
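
A minimal sketch of the hash-based partitioning: hash the URL's host (not the full URL) to pick a node, so that one site's robot-exclusion and courtesy state stays on a single node. The function name is illustrative:

    import hashlib
    from urllib.parse import urlparse

    def node_for(url, n_nodes=20):
        # Partition by site: every URL of a host maps to the same node,
        # which is what robot exclusion requires.
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("ascii", "replace")).digest()
        return int.from_bytes(digest[:4], "big") % n_nodes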
3. Data Structures and Techniques
Crawling Application:
   parsing using pcre library
   NFS eventually bottleneck
   URL-seen problem:
- need to check if file has been parsed or downloaded before
- after 20 million pages, we have “seen” over 100 million URLs
- each URL is 50 to 75 bytes on average
   Options: compress URLs in main memory, or use disk
- prefix+Huffman coding (DEC, JY01) or Bloom filter (Archive)
- disk access with caching (Mercator)
- we use lazy/bulk operations on disk
Implementation of URL-seen check:
- while less than a few million URLs seen, keep in main memory
- then write URLs to file in alphabetic, prefix-compressed order
- collect new URLs in memory and periodically perform bulk
check by merging new URLs into the file on disk (sketch below)
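
A rough sketch of that lazy/bulk check (illustrative; PolyBot also prefix-compresses the on-disk file): a single sequential two-pointer merge of the sorted in-memory batch against the sorted file, collecting the truly new URLs.

    import os

    def bulk_merge_urlseen(seen_path, new_urls):
        # Merge sorted batch vs. sorted file in one sequential pass:
        # no random disk accesses, cost amortized over the whole batch.
        batch, unseen = sorted(set(new_urls)), []
        with open(seen_path) as old, open(seen_path + ".tmp", "w") as out:
            disk = (line.rstrip("\n") for line in old)
            cur = next(disk, None)
            for url in batch:
                while cur is not None and cur < url:   # copy smaller disk URLs
                    out.write(cur + "\n")
                    cur = next(disk, None)
                if cur != url:                         # not on disk: truly new
                    unseen.append(url)
                    out.write(url + "\n")
            while cur is not None:                     # copy the remaining tail
                out.write(cur + "\n")
                cur = next(disk, None)
        os.replace(seen_path + ".tmp", seen_path)
        return unseen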
   When is a newly parsed URL downloaded?
   Reordering request stream
- want to space out requests from the same subdomain
- needed due to load on small domains and due to security tools
- sort URLs with hostname reversed (e.g., com.amazon.www),
and then “unshuffle” the stream, which gives a provable load balance (sketch below)
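
A toy version of the reorder step (the exact “unshuffle” is not spelled out here, so round-robin interleaving is an assumption): sorting by reversed hostname clusters each site's URLs together, and interleaving the clusters spaces out same-host requests.

    from itertools import zip_longest
    from urllib.parse import urlparse

    def reversed_host_key(url):
        # www.amazon.com -> com.amazon.www, so one site's URLs sort together
        return ".".join(reversed(urlparse(url).netloc.split(".")))

    def unshuffle(urls):
        # Group the host-sorted stream into per-host runs, then interleave
        # them round-robin so consecutive requests hit different hosts.
        runs, last = [], None
        for url in sorted(urls, key=reversed_host_key):
            key = reversed_host_key(url)
            if key != last:
                runs.append([])
                last = key
            runs[-1].append(url)
        return [u for batch in zip_longest(*runs) for u in batch if u is not None]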
Crawling Manager
  large stream of incoming URL request files
  goal: schedule URLs roughly in the order that they
come, while observing time-out rule (30 seconds)
and maintaining high speed
  must do DNS and robot excl. “right before” download
  keep requests on disk as long as possible!
- otherwise, structures grow too large after a few million pages
(performance killer)
Manager Data Structures:




  when to insert new URLs into internal structures?
URL Loading Policy
  read new request file from disk whenever less than x hosts in ready queue
  choose x > speed * timeout (e.g., 100 pages/s * 30 s, so x > 3000)
  # of current host data structures is
x + speed * timeout + n_down + n_transit
which is usually < 2x
  nice behavior for BDB caching policy
  performs reordering only when necessary!
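
A hedged sketch of the loading condition (all names hypothetical): the manager pulls the next request file off disk only when the ready queue has thinned below x, which keeps the set of in-memory host structures bounded.

    SPEED, TIMEOUT = 100, 30            # pages/s and courtesy interval (s)
    X = 5000                            # chosen so X > SPEED * TIMEOUT = 3000

    def refill(ready_hosts, request_files, parse_request_file):
        # Load request files only while fewer than X hosts are ready; host
        # structures then stay near
        #   X + SPEED*TIMEOUT + n_down + n_transit < 2*X,
        # small and stable enough for Berkeley DB's cache to behave well.
        while len(ready_hosts) < X and request_files:
            for host, urls in parse_request_file(request_files.pop(0)):
                ready_hosts.setdefault(host, []).extend(urls)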
4. Experimental Results
   crawl of 120 million pages over 19 days
161 million HTTP requests
16 million robots.txt requests
138 million successful non-robots requests
17 million HTTP errors (401, 403, 404 etc)
121 million pages retrieved
   slow during day, fast at night
   peak about 300 pages/s over T3
   many downtimes due to attacks, crashes, revisions
   “slow tail” of requests at the end (4 days)
   lots of things happen
Experimental Results ctd.




    [chart: bytes in, bytes out, frames out on the Poly T3 connection
    over 24 hours of 5/28/01 (courtesy of AppliedTheory)]
Experimental Results ctd.
   sustaining performance:
- will find out when data structures hit disk
- I/O-efficiency vital
   speed control tricky
- vary number of connections based on feedback (one possible rule is sketched below)
- also upper bound on connections
- complicated interactions in system
- not clear what we should want
  other configuration: 140 pages/sec sustained
on 2 Ultra10 with 60GB EIDE and 1GB/768MB
  similar for Linux on Intel
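
The slides do not give the feedback rule, so purely as one possible illustration: an additive-increase/multiplicative-decrease adjustment of the connection count toward a target rate, with the upper bound mentioned above.

    MAX_CONNECTIONS = 1000                 # hard upper bound on connections

    def adjust_connections(n_conn, achieved_rate, target_rate):
        # Invented AIMD-style rule; the slides only say the interactions
        # are complicated and the right objective is unclear.
        if achieved_rate < target_rate:
            return min(n_conn + 10, MAX_CONNECTIONS)   # ramp up gently
        return max(int(n_conn * 0.9), 1)               # back off past target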
More Detailed Evaluation (to be done)
   Problems
- cannot get commercial crawlers
- need simulation system to find system bottlenecks
- often not much of a tradeoff (get it right!)
   Example: manager data structures
- with our loading policy, manager can feed several
downloaders
- naïve policy: disk access per page
   parallel communication overhead
- low for limited number of nodes (URL exchange)
- wide-area distributed: where do you want the data?
- more relevant for highly distributed systems
5. Discussion and Open Problems
Related work
    Mercator (Heydon/Najork from DEC/Compaq)
- used in AltaVista
- centralized system (2-CPU Alpha with RAID disks)
- URL-seen test by fast disk access and caching
- one thread per HTTP connection
- completely in Java, with pluggable components
    Atrax: very recent distributed extension to Mercator
- combines several Mercators
- URL hashing, and off-line URL check (as we do)
Related work (ctd.)
    early Internet Archive crawler (circa 1996)
- uses hashing to partition URLs between crawlers
- Bloom filter for “URL seen” structure
    early Google crawler (1998)
    P2P crawlers (grub.org and others)
    Cho/Garcia-Molina (WWW 2002)
- study of overhead/quality tradeoff in parallel crawlers
- difference: we scale services separately, and focus on
single-node performance
- in our experience, parallel overhead low
Open Problems:
    Measuring and tuning peak performance
- need simulation environment
- eventually reduces to parsing and network
- to be improved: space, fault-tolerance (transactions?)
    Highly Distributed crawling
- highly distributed (e.g., grub.org) ? (maybe)
- hybrid? (different services)
- few high-performance sites? (several Universities)
    Recrawling and focused crawling strategies
- what strategies?
- how to express?
- how to implement?
