      IRLbot: Scaling to 6 Billion Pages and Beyond
                          Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov



   Abstract—This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.

   All authors are with the Department of Computer Science, Texas A&M University, College Station, TX 77843 USA (email: {h0l9314, dleonard, xmwang, dmitri}@cs.tamu.edu).

                        I. INTRODUCTION

   Over the last decade, the World Wide Web (WWW) has evolved from a handful of pages to billions of diverse objects. In order to harvest this enormous data repository, search engines download parts of the existing web and offer Internet users access to this database through keyword search. Search engines consist of two fundamental components – web crawlers, which find, download, and parse content in the WWW, and data miners, which extract keywords from pages, rank document importance, and answer user queries. This paper does not deal with data miners, but instead focuses on the design of web crawlers that can scale to the size of the current¹ and future web, while implementing consistent per-website and per-server rate-limiting policies and avoiding being trapped in spam farms and infinite webs. We next discuss our assumptions and explain why this is a challenging issue.

   ¹Adding the size of all top-level domains using site queries (e.g., “site:.com”), Google’s index size in January 2008 can be estimated at 30 billion pages and Yahoo’s at 37 billion.

A. Scalability

   With the constant growth of the web, discovery of user-created content by web crawlers faces an inherent tradeoff between scalability, performance, and resource usage. The first term refers to the number of pages N a crawler can handle without becoming “bogged down” by the various algorithms and data structures needed to support the crawl. The second term refers to the speed S at which the crawler discovers the web as a function of the number of pages already crawled. The final term refers to the CPU and RAM resources Σ that are required to sustain the download of N pages at an average speed S. In most crawlers, larger N implies higher complexity of checking URL uniqueness, verifying robots.txt, and scanning the DNS cache, which ultimately results in lower S and higher Σ. At the same time, higher speed S requires smaller data structures, which often can be satisfied only by either lowering N or increasing Σ.

   Current research literature [2], [4], [7], [9], [14], [20], [22], [23], [25], [26], [27], [16] generally provides techniques that can solve a subset of the problem and achieve a combination of any two objectives (i.e., large slow crawls, small fast crawls, or large fast crawls with unbounded resources). They also do not analyze how the proposed algorithms scale for very large N given fixed S and Σ. Even assuming sufficient Internet bandwidth and enough disk space, the problem of designing a web crawler that can support large N (hundreds of billions of pages), sustain reasonably high speed S (thousands of pages/s), and operate with fixed resources Σ remains open.

B. Reputation and Spam

   The web has changed significantly since the days of early crawlers [4], [23], [25], mostly in the area of dynamically generated pages and web spam. With server-side scripts that can create infinite loops, high-density link farms, and unlimited number of hostnames, the task of web crawling has changed from simply doing a BFS scan of the WWW [24] to deciding in real time which sites contain useful information and giving them higher priority as the crawl progresses.

   Our experience shows that BFS eventually becomes trapped in useless content, which manifests itself in multiple ways: a) the queue of pending URLs contains a non-negligible fraction of links from spam sites that threaten to overtake legitimate URLs due to their high branching factor; b) the DNS resolver succumbs to the rate at which new hostnames are dynamically created within a single domain; and c) the crawler becomes vulnerable to the delay attack from sites that purposely introduce HTTP and DNS delays in all requests originating from the crawler’s IP address.

   No prior research crawler has attempted to avoid spam or document its impact on the collected data. Thus, designing low-overhead and robust algorithms for computing site reputation during the crawl is the second open problem that we aim to address in this work.

C. Politeness

   Even today, webmasters become easily annoyed when web crawlers slow down their servers, consume too much Internet bandwidth, or simply visit pages with “too much” frequency. This leads to undesirable consequences including blocking
of the crawler from accessing the site in question, various complaints to the ISP hosting the crawler, and even threats of legal action. Incorporating per-website and per-IP hit limits into a crawler is easy; however, preventing the crawler from “choking” when its entire RAM gets filled up with URLs pending for a small set of hosts is much more challenging. When N grows into the billions, the crawler ultimately becomes bottlenecked by its own politeness and is then faced with a decision to suffer significant slowdown, ignore politeness considerations for certain URLs (at the risk of crashing target servers or wasting valuable bandwidth on huge spam farms), or discard a large fraction of backlogged URLs, none of which is particularly appealing.

   While related work [2], [7], [14], [23], [27] has proposed several algorithms for rate-limiting host access, none of these studies have addressed the possibility that a crawler may stall due to its politeness restrictions or discussed management of rate-limited URLs that do not fit into RAM. This is the third open problem that we aim to solve in this paper.

D. Our Contributions

   The first part of the paper presents a set of web-crawler algorithms that address the issues raised above and the second part briefly examines their performance in an actual web crawl.² Our design stems from three years of web crawling experience at Texas A&M University using an implementation we call IRLbot [17] and the various challenges posed in simultaneously: 1) sustaining a fixed crawling rate of several thousand pages/s; 2) downloading billions of pages; and 3) operating with the resources of a single server.

   ²A separate paper will present a much more detailed analysis of the collected data.

   The first performance bottleneck we faced was caused by the complexity of verifying uniqueness of URLs and their compliance with robots.txt. As N scales into many billions, even the disk algorithms of [23], [27] no longer keep up with the rate at which new URLs are produced by our crawler (i.e., up to 184K per second). To understand this problem, we analyze the URL-check methods proposed in the literature and show that all of them exhibit severe performance limitations when N becomes sufficiently large. We then introduce a new technique called Disk Repository with Update Management (DRUM) that can store large volumes of arbitrary hashed data on disk and implement very fast check, update, and check+update operations using bucket sort. We model the various approaches and show that DRUM’s overhead remains close to the best theoretically possible as N reaches into the trillions of pages and that for common disk and RAM size, DRUM can be thousands of times faster than prior disk-based methods.

   The second bottleneck we faced was created by multi-million-page sites (both spam and legitimate), which became backlogged in politeness rate-limiting to the point of overflowing the RAM. This problem was impossible to overcome unless politeness was tightly coupled with site reputation. In order to determine the legitimacy of a given domain, we use a very simple algorithm based on the number of incoming links from assets that spammers cannot grow to infinity. Our algorithm, which we call Spam Tracking and Avoidance through Reputation (STAR), dynamically allocates the budget of allowable pages for each domain and all of its subdomains in proportion to the number of in-degree links from other domains. This computation can be done in real time with little overhead using DRUM even for millions of domains in the Internet. Once the budgets are known, the rates at which pages can be downloaded from each domain are scaled proportionally to the corresponding budget.
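   To make the allocation rule concrete, the following Python sketch assigns each domain a page budget proportional to its in-degree from other domains; the function name, thresholds, and scaling constant are illustrative assumptions, not IRLbot's actual STAR implementation (which is presented in Section VI).

    # Illustrative budget allocation proportional to domain in-degree (hypothetical
    # parameters); domains with few external in-links, e.g. freshly minted spam
    # hosts, receive only the small base budget.
    def allocate_budgets(in_degree, base_budget=10, scale=10, max_budget=100_000):
        """in_degree: dict mapping domain -> number of distinct linking domains."""
        return {domain: min(max_budget, base_budget + scale * deg)
                for domain, deg in in_degree.items()}

    print(allocate_budgets({"example.com": 5000, "link-farm.example": 0}))
    # {'example.com': 50010, 'link-farm.example': 10}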
   The final issue we faced in later stages of the crawl was how to prevent live-locks in processing URLs that exceed their budget. Periodically re-scanning the queue of over-budget URLs produces only a handful of good links at the cost of huge overhead. As N becomes large, the crawler ends up spending all of its time cycling through failed URLs and makes very little progress. The solution to this problem, which we call Budget Enforcement with Anti-Spam Tactics (BEAST), involves a dynamically increasing number of disk queues among which the crawler spreads the URLs based on whether they fit within the budget or not. As a result, almost all pages from sites that significantly exceed their budgets are pushed into the last queue and are examined with lower frequency as N increases. This keeps the overhead of reading spam at some fixed level and effectively prevents it from “snowballing.”
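   A toy, in-memory Python rendition of this queue-splitting idea is sketched below; the real BEAST structure described in Section VII operates on disk queues and differs in detail.

    from collections import deque

    class BudgetQueues:
        """Toy stand-in for BEAST's disk queues: URLs are spread over a growing
        list of queues based on budget compliance, so pages from heavily
        over-budget sites accumulate at the back and are revisited less often."""
        def __init__(self):
            self.queues = [deque(), deque()]   # first = within budget, last = over budget

        def admit(self, url, within_budget):
            target = self.queues[0] if within_budget else self.queues[-1]
            target.append(url)

        def grow(self):
            # called occasionally as N increases; with more queues in front of it,
            # the last (spam-heavy) queue is reached less and less frequently
            self.queues.insert(-1, deque())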
   The above algorithms were deployed in IRLbot [17] and tested on the Internet in June-August 2007 using a single server attached to a 1 gb/s backbone of Texas A&M. Over a period of 41 days, IRLbot issued 7,606,109,371 connection requests, received 7,437,281,300 HTTP responses from 117,576,295 hosts, and successfully downloaded N = 6,380,051,942 unique HTML pages at an average rate of 319 mb/s (1,789 pages/s). After handicapping quickly branching spam and over 30 million low-ranked domains, IRLbot parsed out 394,619,023,142 links and found 41,502,195,631 unique pages residing on 641,982,061 hosts, which explains our interest in crawlers that scale to tens and hundreds of billions of pages as we believe a good fraction of 35B URLs not crawled in this experiment contains useful content.

   The rest of the paper is organized as follows. Section II overviews related work. Section III defines our objectives and classifies existing approaches. Section IV discusses how checking URL uniqueness scales with crawl size and proposes our technique. Section V models caching and studies its relationship with disk overhead. Section VI discusses our approach to ranking domains and Section VII introduces a scalable method of enforcing budgets. Section VIII summarizes our experimental statistics and Section IX concludes the paper.

                        II. RELATED WORK

   There is only a limited number of papers describing detailed web-crawler algorithms and offering their experimental performance. First-generation designs [9], [22], [25], [26] were developed to crawl the infant web and commonly reported collecting less than 100,000 pages. Second-generation crawlers [2], [7], [15], [14], [23], [27] often pulled several hundred million pages and involved multiple agents in the crawling
process. We discuss their design and scalability issues in the next section.

   Another direction was undertaken by the Internet Archive [6], [16], which maintains a history of the Internet by downloading the same set of pages over and over. In the last 10 years, this database has collected over 85 billion pages, but only a small fraction of them are unique. Additional crawlers are [4], [8], [13], [20], [28], [29]; however, their focus usually does not include the large scale assumed in this paper and their fundamental crawling algorithms are not presented in sufficient detail to be analyzed here.

   The largest prior crawl using a fully-disclosed implementation appeared in [23], where Mercator obtained N = 473 million HTML pages in 17 days (we exclude non-HTML content since it has no effect on scalability). The fastest reported crawler was [13] with 816 pages/s, but the scope of their experiment was only N = 25 million. Finally, to our knowledge, the largest webgraph used in any paper was AltaVista’s 2003 crawl with 1.4B pages and 6.6B links [11].

                III. OBJECTIVES AND CLASSIFICATION

   This section formalizes the purpose of web crawling and classifies algorithms in related work, some of which we study later in the paper.

A. Crawler Objectives

   We assume that the ideal task of a crawler is to start from a set of seed URLs Ω0 and eventually crawl the set of all pages Ω∞ that can be discovered from Ω0 using HTML links. The crawler is allowed to dynamically change the order in which URLs are downloaded in order to achieve a reasonably good coverage of “useful” pages ΩU ⊆ Ω∞ in some finite amount of time. Due to the existence of legitimate sites with hundreds of millions of pages (e.g., ebay.com, yahoo.com, blogspot.com), the crawler cannot make any restricting assumptions on the maximum number of pages per host, the number of hosts per domain, the number of domains in the Internet, or the number of pages in the crawl. We thus classify algorithms as non-scalable if they impose hard limits on any of these metrics or are unable to maintain crawling speed when these parameters become very large.

   We should also explain why this paper focuses on the performance of a single server rather than some distributed architecture. If one server can scale to N pages and maintain speed S, then with sufficient bandwidth it follows that m servers can maintain speed mS and scale to mN pages by simply partitioning the subset of all URLs and data structures between themselves (we assume that the bandwidth needed to shuffle the URLs between the servers is also well provisioned). Therefore, the aggregate performance of a server farm is ultimately governed by the characteristics of individual servers and their local limitations. We explore these limits in detail throughout the paper.

B. Crawler Operation

   The functionality of a basic web crawler can be broken down into several phases: 1) removal of the next URL u from the queue Q of pending pages; 2) download of u and extraction of new URLs u1, . . . , uk from u’s HTML tags; 3) for each ui, verification of uniqueness against some structure URLseen and checking compliance with robots.txt using some other structure RobotsCache; 4) addition of passing URLs to Q and URLseen; 5) update of RobotsCache if necessary. The crawler may also maintain its own DNScache structure in cases when the local DNS server is not able to efficiently cope with the load (e.g., its RAM cache does not scale to the number of hosts seen by the crawler or it becomes very slow after caching hundreds of millions of records).
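   The five phases above can be summarized in a minimal single-threaded Python loop; the sketch below uses placeholder helpers (download, parse_links, robots_allows) and a RAM-only URLseen, so it illustrates the control flow only, not a scalable design.

    # Minimal single-threaded rendition of phases 1)-5); helper functions are placeholders.
    def crawl(seed_urls, download, parse_links, robots_allows):
        Q = list(seed_urls)                  # queue of pending pages
        URLseen = set(seed_urls)             # RAM-only uniqueness structure (toy)
        RobotsCache = {}                     # host -> robots.txt rules, updated by robots_allows
        while Q:
            u = Q.pop(0)                     # 1) remove the next URL from Q
            page = download(u)               # 2) download u ...
            for ui in parse_links(page):     #    ... and extract new URLs u1..uk
                if ui in URLseen:            # 3) uniqueness check against URLseen
                    continue
                if not robots_allows(ui, RobotsCache):   # 3) robots.txt check; 5) may update cache
                    continue
                URLseen.add(ui)              # 4) add passing URL to URLseen ...
                Q.append(ui)                 #    ... and to Q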
                                 TABLE I
         COMPARISON OF PRIOR CRAWLERS AND THEIR DATA STRUCTURES

    Crawler               Year   Crawl size     URLseen           RobotsCache       DNScache     Q
                                 (HTML pages)   (RAM / disk)      (RAM / disk)
    WebCrawler [25]       1994   50K            database          –                 –            database
    Internet Archive [6]  1997   –              site-based / –    site-based / –    site-based   RAM
    Mercator-A [14]       1999   41M            LRU / seek        LRU / –           –            disk
    Mercator-B [23]       2001   473M           LRU / batch       LRU / –           –            disk
    Polybot [27]          2001   120M           tree / batch      database          database     disk
    WebBase [7]           2001   125M           site-based / –    site-based / –    site-based   RAM
    UbiCrawler [2]        2002   45M            site-based / –    site-based / –    site-based   RAM

   A summary of prior crawls and their methods in managing URLseen, RobotsCache, DNScache, and queue Q is shown in Table I. The table demonstrates that two approaches to storing visited URLs have emerged in the literature: RAM-only and hybrid RAM-disk. In the former case [2], [6], [7], crawlers keep a small subset of hosts in memory and visit them repeatedly until a certain depth or some target number of pages have been downloaded from each site. URLs that do not fit in memory are discarded and sites are assumed to never have more than some fixed volume of pages. This methodology results in truncated web crawls that require different techniques from those studied here and will not be considered in our comparison. In the latter approach [14], [23], [25], [27], URLs are first checked against a buffer of popular links and those not found are examined using a disk file. The RAM buffer may be an LRU cache [14], [23], an array of recently added URLs [14], [23], a general-purpose database with RAM caching [25], and a balanced tree of URLs pending a disk check [27].

   Most prior approaches keep RobotsCache in RAM and either crawl each host to exhaustion [2], [6], [7] or use an LRU cache in memory [14], [23]. The only hybrid approach
is used in [27], which employs a general-purpose database for storing downloaded robots.txt. Finally, with the exception of [27], prior crawlers do not perform DNS caching and rely on the local DNS server to store these records for them.

                IV. SCALABILITY OF DISK METHODS

   This section describes algorithms proposed in prior literature, analyzes their performance, and introduces our approach.

A. Algorithms

   In Mercator-A [14], URLs that are not found in memory cache are looked up on disk by seeking within the URLseen file and loading the relevant block of hashes. The method clusters URLs by their site hash and attempts to resolve multiple in-memory links from the same site in one seek. However, in general, locality of parsed out URLs is not guaranteed and the worst-case delay of this method is one seek/URL and the worst-case read overhead is one block/URL. A similar approach is used in WebCrawler [25], where a general-purpose database performs multiple seeks (assuming a common B-tree implementation) to find URLs on disk.

   Even with RAID, disk seeking cannot be reduced to below 3 − 5 ms, which is several orders of magnitude slower than required in actual web crawls (e.g., 5 − 10 microseconds in IRLbot). General-purpose databases that we have examined are much worse and experience a significant slowdown (i.e., 10 − 50 ms per lookup) after about 100 million inserted records. Therefore, these approaches do not appear viable unless RAM caching can achieve some enormously high hit rates (i.e., 99.7% for IRLbot). We examine whether this is possible in the next section when studying caching.
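   A quick back-of-the-envelope check in Python makes the gap explicit: at the peak URL production rate quoted in Section I-D (up to 184K URLs per second), the per-lookup time budget is only a few microseconds, while a single seek costs milliseconds.

    peak_urls_per_sec = 184_000                 # peak URL production rate from Section I-D
    budget_us = 1e6 / peak_urls_per_sec         # time available per uniqueness check
    seek_us = 4 * 1000                          # mid-range of the 3-5 ms seek latency above
    print(f"per-URL budget: {budget_us:.1f} us")                          # about 5.4 us
    print(f"one seek per URL is ~{seek_us / budget_us:.0f}x over budget")  # roughly 700x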
   Mercator-B [23] and Polybot [27] use a so-called batch disk check – they accumulate a buffer of URLs in memory and then merge it with a sorted URLseen file in one pass. Mercator-B stores only hashes of new URLs in RAM and places their text on disk. In order to retain the mapping from hashes to the text, a special pointer is attached to each hash. After the memory buffer is full, it is sorted in place and then compared with blocks of URLseen as they are read from disk. Non-duplicate URLs are merged with those already on disk and written into the new version of URLseen. Pointers are then used to recover the text of unique URLs and append it to the disk queue.
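   The batch check can be condensed into the following Python sketch, which keeps URLseen as a sorted list standing in for the on-disk file and uses pointers (indices) to recover the text of unique URLs; block-wise disk I/O is omitted.

    def batch_check(urlseen_sorted, buffer_hashes, buffer_text):
        """urlseen_sorted: sorted hashes already in the 'disk' file; buffer_hashes /
        buffer_text: hashes and strings of newly arrived URLs. Returns the merged
        (still sorted) file and the text of unique URLs destined for the queue."""
        # attach a pointer to each hash so the text can be recovered after sorting
        buf = sorted((h, i) for i, h in enumerate(buffer_hashes))
        merged, unique_ptrs, j = [], [], 0
        for h, ptr in buf:
            # advance through the sorted disk file (one sequential pass overall)
            while j < len(urlseen_sorted) and urlseen_sorted[j] < h:
                merged.append(urlseen_sorted[j])
                j += 1
            if (j < len(urlseen_sorted) and urlseen_sorted[j] == h) or \
               (merged and merged[-1] == h):
                continue                              # duplicate: drop it
            merged.append(h)                          # non-duplicate merged into URLseen
            unique_ptrs.append(ptr)
        merged.extend(urlseen_sorted[j:])
        # pointers recover the text of unique URLs, which is appended to the disk queue
        return merged, [buffer_text[p] for p in unique_ptrs]

    new_file, queued = batch_check([2, 7], [5, 7, 9], ["u5", "u7", "u9"])
    print(new_file, queued)                           # [2, 5, 7, 9] ['u5', 'u9']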
   Polybot keeps the entire URLs (i.e., actual strings) in memory and organizes them into a binary search tree. Once the tree size exceeds some threshold, it is merged with the disk file URLseen, which contains compressed URLs already seen by the crawler. Besides being enormously CPU intensive (i.e., compression of URLs and search in binary string trees are rather slow in our experience), this method has to perform more frequent scans of URLseen than Mercator-B due to the less-efficient usage of RAM.

B. Modeling Prior Methods

   Assume the crawler is in some steady state where the probability of uniqueness p among new URLs remains constant (we verify that this holds in practice later in the paper). Further assume that the current size of URLseen is U entries, the size of RAM allocated to URL checks is R, the average number of links per downloaded page is l, the average URL length is b, the URL compression ratio is q, and the crawler expects to visit N pages. It then follows that n = lN links must pass through URL check, np of them are unique, and bq is the average number of bytes in a compressed URL. Finally, denote by H the size of URL hashes used by the crawler and P the size of a memory pointer. Then we have the following result.

   Theorem 1: The overhead of URLseen batch disk check is ω(n, R) = α(n, R)bn bytes, where for Mercator-B:

       α(n, R) = 2(2UH + pHn)(H + P)/(bR) + 2 + p                              (1)

and for Polybot:

       α(n, R) = 2(2Ubq + pbqn)(b + 4P)/(bR) + p.                              (2)

      Proof: To prevent locking on URL check, both Mercator-B and Polybot must use two buffers of accumulated URLs (i.e., one for checking the disk and the other for newly arriving data). Assume this half-buffer allows storage of m URLs (i.e., m = R/[2(H + P)] for Mercator-B and m = R/[2(b + 4P)] for Polybot) and the size of the initial disk file is f (i.e., f = UH for Mercator-B and f = Ubq for Polybot).

   For Mercator-B, the i-th iteration requires writing/reading of mb bytes of arriving URL strings, reading the current URLseen, writing it back, and appending mp hashes to it, i.e., 2f + 2mb + 2mpH(i − 1) + mpH bytes. This leads to the following after adding the final overhead to store pbn bytes of unique URLs in the queue:

       ω(n) = Σ_{i=1}^{n/m} (2f + 2mb + 2mpHi − mpH) + pbn
            = nb [2(2UH + pHn)(H + P)/(bR) + 2 + p].                           (3)

   For Polybot, the i-th iteration has overhead 2f + 2mpbq(i − 1) + mpbq, which yields:

       ω(n) = Σ_{i=1}^{n/m} (2f + 2mpbqi − mpbq) + pbn
            = nb [2(2Ubq + pbqn)(b + 4P)/(bR) + p]                             (4)

and leads to (2).

   This result shows that ω(n, R) is a product of two elements: the number of bytes bn in all parsed URLs and how many times α(n, R) they are written to/read from disk. If α(n, R) grows with n, the crawler’s overhead will scale super-linearly and may eventually become overwhelming to the point of stalling the crawler. As n → ∞, the quadratic term in ω(n, R) dominates the other terms, which places Mercator-B’s asymptotic performance at

       ω(n, R) = 2(H + P)pHn²/R                                                (5)
and that of Polybot at

       ω(n, R) = 2(b + 4P)pbqn²/R.                                             (6)

   The ratio of these two terms is (H + P)H/[bq(b + 4P)], which for the IRLbot case with H = 8 bytes/hash, P = 4 bytes/pointer, b = 110 bytes/URL, and using very optimistic bq = 5 bytes/URL shows that Mercator-B is roughly 7.2 times faster than Polybot as n → ∞.

   The best performance of any method that stores the text of URLs on disk before checking them against URLseen (e.g., Mercator-B) is αmin = 2 + p, which is the overhead needed to write all bn bytes to disk, read them back for processing, and then append bpn bytes to the queue. Methods with memory-kept URLs (e.g., Polybot) have an absolute lower bound of αmin = p, which is the overhead needed to write the unique URLs to disk. Neither bound is achievable in practice, however.
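   To get a feel for these expressions, the Python snippet below evaluates α(n, R) from (1) and (2) for a few crawl sizes; the parameter values (p, l, U, and the 1 GB of RAM) are illustrative assumptions rather than the exact settings behind Tables II and III later in the paper.

    # Evaluate the per-byte overhead alpha(n, R) from (1) and (2). Parameter values
    # below are illustrative assumptions, not the exact settings used for the tables.
    H, P = 8, 4              # bytes per hash and per memory pointer
    b, bq = 110, 5           # average URL length and (optimistic) compressed length
    p = 0.11                 # assumed probability that a parsed-out URL is unique
    l = 50                   # assumed average number of links per page
    U = 100e6                # assumed current size of URLseen (entries)
    R = 1 << 30              # 1 GB of RAM dedicated to URL checks

    def alpha_mercator_b(n):
        return 2 * (2*U*H + p*H*n) * (H + P) / (b * R) + 2 + p      # equation (1)

    def alpha_polybot(n):
        return 2 * (2*U*bq + p*bq*n) * (b + 4*P) / (b * R) + p      # equation (2)

    for N in (800e6, 8e9, 80e9):                 # crawl sizes in pages
        n = l * N                                # links passing through the URL check
        print(f"N = {N:.0e}:  Mercator-B {alpha_mercator_b(n):9.1f}   Polybot {alpha_polybot(n):9.1f}")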
C. DRUM

   We now describe the URL-check algorithm used in IRLbot, which belongs to a more general framework we call Disk Repository with Update Management (DRUM). The purpose of DRUM is to allow for efficient storage of large collections of <key,value> pairs, where key is a unique identifier (hash) of some data and value is arbitrary information attached to the key. There are three supported operations on these pairs – check, update, and check+update. In the first case, the incoming set of data contains keys that must be checked against those stored in the disk cache and classified as being duplicate or unique. For duplicate keys, the value associated with each key can be optionally retrieved from disk and used for some processing. In the second case, the incoming list contains <key,value> pairs that need to be merged into the existing disk cache. If a given key exists, its value is updated (e.g., overridden or incremented); if it does not, a new entry is created in the disk file. Finally, the third operation performs both check and update in one pass through the disk cache. Also note that DRUM may be supplied with a mixed list where some entries require just a check, while others need an update.

   A high-level overview of DRUM is shown in Figure 1. In the figure, a continuous stream of tuples <key,value,aux> arrives into DRUM, where aux is some auxiliary data associated with each key. DRUM spreads pairs <key,value> between k disk buckets Q_1^H, . . . , Q_k^H based on their key (i.e., all keys in the same bucket have the same bit-prefix). This is accomplished by feeding pairs <key,value> into k memory arrays of size M each and then continuously writing them to disk as the buffers fill up. The aux portion of each key (which usually contains the text of URLs) from the i-th bucket is kept in a separate file Q_i^T in the same FIFO order as pairs <key,value> in Q_i^H. Note that to maintain fast sequential writing/reading, all buckets are pre-allocated on disk before they are used.

   [Fig. 1. Operation of DRUM.]

   Once the largest bucket reaches a certain size r < R, the following process is repeated for i = 1, . . . , k: 1) bucket Q_i^H is read into the bucket buffer shown in Figure 1 and sorted; 2) the disk cache Z is sequentially read in chunks of ∆ bytes and compared with the keys in bucket Q_i^H to determine their uniqueness; 3) those <key,value> pairs in Q_i^H that require an update are merged with the contents of the disk cache and written to the updated version of Z; 4) after all unique keys in Q_i^H are found, their original order is restored, Q_i^T is sequentially read into memory in blocks of size ∆, and the corresponding aux portion of each unique key is sent for further processing (see below). An important aspect of this algorithm is that all buckets are checked in one pass through disk cache Z.³

   ³Note that disk bucket sort is a well-known technique that exploits uniformity of keys; however, its usage in checking URL uniqueness and the associated performance model of web crawling has not been explored before.
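   The check+update pass can be summarized in the following in-memory Python sketch; lists and a dictionary stand in for the pre-allocated disk files Q_i^H, Q_i^T and the cache Z, and the sorted sequential merge against Z is abstracted away.

    class DrumSketch:
        """In-memory stand-in for DRUM's check+update: <key,value> pairs are spread
        over k buckets by key prefix while aux data is kept in FIFO order; when the
        buckets are flushed, new keys are merged into the cache Z and the aux data of
        unique keys is emitted. (The real DRUM keeps buckets and Z on disk, sorts each
        bucket, and merges all of them in one sequential pass over Z.)"""

        def __init__(self, k=4):
            self.k = k
            self.kv_buckets = [[] for _ in range(k)]    # stand-ins for Q_i^H
            self.aux_buckets = [[] for _ in range(k)]   # stand-ins for Q_i^T
            self.Z = {}                                 # key -> value ("disk cache")

        def check_update(self, key, value, aux):
            i = key % self.k                            # same prefix -> same bucket
            self.kv_buckets[i].append((key, value))
            self.aux_buckets[i].append(aux)

        def flush(self):
            """Process every bucket; return aux data of keys not previously in Z."""
            unique_aux = []
            for i in range(self.k):
                fresh = []                              # positions of unique keys
                for pos, (key, value) in enumerate(self.kv_buckets[i]):
                    if key not in self.Z:
                        self.Z[key] = value             # update: new key merged into Z
                        fresh.append(pos)
                # aux of unique keys is emitted in the original FIFO order of Q_i^T
                unique_aux.extend(self.aux_buckets[i][pos] for pos in fresh)
                self.kv_buckets[i].clear()
                self.aux_buckets[i].clear()
            return unique_aux

    # URLseen-style usage: key = URL hash, value empty, aux = URL text
    drum = DrumSketch()
    drum.check_update(hash("http://a/1"), None, "http://a/1")
    drum.check_update(hash("http://a/1"), None, "http://a/1")
    print(drum.flush())                                 # ['http://a/1']  (duplicate discarded)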
   We now explain how DRUM is used for storing crawler data. The most important DRUM object is URLseen, which implements only one operation – check+update. Incoming tuples are <URLhash,-,URLtext>, where the key is an 8-byte hash of each URL, the value is empty, and the auxiliary data is the URL string. After all unique URLs are found, their text strings (aux data) are sent to the next queue for possible crawling. For caching robots.txt, we have another DRUM structure called RobotsCache, which supports asynchronous check and update operations. For checks, it receives tuples <HostHash,-,URLtext> and for updates <HostHash,HostData,->, where HostData contains the robots.txt file, IP address of the host, and optionally other host-related information. The last DRUM object of this section is called RobotsRequested and is used for storing the hashes of sites for which a robots.txt has been requested. Similar to URLseen, it only supports simultaneous check+update and its incoming tuples are <HostHash,-,HostText>.
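   For reference, the tuple layouts of the three DRUM objects just described can be transcribed in Python as follows (an illustrative transcription of the text, not actual IRLbot type definitions):

    from collections import namedtuple

    # Tuple layouts of the three DRUM objects above.
    URLseenTuple    = namedtuple("URLseenTuple",    "url_hash value url_text")    # value empty; check+update
    RobotsCheck     = namedtuple("RobotsCheck",     "host_hash value url_text")   # asynchronous check
    RobotsUpdate    = namedtuple("RobotsUpdate",    "host_hash host_data aux")    # host_data = robots.txt, host IP, ...
    RobotsRequested = namedtuple("RobotsRequested", "host_hash value host_text")  # check+update

    t = URLseenTuple(url_hash=0x1234abcd, value=None, url_text="http://example.com/a")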
   [Fig. 2. High level organization of IRLbot.]

   Figure 2 shows the flow of new URLs produced by the crawling threads. They are first sent directly to URLseen using check+update. Duplicate URLs are discarded and unique ones are sent for verification of their compliance with the budget (both STAR and BEAST are discussed later in the paper). URLs that pass the budget are queued to be checked against robots.txt using RobotsCache. URLs that have a matching robots.txt file are classified immediately as passing or failing. Passing URLs are queued in Q and later downloaded by the crawling threads. Failing URLs are discarded.

   URLs that do not have a matching robots.txt are sent to the back of queue QR and their hostnames are passed through RobotsRequested using check+update. Sites whose
hash is not already present in this file are fed through queue QD into a special set of threads that perform DNS lookups and download robots.txt. They subsequently issue a batch update to RobotsCache using DRUM. Since in steady-state (i.e., excluding the initial phase) the time needed to download robots.txt is much smaller than the average delay in QR (i.e., 1-2 days), each URL makes no more than one cycle through this loop. In addition, when RobotsCache detects that certain robots.txt or DNS records have become outdated, it marks all corresponding URLs as “unable to check, outdated records,” which forces RobotsRequested to pull a new set of exclusion rules and/or perform another DNS lookup. Old records are automatically expunged during the update when RobotsCache is re-written.

   It should be noted that URLs are kept in memory only when they are needed for immediate action and all queues in Figure 2 are stored on disk. We should also note that DRUM data structures can support as many hostnames, URLs, and robots.txt exception rules as disk space allows.
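   Putting the pieces of Figure 2 together, the handling of one batch of new URLs can be sketched in Python as follows; the queue names follow the figure, while the URLseen, budget, and robots predicates are placeholders for the DRUM, STAR/BEAST, and RobotsCache machinery.

    def process_new_urls(new_urls, urlseen_check_update, budget_ok, robots_verdict, Q, QR, QD):
        """One pass of the Figure 2 pipeline (simplified, single-threaded sketch).
        urlseen_check_update(urls) returns only the unique URLs; robots_verdict(url)
        returns 'pass', 'fail', or None when no matching robots.txt is cached yet."""
        for url in urlseen_check_update(new_urls):   # duplicates dropped by URLseen
            if not budget_ok(url):                   # STAR budget / BEAST enforcement
                continue                             # (over-budget URLs are requeued, not shown)
            verdict = robots_verdict(url)            # RobotsCache check
            if verdict == "pass":
                Q.append(url)                        # ready queue, crawled by worker threads
            elif verdict is None:
                QR.append(url)                       # wait for this host's robots.txt
                QD.append(url.split("/")[2])         # crude hostname -> robots/DNS threads
                                                     # (deduplicated via RobotsRequested in IRLbot)
            # verdict == 'fail': the URL is discarded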
                                                                                                     when:
D. DRUM Model                                                                                                              ∆(H + P )        2M AH
                                                                                                                     R≥                +            .           (13)
                                                                                                                               H          ∆(H + P )
  Assume that the crawler maintains a buffer of size M = 256
KB for each open file and that the hash bucket size r must
be at least ∆ = 32 MB to support efficient reading during the check-merge phase. Further assume that the crawler can use up to D bytes of disk space for this process. Then we have the following result.
   Theorem 2: Assuming that R ≥ 2∆(1 + P/H), DRUM's URLseen overhead is ω(n, R) = α(n, R)bn bytes, where:

      α(n, R) = 8M(H + P)(2UH + pHn)/(bR²) + 2 + p + 2H/b    if R² < Λ
      α(n, R) = (H + b)(2UH + pHn)/(bD) + 2 + p + 2H/b        if R² ≥ Λ        (7)

and Λ = 8MD(H + P)/(H + b).
   Proof: Memory R needs to support 2k open file buffers and one block of URL hashes that are loaded from Q_i^H. In order to compute block size r, recall that it gets expanded by a factor of (H + P)/H when read into RAM due to the addition of a pointer to each hash value. We thus obtain that r(H + P)/H + 2Mk = R or:

      r = (R − 2Mk)H/(H + P).        (8)

Given that R ≥ 2∆(1 + P/H) from the statement of the theorem, it is easy to verify that (13) is always satisfied.
   Next, for the i-th iteration that fills up all k buckets, we need to write/read Q_i^T once (overhead 2krb/H) and read/write each bucket once as well (overhead 2kr). The remaining overhead is reading/writing URLseen (overhead 2f + 2krp(i − 1)) and appending the new URL hashes (overhead krp). We thus obtain that we need nH/(kr) iterations and:

      ω(n, R) = Σ_{i=1}^{nH/(kr)} [2f + 2krb/H + 2kr + 2krpi − krp] + pbn
              = nb[(2UH + pHn)H/(bkr) + 2 + p + 2H/b].        (14)

Recalling our two conditions, we use k₀r = HR²/[8M(H + P)] for R² < 8MA to obtain:

      ω(n, R) = nb[8M(H + P)(2UH + pHn)/(bR²) + 2 + p + 2H/b].        (15)
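   As a quick way to convince oneself that the closed form in (14) matches the per-iteration accounting, the short sketch below sums the per-iteration overheads directly and compares the total against the right-hand side of (14). It is only an illustration: f = UH is our assumption for the initial URLseen size, and the parameter values are arbitrary toy numbers chosen so that nH/(kr) is an integer.

    # Numerical check of (14): direct summation of per-iteration overheads
    # versus the closed form. All constants below are toy placeholder values.
    def omega_direct(n, k, r, H, U, p, b):
        f = U * H                        # assumed initial size of URLseen
        iters = n * H // (k * r)         # number of bucket-fill iterations (chosen integral)
        total = sum(2 * f + 2 * k * r * b / H + 2 * k * r + 2 * k * r * p * i - k * r * p
                    for i in range(1, iters + 1))
        return total + p * b * n

    def omega_closed(n, k, r, H, U, p, b):
        return n * b * ((2 * U * H + p * H * n) * H / (b * k * r) + 2 + p + 2 * H / b)

    args = dict(n=8 * 10**7, k=64, r=10**6, H=8, U=10**6, p=1 / 9, b=110)
    print(omega_direct(**args), omega_closed(**args))   # the two totals agree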
                              TABLE II
          OVERHEAD α(n, R) FOR R = 1 GB, D = 4.39 TB

             N        Mercator-B      Polybot      DRUM
             800M        11.6             69        2.26
             8B            93            663        2.35
             80B          917          6,610        3.3
             800B       9,156         66,082       12.5
             8T        91,541        660,802        104

                              TABLE III
                  OVERHEAD α(n, R) FOR D = D(n)

             N           R = 4 GB                R = 8 GB
                    Mercator-B    DRUM      Mercator-B    DRUM
             800M       4.48       2.30        3.29        2.30
             8B           25        2.7        13.5         2.7
             80B         231        3.3         116         3.3
             800B      2,290        3.3       1,146         3.3
             8T       22,887        8.1      11,444         3.7

   For the other case R² ≥ 8MA, we have k₁r = DH/(H + b) and thus get:

      ω(n, R) = nb[(H + b)(2UH + pHn)/(bD) + 2 + p + 2H/b],        (16)

which leads to the statement of the theorem.
   It follows from the proof of Theorem 2 that in order to match D to a given RAM size R and avoid unnecessary allocation of disk space, one should operate at the optimal point given by R² = Λ:

      Dopt = R²(H + b)/[8M(H + P)].        (17)

   For example, R = 1 GB produces Dopt = 4.39 TB and R = 2 GB produces Dopt = 17 TB. For D = Dopt, the corresponding number of buckets is kopt = R/(4M), the size of the bucket buffer is ropt = RH/[2(H + P)] ≈ 0.33R, and the leading quadratic term of ω(n, R) in (7) is now R/(4M) times smaller than in Mercator-B. This ratio is 1,000 for R = 1 GB and 8,000 for R = 8 GB. The asymptotic speed-up in either case is significant.
   Finally, observe that the best possible performance of any method that stores both hashes and URLs on disk is αmin = 2 + p + 2H/b.
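   For readers who want to experiment with these expressions, the sketch below evaluates α(n, R) from (7) and Dopt from (17). It is only an illustration of the model: H, P, M, U, p and b keep the meanings defined earlier in the paper, and the constants in the example call are assumed placeholders rather than the exact values behind Table II, so its output will not match the quoted 4.39 TB exactly.

    # Sketch of the DRUM overhead model (7) and the optimal disk size (17).
    # All constants are passed explicitly; the example values are assumptions.
    def alpha(n, R, D, H, P, M, U, p, b):
        lam = 8.0 * M * D * (H + P) / (H + b)          # threshold Lambda
        if R * R < lam:
            return 8.0 * M * (H + P) * (2 * U * H + p * H * n) / (b * R * R) + 2 + p + 2.0 * H / b
        return (H + b) * (2 * U * H + p * H * n) / (b * D) + 2 + p + 2.0 * H / b

    def d_opt(R, H, P, M, b):
        return R * R * (H + b) / (8.0 * M * (H + P))   # point where R^2 = Lambda

    # Hypothetical constants for illustration only (8-byte hashes, 4-byte pointers,
    # 256-KB file buffers, 110-byte URLs):
    print(d_opt(R=2**30, H=8, P=4, M=256 * 1024, b=110))   # optimal disk size in bytes

Choosing D well below d_opt pushes (7) into its quadratic branch, which is exactly the regime the discussion above recommends avoiding.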
E. Comparison

   We next compare disk performance of the studied methods when non-quadratic terms in ω(n, R) are non-negligible. Table II shows α(n, R) of the three studied methods for fixed RAM size R and disk D as N increases from 800 million to 8 trillion (p = 1/9, U = 100M pages, b = 110 bytes, l = 59 links/page). As N reaches into the trillions, both Mercator-B and Polybot exhibit overhead that is thousands of times larger than the optimal and invariably become "bogged down" in re-writing URLseen. On the other hand, DRUM stays within a factor of 50 from the best theoretically possible value (i.e., αmin = 2.256) and does not sacrifice nearly as much performance as the other two methods.
   Since disk size D is likely to be scaled with N in order to support the newly downloaded pages, we assume for the next example that D(n) is the maximum of 1 TB and the size of unique hashes appended to URLseen during the crawl of N pages, i.e., D(n) = max(pHn, 10¹²). Table III shows how dynamically scaling disk size allows DRUM to keep the overhead virtually constant as N increases.
   To compute the average crawling rate that the above methods support, assume that W is the average disk I/O speed and consider the next result.
   Theorem 3: Maximum download rate (in pages/s) supported by the disk portion of URL uniqueness checks is:

      Sdisk = W/[α(n, R)bl].        (18)

   Proof: The time needed to perform uniqueness checks for n new URLs is spent in disk I/O involving ω(n, R) = α(n, R)bn = α(n, R)blN bytes. Assuming that W is the average disk I/O speed, it takes N/S seconds to generate n new URLs and ω(n, R)/W seconds to check their uniqueness. Equating the two entities, we have (18).
   We use IRLbot's parameters to illustrate the applicability of this theorem. Neglecting the process of appending new URLs to the queue, the crawler's read and write overhead is symmetric. Then, assuming IRLbot's 1-GB/s read speed and 350-MB/s write speed (24-disk RAID-5), we obtain that its average disk read-write speed is equal to 675 MB/s. Allocating 15% of this rate for checking URL uniqueness (footnote 4), the effective disk bandwidth of the server can be estimated at W = 101.25 MB/s. Given the conditions of Table III for R = 8 GB and assuming N = 8 trillion pages, DRUM yields a sustained download rate of Sdisk = 4,192 pages/s (i.e., 711 mb/s using IRLbot's average HTML page size of 21.2 KB). In crawls of the same scale, Mercator-B would be 3,075 times slower and would admit an average rate of only 1.4 pages/s. Since with these parameters Polybot is 7.2 times slower than Mercator-B, its average crawling speed would be 0.2 pages/s.
   Footnote 4: Additional disk I/O is needed to verify robots.txt, perform reputation analysis, and enforce budgets.
   With just 10 DRUM servers and a 10-gb/s Internet link, one can create a search engine with a download capacity of 100 billion pages per month and scalability to trillions of pages.
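   As a quick arithmetic check of (18) with the numbers just quoted, the sketch below recomputes the disk-limited rate for the DRUM configuration; α = 3.7 is the rounded Table III entry, so the result lands near, but not exactly on, the 4,192 pages/s cited above.

    # Disk-limited crawling rate from (18): S_disk = W / (alpha * b * l).
    W = 101.25e6        # effective disk bandwidth in bytes/s (15% of 675 MB/s)
    alpha = 3.7         # rounded DRUM overhead for R = 8 GB, N = 8T (Table III)
    b, l = 110, 59      # average URL size in bytes and links per page
    print(W / (alpha * b * l), "pages/s")    # roughly 4,200 pages/s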
                              V. CACHING

   To understand whether caching provides improved performance, one must consider a complex interplay between the available CPU capacity, spare RAM size, disk speed, performance of the caching algorithm, and crawling rate. This is a three-stage process – we first examine how cache size and crawl speed affect the hit rate, then analyze the CPU restrictions of caching, and finally couple them with RAM/disk limitations using analysis in the previous section.
                              TABLE IV
          LRU HIT RATES STARTING AT N0 CRAWLED PAGES

             Cache elements E     N0 = 1B     N0 = 4B
                  256K               19%         16%
                    4M               26%         22%
                    8M               68%         59%
                   16M               71%         67%
                   64M               73%         73%
                  512M               80%         78%

                              TABLE V
      INSERTION RATE AND MAXIMUM CRAWLING SPEED FROM (21)

             Method                          µ(c) URLs/s     Size      Scache
             Balanced tree (strings)             113K       2.2 GB     1,295
             Tree-based LRU (8-byte int)         185K       1.6 GB     1,757
             Balanced tree (8-byte int)          416K       768 MB     2,552
             CLOCK (8-byte int)                    2M       320 MB     3,577
A. Cache Hit Rate

   Assume that c bytes of RAM are available to a cache whose entries incur fixed overhead γ bytes. Then E = c/γ is the maximum number of elements stored in the cache at any time. Next, define π(c, S) to be the cache miss rate under crawling speed S pages/s and cache size c. The reason why π depends on S is that the faster the crawl, the more pages it produces between visits to the same site, which is where duplicate links are most prevalent. Defining τh to be the per-host visit delay, common sense suggests that π(c, S) should depend not only on c, but also on τh lS.
   Table IV shows LRU cache hit rates 1 − π(c, S) during several stages of our crawl. We seek in the trace file to the point where the crawler has downloaded N0 pages and then simulate LRU hit rates by passing the next 10E URLs discovered by the crawler through the cache. As the table shows, a significant jump in hit rates happens between 4M and 8M entries. This is consistent with IRLbot's peak value of τh lS ≈ 7.3 million. Note that before cache size reaches this value, most hits in the cache stem from redundant links within the same page. As E starts to exceed τh lS, popular URLs on each site survive between repeat visits and continue staying in the cache as long as the corresponding site is being crawled. Additional simulations confirming this effect are omitted for brevity.
   Unlike [5], which suggests that E be set 100 − 500 times larger than the number of threads, our results show that E must be slightly larger than τh lS to achieve a 60% hit rate and as high as 10τh lS to achieve 73%.
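   The hit rates in Table IV can be approximated with a small trace-driven simulation in the spirit of the procedure just described. The sketch below assumes a plain LRU cache over 64-bit URL hashes and an iterable of hashes taken from a crawl trace; it is meant only as an illustration, not as the measurement code used to produce the table.

    # Trace-driven LRU hit-rate simulation: replay the URL hashes discovered
    # after the first N0 downloads through a cache of E entries and count hits.
    from collections import OrderedDict

    def lru_hit_rate(url_hashes, E):
        cache = OrderedDict()              # hash -> None, ordered by recency
        hits = total = 0
        for h in url_hashes:
            total += 1
            if h in cache:
                hits += 1
                cache.move_to_end(h)       # refresh recency on a hit
            else:
                cache[h] = None
                if len(cache) > E:
                    cache.popitem(last=False)   # evict the least recently used entry
        return hits / total if total else 0.0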
B. Cache Speed

   Another aspect of keeping a RAM cache is the speed at which potentially large memory structures must be checked and updated as new URLs keep pouring in. Since searching large trees in RAM usually results in misses in the CPU cache, some of these algorithms can become very slow as the depth of the search increases. Define 0 ≤ φ(S) ≤ 1 to be the average CPU utilization of the server crawling at S pages/s and µ(c) to be the number of URLs/s that a cache of size c can process on an unloaded server. Then, we have the following result.
   Theorem 4: Assuming φ(S) is monotonically non-decreasing, the maximum download rate Scache (in pages/s) supported by URL caching is:

      Scache(c) = g⁻¹(µ(c)),        (19)

where g⁻¹ is the inverse of g(x) = lx/(1 − φ(x)).
   Proof: We assume that caching performance linearly depends on the available CPU capacity, i.e., if fraction φ(S) of the CPU is allocated to crawling, then the caching speed is µ(c)(1 − φ(S)) URLs/s. Then, the maximum crawling speed would match the rate of URL production to that of the cache, i.e.,

      lS = µ(c)(1 − φ(S)).        (20)

Re-writing (20) using g(x) = lx/(1 − φ(x)), we have g(S) = µ(c), which has a unique solution S = g⁻¹(µ(c)) since g(x) is a strictly increasing function with a proper inverse.
   For the common case φ(S) = S/Smax, where Smax is the server's maximum (i.e., CPU-limited) crawling rate in pages/s, (19) yields a very simple expression:

      Scache(c) = µ(c)Smax/[lSmax + µ(c)].        (21)

   To show how to use the above result, Table V compares the speed of several memory structures on the IRLbot server using E = 16M elements and displays model (21) for Smax = 4,000 pages/s. As can be seen in the table, insertion of text URLs into a balanced tree (used in Polybot [27]) is the slowest operation that also consumes the most memory. The speed of classical LRU caching (185K/s) and search-trees with 8-byte keys (416K/s) is only slightly better since both use multiple (i.e., log₂ E) jumps through memory. CLOCK [5], which is a space and time optimized approximation to LRU, achieves a much better speed (2M/s), requires less RAM, and is suitable for crawling rates up to 3,577 pages/s on this server.
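   To make the comparison concrete, a minimal CLOCK-style structure over 8-byte hashes could look like the sketch below. This is our own illustration of the general CLOCK idea (one reference bit per slot and a sweeping hand), not IRLbot's implementation; a space-optimized version would replace the Python dictionary with open addressing. Plugging the measured insertion rates into (21), for example µ(c) = 2M URLs/s with Smax = 4,000 and l = 59, essentially reproduces the 3,577 pages/s entry in Table V.

    # Minimal CLOCK cache over 64-bit URL hashes: fixed slots, one reference
    # bit per slot, and a "hand" that sweeps for an eviction victim.
    # Callers are expected to invoke contains() before insert().
    class ClockCache:
        def __init__(self, capacity):
            self.keys = [None] * capacity
            self.ref = [0] * capacity
            self.pos = {}                  # hash -> slot (a real version avoids this dict)
            self.hand = 0

        def contains(self, h):
            i = self.pos.get(h)
            if i is None:
                return False
            self.ref[i] = 1                # mark as recently seen
            return True

        def insert(self, h):
            while self.ref[self.hand]:     # sweep, clearing reference bits
                self.ref[self.hand] = 0
                self.hand = (self.hand + 1) % len(self.keys)
            victim = self.keys[self.hand]
            if victim is not None:
                del self.pos[victim]       # evict the victim found by the hand
            self.keys[self.hand] = h
            self.pos[h] = self.hand
            self.hand = (self.hand + 1) % len(self.keys)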
   After experimentally determining µ(c) and φ(S), one can easily compute Scache from (19); however, this metric by itself does not determine whether caching should be enabled or even how to select the optimal cache size c. Even though caching reduces the disk overhead by sending πn rather than n URLs to be checked against the disk, it also consumes more memory and leaves less space for the buffer of URLs in RAM, which in turn results in more scans through disk to determine URL uniqueness. Understanding this tradeoff involves careful modeling of hybrid RAM-disk algorithms, which we perform next.

C. Hybrid Performance

   We now address the issue of how to assess the performance of disk-based methods with RAM caching. Mercator-A improves performance by a factor of 1/π since only πn URLs are sought from disk. Given common values of π ∈ [0.25, 0.35] in Table IV, this optimization results in a 2.8 − 4 times speed-up, which is clearly insufficient for making this method competitive with the other approaches.
   Mercator-B, Polybot, and DRUM all exhibit new overhead α(π(c, S)n, R − c)bπ(c, S)n with α(n, R) taken from the appropriate model. As n → ∞ and assuming c ≪ R, all three methods decrease ω by a factor of π⁻² ∈ [8, 16] for π ∈ [0.25, 0.35].
                              TABLE VI
    OVERHEAD α(πn, R − c) FOR D = D(n), π = 0.33, c = 320 MB

             N           R = 4 GB                R = 8 GB
                    Mercator-B    DRUM      Mercator-B    DRUM
             800M       3.02       2.27        2.54        2.27
             8B         10.4        2.4         6.1         2.4
             80B          84        3.3          41         3.3
             800B        823        3.3         395         3.3
             8T        8,211        4.5       3,935         3.3

                              TABLE VII
   MAXIMUM HYBRID CRAWLING RATE max_c Shybrid(c) FOR D = D(n)

             N           R = 4 GB                R = 8 GB
                    Mercator-B    DRUM      Mercator-B    DRUM
             800M     18,051     26,433      23,242      26,433
             8B        6,438     25,261      10,742      25,261
             80B       1,165     18,023       2,262      18,023
             800B        136     18,023         274      18,023
             8T         13.9     11,641        27.9      18,023

For n ≪ ∞, however, only the linear factor bπ(c, S)n enjoys an immediate reduction, while α(π(c, S)n, R − c) may or may not change depending on the dominance of the first term in (1), (2), and (7), as well as the effect of reduced RAM size R − c on the overhead. Table VI shows one example where c = 320 MB (E = 16M elements, γ = 20 bytes/element, π = 0.33) occupies only a small fraction of R. Notice in the table that caching can make Mercator-B's disk overhead close to optimal for small N, which nevertheless does not change its scaling performance as N → ∞.
   Since π(c, S) depends on S, determining the maximum speed a hybrid approach supports is no longer straightforward.
   Theorem 5: Assuming π(c, S) is monotonically non-decreasing in S, the maximum download rate Shybrid supported by disk algorithms with RAM caching is:

      Shybrid(c) = h⁻¹(W/(bl)),        (22)

where h⁻¹ is the inverse of h(x) = xα(π(c, x)n, R − c)π(c, x).
   Proof: From (18), we have:

      S = W/[α(π(c, S)n, R − c)bπ(c, S)l],        (23)

which can be written as h(S) = W/(bl). The solution to this equation is S = h⁻¹(W/(bl)), where as before the inverse h⁻¹ exists due to the strict monotonicity of h(x).
   To better understand (22), we show an example of finding the best cache size c that maximizes Shybrid(c) assuming π(c, S) is a step function of hit rates derived from Table IV. Specifically, π(c, S) = 1 if c = 0, π(c, S) = 0.84 if 0 < c < γτh lS, 0.41 if c < 4γτh lS, 0.27 if c < 10γτh lS, and 0.22 for larger c. Table VII shows the resulting crawling speed in pages/s after maximizing (22) with respect to c. As before, Mercator-B is close to optimal for small N and large R, but for N → ∞ its performance degrades. DRUM, on the other hand, maintains at least 11,000 pages/s over the entire range of N. Since these examples use large R in comparison to the cache size needed to achieve non-trivial hit rates, the values in this table are almost inversely proportional to those in Table VI, which can be used to ballpark the maximum value of (22) without inverting h(x).
   Knowing function Shybrid from (22), one needs to couple it with the performance of the caching algorithm to obtain the true optimal value of c:

      copt = arg max_{c ∈ [0, R]} min(Scache(c), Shybrid(c)),        (24)

which is illustrated in Figure 3. On the left of the figure, we plot some hypothetical functions Scache(c) and Shybrid(c) for c ∈ [0, R]. Assuming that µ(0) = ∞, the former curve always starts at Scache(0) = Smax and is monotonically non-increasing. For π(0, S) = 1, the latter function starts at Shybrid(0) = Sdisk and tends to zero as c → R, but not necessarily monotonically. On the right of the figure, we show the supported crawling rate min(Scache(c), Shybrid(c)) whose maximum point corresponds to the pair (copt, Sopt). If Sopt > Sdisk, then caching should be enabled with c = copt; otherwise, it should be disabled. The most common case when the crawler benefits from disabling the cache is when R is small compared to γτh lS or the CPU is the bottleneck (i.e., Scache < Sdisk).

   [Fig. 3. Finding optimal cache size copt and optimal crawling speed Sopt. (Left: hypothetical curves Scache(c) and Shybrid(c) over c ∈ [0, R]; right: their minimum, whose peak gives copt and Sopt.)]
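   Carrying out (24) numerically is straightforward once Scache(c) and Shybrid(c) can be evaluated; the sketch below performs a simple grid search over c, mirroring the procedure illustrated in Figure 3. The two function arguments are assumed to implement (21) and (22) for whatever configuration is being studied.

    # Grid search for (24): choose c in [0, R] maximizing min(S_cache(c), S_hybrid(c)),
    # and enable caching only if the result beats the no-cache rate S_disk.
    def choose_cache_size(S_cache_fn, S_hybrid_fn, S_disk, R, steps=1000):
        best_c, best_rate = 0.0, S_disk
        for i in range(steps + 1):
            c = R * i / steps
            rate = min(S_cache_fn(c), S_hybrid_fn(c))
            if rate > best_rate:
                best_c, best_rate = c, rate
        # c = 0 (cache disabled) is returned when no cache size improves on S_disk
        return (best_c, best_rate) if best_rate > S_disk else (0.0, S_disk)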
                        VI. SPAM AND REPUTATION

   This section explains the necessity for detecting spam during crawls and proposes a simple technique for computing domain reputation in real-time.

A. Problems with BFS

   Prior crawlers [7], [14], [23], [27] have no documented spam-avoidance algorithms and are typically assumed to perform BFS traversals of the web graph. Several studies [1], [3] have examined in simulations the effect of changing crawl order by applying bias towards more popular pages. The conclusions are mixed and show that PageRank order [4] can be sometimes marginally better than BFS [1] and sometimes marginally worse [3], where the metric by which they are compared is the rate at which the crawler discovers popular pages.
   While BFS works well in simulations, its performance on infinite graphs and/or in the presence of spam farms remains unknown. Our early experiments show that crawlers eventually encounter a quickly branching site that will start to dominate the queue after 3 − 4 levels in the BFS tree.



Some of these sites are spam-related with the aim of inflating the page rank of target hosts, while others are created by regular users sometimes for legitimate purposes (e.g., calendars, testing of asp/php engines), sometimes for questionable purposes (e.g., intentional trapping of unwanted robots), and sometimes for no apparent reason at all. What makes these pages similar is the seemingly infinite number of dynamically generated pages and/or hosts within a given domain. Crawling these massive webs or performing DNS lookups on millions of hosts from a given domain not only places a significant burden on the crawler, but also wastes bandwidth on downloading largely useless content.
   Simply restricting the branching factor or the maximum number of pages/hosts per domain is not a viable solution since there is a number of legitimate sites that contain over a hundred million pages and over a dozen million virtual hosts (i.e., various blog sites, hosting services, directories, and forums). For example, Yahoo currently reports indexing 1.2 billion objects just within its own domain and blogspot claims over 50 million users, each with a unique hostname. Therefore, differentiating between legitimate and illegitimate web "monsters" becomes a fundamental task of any crawler.
   Note that this task does not entail assigning popularity to each potential page as would be the case when returning query results to a user; instead, the crawler needs to decide whether a given domain or host should be allowed to massively branch or not. Indeed, spam-sites and various auto-generated webs with a handful of pages are not a problem as they can be downloaded with very little effort and later classified by data-miners using PageRank or some other appropriate algorithm. The problem only occurs when the crawler assigns to domain x download bandwidth that is disproportionate to the value of x's content.
   Another aspect of spam classification is that it must be performed with very little CPU/RAM/disk effort and run in real-time at speed SL links per second, where L is the number of unique URLs per page.

B. Controlling Massive Sites

   Before we introduce our algorithm, several definitions are in order. Both host and site refer to Fully Qualified Domain Names (FQDNs) on which valid pages reside (e.g., motors.ebay.com). A server is a physical host that accepts TCP connections and communicates content to the crawler. Note that multiple hosts may be co-located on the same server. A top-level domain (TLD) or a country-code TLD (cc-TLD) is a domain one level below the root in the DNS tree (e.g., .com, .net, .uk). A pay-level domain (PLD) is any domain that requires payment at a TLD or cc-TLD registrar. PLDs are usually one level below the corresponding TLD (e.g., amazon.com), with certain exceptions for cc-TLDs (e.g., ebay.co.uk, det.wa.edu.au). We use a comprehensive list of custom rules for identifying PLDs, which have been compiled as part of our ongoing DNS project.
   While computing PageRank [19], BlockRank [18], or SiteRank [10], [30] is a potential solution to the spam problem, these methods become extremely disk intensive in large-scale applications (e.g., 41 billion pages and 641 million hosts found in our crawl) and arguably with enough effort can be manipulated [12] by huge link farms (i.e., millions of pages and sites pointing to a target spam page). In fact, strict page-level rank is not absolutely necessary for controlling massively branching spam. Instead, we found that spam could be "deterred" by budgeting the number of allowed pages per PLD based on domain reputation, which we determine by domain in-degree from resources that spammers must pay for. There are two options for these resources – PLDs and IP addresses. We chose the former since classification based on IPs (first suggested in Lycos [21]) has proven less effective since large subnets inside link farms could be given unnecessarily high priority and multiple independent sites co-hosted on the same IP were improperly discounted.
   While it is possible to classify each site and even each subdirectory based on their PLD in-degree, our current implementation uses a coarse-granular approach of only limiting spam at the PLD level. Each PLD x starts with a default budget B0, which is dynamically adjusted using some function F(dx) as x's in-degree dx changes. Budget Bx represents the number of pages that are allowed to pass from x (including all hosts and subdomains in x) to crawling threads every T time units.

   [Fig. 4. Operation of STAR. (Diagram: crawling threads send new URLs to DRUM URLseen (check + update) and PLD links to DRUM PLDindegree (update); unique URLs are checked against PLDindegree, receive budgets, and pass via the robots-check queue QR to BEAST budget enforcement.)]

   Figure 4 shows how our system, which we call Spam Tracking and Avoidance through Reputation (STAR), is organized. In the figure, crawling threads aggregate PLD-PLD link information and send it to a DRUM structure PLDindegree, which uses a batch update to store for each PLD x its hash hx, in-degree dx, current budget Bx, and hashes of all in-degree neighbors in the PLD graph. Unique URLs arriving from URLseen perform a batch check against PLDindegree, and are given Bx on their way to BEAST, which we discuss in the next section.
   Note that by varying the budget function F(dx), one can implement a number of policies – crawling of only popular pages (i.e., zero budget for low-ranked domains and maximum budget for high-ranked domains), equal distribution between all domains (i.e., budget Bx = B0 for all x), and crawling with a bias toward popular/unpopular pages (i.e., budget directly/inversely proportional to the PLD in-degree).

                     VII. POLITENESS AND BUDGETS

   This section discusses how to enable polite crawler operation and scalably enforce budgets.
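   Before the budget-enforcement details, one way to picture the budget function F(dx) from the previous section is the toy sketch below. The logarithmic shape, the default budget B0 and the cap are hypothetical placeholders (the text does not fix a particular functional form), but the code shows how the different policies map onto choices of F.

    import math

    # Hypothetical budget function F(d_x): default budget B0 for low in-degree PLDs,
    # logarithmic growth with in-degree d_x, capped at B_MAX. The constants are
    # illustrative placeholders, not IRLbot's actual parameters.
    B0, B_MAX = 32, 100000

    def budget(d_x):
        if d_x <= 1:
            return B0
        return min(B_MAX, int(B0 * math.log2(d_x)))

    # Returning B0 for every PLD gives the "equal distribution" policy, while
    # returning 0 below some in-degree threshold crawls only popular PLDs.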
2008_IRLbot
2008_IRLbot
2008_IRLbot
2008_IRLbot

More Related Content

Similar to 2008_IRLbot

Web Crawling Using Location Aware Technique
Web Crawling Using Location Aware TechniqueWeb Crawling Using Location Aware Technique
Web Crawling Using Location Aware Techniqueijsrd.com
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkScrapinghub
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the webVan-Duyet Le
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexingKhwaja Aamer
 
How can we develop an ideal website.pptx
How can we develop an ideal website.pptxHow can we develop an ideal website.pptx
How can we develop an ideal website.pptxPradeepK199981
 
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...Rana Jayant
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Geographic Distribution for Global Web Application Performance
Geographic Distribution for Global Web Application PerformanceGeographic Distribution for Global Web Application Performance
Geographic Distribution for Global Web Application Performancekkjjkevin03
 
Research on Key Technology of Web Reptile
Research on Key Technology of Web ReptileResearch on Key Technology of Web Reptile
Research on Key Technology of Web ReptileIRJESJOURNAL
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...Rana Jayant
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawlerRishikesh Pathak
 

Similar to 2008_IRLbot (20)

Web crawler
Web crawlerWeb crawler
Web crawler
 
Web Crawling Using Location Aware Technique
Web Crawling Using Location Aware TechniqueWeb Crawling Using Location Aware Technique
Web Crawling Using Location Aware Technique
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexing
 
Spring Into the Cloud
Spring Into the CloudSpring Into the Cloud
Spring Into the Cloud
 
How can we develop an ideal website.pptx
How can we develop an ideal website.pptxHow can we develop an ideal website.pptx
How can we develop an ideal website.pptx
 
I03
I03I03
I03
 
Seminar on crawler
Seminar on crawlerSeminar on crawler
Seminar on crawler
 
As25266269
As25266269As25266269
As25266269
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Webcrawler
Webcrawler Webcrawler
Webcrawler
 
Geographic Distribution for Global Web Application Performance
Geographic Distribution for Global Web Application PerformanceGeographic Distribution for Global Web Application Performance
Geographic Distribution for Global Web Application Performance
 
Research on Key Technology of Web Reptile
Research on Key Technology of Web ReptileResearch on Key Technology of Web Reptile
Research on Key Technology of Web Reptile
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawler
 

More from Hiroshi Ono

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipediaHiroshi Ono
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説Hiroshi Ono
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitectureHiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdfHiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdfHiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfHiroshi Ono
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfHiroshi Ono
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdfHiroshi Ono
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdfHiroshi Ono
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdfHiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfHiroshi Ono
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdfHiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdfHiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdfHiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfHiroshi Ono
 

More from Hiroshi Ono (20)

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipedia
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitecture
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdf
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdf
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdf
 
camel-scala.pdf
camel-scala.pdfcamel-scala.pdf
camel-scala.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdf
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 

Recently uploaded

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Recently uploaded (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

2008_IRLbot

I. INTRODUCTION

Over the last decade, the World Wide Web (WWW) has evolved from a handful of pages to billions of diverse objects. In order to harvest this enormous data repository, search engines download parts of the existing web and offer Internet users access to this database through keyword search. Search engines consist of two fundamental components – web crawlers, which find, download, and parse content in the WWW, and data miners, which extract keywords from pages, rank document importance, and answer user queries. This paper does not deal with data miners, but instead focuses on the design of web crawlers that can scale to the size of the current¹ and future web, while implementing consistent per-website and per-server rate-limiting policies and avoiding being trapped in spam farms and infinite webs. We next discuss our assumptions and explain why this is a challenging issue.

¹ Adding the size of all top-level domains using site queries (e.g., "site:.com"), Google's index size in January 2008 can be estimated at 30 billion pages and Yahoo's at 37 billion.

(All authors are with the Department of Computer Science, Texas A&M University, College Station, TX 77843 USA; email: {h0l9314, dleonard, xmwang, dmitri}@cs.tamu.edu.)

A. Scalability

With the constant growth of the web, discovery of user-created content by web crawlers faces an inherent tradeoff between scalability, performance, and resource usage. The first term refers to the number of pages N a crawler can handle without becoming "bogged down" by the various algorithms and data structures needed to support the crawl. The second term refers to the speed S at which the crawler discovers the web as a function of the number of pages already crawled.

B. Reputation and Spam

The web has changed significantly since the days of early crawlers [4], [23], [25], mostly in the area of dynamically generated pages and web spam. With server-side scripts that can create infinite loops, high-density link farms, and unlimited number of hostnames, the task of web crawling has changed from simply doing a BFS scan of the WWW [24] to deciding in real time which sites contain useful information and giving them higher priority as the crawl progresses.

Our experience shows that BFS eventually becomes trapped in useless content, which manifests itself in multiple ways: a) the queue of pending URLs contains a non-negligible fraction of links from spam sites that threaten to overtake legitimate URLs due to their high branching factor; b) the DNS resolver succumbs to the rate at which new hostnames are dynamically created within a single domain; and c) the crawler becomes vulnerable to the delay attack from sites that purposely introduce HTTP and DNS delays in all requests originating from the crawler's IP address.

No prior research crawler has attempted to avoid spam or document its impact on the collected data. Thus, designing low-overhead and robust algorithms for computing site reputation during the crawl is the second open problem that we aim to address in this work.
C. Politeness

Even today, webmasters become easily annoyed when web crawlers slow down their servers, consume too much Internet bandwidth, or simply visit pages with "too much" frequency. This leads to undesirable consequences including blocking of the crawler from accessing the site in question, various complaints to the ISP hosting the crawler, and even threats of legal action. Incorporating per-website and per-IP hit limits into a crawler is easy; however, preventing the crawler from "choking" when its entire RAM gets filled up with URLs pending for a small set of hosts is much more challenging. When N grows into the billions, the crawler ultimately becomes bottlenecked by its own politeness and is then faced with a decision to suffer significant slowdown, ignore politeness considerations for certain URLs (at the risk of crashing target servers or wasting valuable bandwidth on huge spam farms), or discard a large fraction of backlogged URLs, none of which is particularly appealing.

While related work [2], [7], [14], [23], [27] has proposed several algorithms for rate-limiting host access, none of these studies have addressed the possibility that a crawler may stall due to its politeness restrictions or discussed management of rate-limited URLs that do not fit into RAM. This is the third open problem that we aim to solve in this paper.

D. Our Contributions

The first part of the paper presents a set of web-crawler algorithms that address the issues raised above and the second part briefly examines their performance in an actual web crawl.² Our design stems from three years of web crawling experience at Texas A&M University using an implementation we call IRLbot [17] and the various challenges posed in simultaneously: 1) sustaining a fixed crawling rate of several thousand pages/s; 2) downloading billions of pages; and 3) operating with the resources of a single server.

² A separate paper will present a much more detailed analysis of the collected data.

The first performance bottleneck we faced was caused by the complexity of verifying uniqueness of URLs and their compliance with robots.txt. As N scales into many billions, even the disk algorithms of [23], [27] no longer keep up with the rate at which new URLs are produced by our crawler (i.e., up to 184K per second). To understand this problem, we analyze the URL-check methods proposed in the literature and show that all of them exhibit severe performance limitations when N becomes sufficiently large. We then introduce a new technique called Disk Repository with Update Management (DRUM) that can store large volumes of arbitrary hashed data on disk and implement very fast check, update, and check+update operations using bucket sort. We model the various approaches and show that DRUM's overhead remains close to the best theoretically possible as N reaches into the trillions of pages and that for common disk and RAM size, DRUM can be thousands of times faster than prior disk-based methods.

The second bottleneck we faced was created by multi-million-page sites (both spam and legitimate), which became backlogged in politeness rate-limiting to the point of overflowing the RAM. This problem was impossible to overcome unless politeness was tightly coupled with site reputation. In order to determine the legitimacy of a given domain, we use a very simple algorithm based on the number of incoming links from assets that spammers cannot grow to infinity. Our algorithm, which we call Spam Tracking and Avoidance through Reputation (STAR), dynamically allocates the budget of allowable pages for each domain and all of its subdomains in proportion to the number of in-degree links from other domains. This computation can be done in real time with little overhead using DRUM even for millions of domains in the Internet. Once the budgets are known, the rates at which pages can be downloaded from each domain are scaled proportionally to the corresponding budget.

The final issue we faced in later stages of the crawl was how to prevent live-locks in processing URLs that exceed their budget. Periodically re-scanning the queue of over-budget URLs produces only a handful of good links at the cost of huge overhead. As N becomes large, the crawler ends up spending all of its time cycling through failed URLs and makes very little progress. The solution to this problem, which we call Budget Enforcement with Anti-Spam Tactics (BEAST), involves a dynamically increasing number of disk queues among which the crawler spreads the URLs based on whether they fit within the budget or not. As a result, almost all pages from sites that significantly exceed their budgets are pushed into the last queue and are examined with lower frequency as N increases. This keeps the overhead of reading spam at some fixed level and effectively prevents it from "snowballing."

The above algorithms were deployed in IRLbot [17] and tested on the Internet in June-August 2007 using a single server attached to a 1 gb/s backbone of Texas A&M. Over a period of 41 days, IRLbot issued 7,606,109,371 connection requests, received 7,437,281,300 HTTP responses from 117,576,295 hosts, and successfully downloaded N = 6,380,051,942 unique HTML pages at an average rate of 319 mb/s (1,789 pages/s). After handicapping quickly branching spam and over 30 million low-ranked domains, IRLbot parsed out 394,619,023,142 links and found 41,502,195,631 unique pages residing on 641,982,061 hosts, which explains our interest in crawlers that scale to tens and hundreds of billions of pages as we believe a good fraction of 35B URLs not crawled in this experiment contains useful content.

The rest of the paper is organized as follows. Section II overviews related work. Section III defines our objectives and classifies existing approaches. Section IV discusses how checking URL uniqueness scales with crawl size and proposes our technique. Section V models caching and studies its relationship with disk overhead. Section VI discusses our approach to ranking domains and Section VII introduces a scalable method of enforcing budgets. Section VIII summarizes our experimental statistics and Section IX concludes the paper.

II. RELATED WORK

There is only a limited number of papers describing detailed web-crawler algorithms and offering their experimental performance. First-generation designs [9], [22], [25], [26] were developed to crawl the infant web and commonly reported collecting less than 100,000 pages. Second-generation crawlers [2], [7], [15], [14], [23], [27] often pulled several hundred million pages and involved multiple agents in the crawling process. We discuss their design and scalability issues in the next section.
Another direction was undertaken by the Internet Archive [6], [16], which maintains a history of the Internet by downloading the same set of pages over and over. In the last 10 years, this database has collected over 85 billion pages, but only a small fraction of them are unique. Additional crawlers are [4], [8], [13], [20], [28], [29]; however, their focus usually does not include the large scale assumed in this paper and their fundamental crawling algorithms are not presented in sufficient detail to be analyzed here.

The largest prior crawl using a fully-disclosed implementation appeared in [23], where Mercator obtained N = 473 million HTML pages in 17 days (we exclude non-HTML content since it has no effect on scalability). The fastest reported crawler was [13] with 816 pages/s, but the scope of their experiment was only N = 25 million. Finally, to our knowledge, the largest webgraph used in any paper was AltaVista's 2003 crawl with 1.4B pages and 6.6B links [11].

III. OBJECTIVES AND CLASSIFICATION

This section formalizes the purpose of web crawling and classifies algorithms in related work, some of which we study later in the paper.

A. Crawler Objectives

We assume that the ideal task of a crawler is to start from a set of seed URLs Ω0 and eventually crawl the set of all pages Ω∞ that can be discovered from Ω0 using HTML links. The crawler is allowed to dynamically change the order in which URLs are downloaded in order to achieve a reasonably good coverage of "useful" pages ΩU ⊆ Ω∞ in some finite amount of time. Due to the existence of legitimate sites with hundreds of millions of pages (e.g., ebay.com, yahoo.com, blogspot.com), the crawler cannot make any restricting assumptions on the maximum number of pages per host, the number of hosts per domain, the number of domains in the Internet, or the number of pages in the crawl. We thus classify algorithms as non-scalable if they impose hard limits on any of these metrics or are unable to maintain crawling speed when these parameters become very large.

We should also explain why this paper focuses on the performance of a single server rather than some distributed architecture. If one server can scale to N pages and maintain speed S, then with sufficient bandwidth it follows that m servers can maintain speed mS and scale to mN pages by simply partitioning the subset of all URLs and data structures between themselves (we assume that the bandwidth needed to shuffle the URLs between the servers is also well provisioned). Therefore, the aggregate performance of a server farm is ultimately governed by the characteristics of individual servers and their local limitations. We explore these limits in detail throughout the paper.

B. Crawler Operation

The functionality of a basic web crawler can be broken down into several phases: 1) removal of the next URL u from the queue Q of pending pages; 2) download of u and extraction of new URLs u1, ..., uk from u's HTML tags; 3) for each ui, verification of uniqueness against some structure URLseen and checking compliance with robots.txt using some other structure RobotsCache; 4) addition of passing URLs to Q and URLseen; 5) update of RobotsCache if necessary. The crawler may also maintain its own DNScache structure in cases when the local DNS server is not able to efficiently cope with the load (e.g., its RAM cache does not scale to the number of hosts seen by the crawler or it becomes very slow after caching hundreds of millions of records).

A summary of prior crawls and their methods in managing URLseen, RobotsCache, DNScache, and queue Q is shown in Table I. The table demonstrates that two approaches to storing visited URLs have emerged in the literature: RAM-only and hybrid RAM-disk. In the former case [2], [6], [7], crawlers keep a small subset of hosts in memory and visit them repeatedly until a certain depth or some target number of pages have been downloaded from each site. URLs that do not fit in memory are discarded and sites are assumed to never have more than some fixed volume of pages. This methodology results in truncated web crawls that require different techniques from those studied here and will not be considered in our comparison. In the latter approach [14], [23], [25], [27], URLs are first checked against a buffer of popular links and those not found are examined using a disk file. The RAM buffer may be an LRU cache [14], [23], an array of recently added URLs [14], [23], a general-purpose database with RAM caching [25], and a balanced tree of URLs pending a disk check [27].

TABLE I
COMPARISON OF PRIOR CRAWLERS AND THEIR DATA STRUCTURES

Crawler               Year   Crawl size     URLseen               RobotsCache           DNScache     Q
                             (HTML pages)   RAM         Disk      RAM         Disk
WebCrawler [25]       1994   50K            database              –                     –            database
Internet Archive [6]  1997   –              site-based  –         site-based  –         site-based   RAM
Mercator-A [14]       1999   41M            LRU         seek      LRU         –         –            disk
Mercator-B [23]       2001   473M           LRU         batch     LRU         –         –            disk
Polybot [27]          2001   120M           tree        batch     database              database     disk
WebBase [7]           2001   125M           site-based  –         site-based  –         site-based   RAM
UbiCrawler [2]        2002   45M            site-based  –         site-based  –         site-based   RAM

Most prior approaches keep RobotsCache in RAM and either crawl each host to exhaustion [2], [6], [7] or use an LRU cache in memory [14], [23].
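For concreteness, the five-phase loop above can be written down directly. The sketch below is a single-threaded, RAM-only illustration of phases 1-5; fetch_page, parse_links, and robots_allows are hypothetical helpers supplied by the caller, and the in-memory set and dict merely stand in for the disk-based URLseen and RobotsCache structures studied in the rest of the paper.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, fetch_page, parse_links, robots_allows, max_pages=1000):
    """Minimal illustration of the five crawl phases; all structures are RAM-only."""
    Q = deque(seed_urls)          # queue of pending pages
    url_seen = set(seed_urls)     # URLseen: uniqueness check
    robots_cache = {}             # RobotsCache: host -> robots.txt rules (opaque here)
    pages = 0
    while Q and pages < max_pages:
        u = Q.popleft()                        # phase 1: next URL
        html = fetch_page(u)                   # phase 2: download
        if html is None:
            continue
        pages += 1
        for link in parse_links(u, html):      # phase 2: extract u_1..u_k
            ui = urljoin(u, link)
            host = urlparse(ui).netloc
            if ui in url_seen:                 # phase 3: URLseen check
                continue
            rules = robots_cache.get(host)     # phase 3: robots.txt check
            if rules is None:
                rules = robots_allows(host)    # phase 5: fetch/update RobotsCache
                robots_cache[host] = rules
            if rules(ui):
                url_seen.add(ui)               # phase 4: record and enqueue
                Q.append(ui)
    return pages
```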
The only hybrid approach is used in [27], which employs a general-purpose database for storing downloaded robots.txt. Finally, with the exception of [27], prior crawlers do not perform DNS caching and rely on the local DNS server to store these records for them.

IV. SCALABILITY OF DISK METHODS

This section describes algorithms proposed in prior literature, analyzes their performance, and introduces our approach.

A. Algorithms

In Mercator-A [14], URLs that are not found in memory cache are looked up on disk by seeking within the URLseen file and loading the relevant block of hashes. The method clusters URLs by their site hash and attempts to resolve multiple in-memory links from the same site in one seek. However, in general, locality of parsed out URLs is not guaranteed and the worst-case delay of this method is one seek/URL and the worst-case read overhead is one block/URL. A similar approach is used in WebCrawler [25], where a general-purpose database performs multiple seeks (assuming a common B-tree implementation) to find URLs on disk. Even with RAID, disk seeking cannot be reduced to below 3-5 ms, which is several orders of magnitude slower than required in actual web crawls (e.g., 5-10 microseconds in IRLbot). General-purpose databases that we have examined are much worse and experience a significant slowdown (i.e., 10-50 ms per lookup) after about 100 million inserted records. Therefore, these approaches do not appear viable unless RAM caching can achieve some enormously high hit rates (i.e., 99.7% for IRLbot). We examine whether this is possible in the next section when studying caching.

Mercator-B [23] and Polybot [27] use a so-called batch disk check – they accumulate a buffer of URLs in memory and then merge it with a sorted URLseen file in one pass. Mercator-B stores only hashes of new URLs in RAM and places their text on disk. In order to retain the mapping from hashes to the text, a special pointer is attached to each hash. After the memory buffer is full, it is sorted in place and then compared with blocks of URLseen as they are read from disk. Non-duplicate URLs are merged with those already on disk and written into the new version of URLseen. Pointers are then used to recover the text of unique URLs and append it to the disk queue.

Polybot keeps the entire URLs (i.e., actual strings) in memory and organizes them into a binary search tree. Once the tree size exceeds some threshold, it is merged with the disk file URLseen, which contains compressed URLs already seen by the crawler. Besides being enormously CPU intensive (i.e., compression of URLs and search in binary string trees are rather slow in our experience), this method has to perform more frequent scans of URLseen than Mercator-B due to the less-efficient usage of RAM.

B. Modeling Prior Methods

Assume the crawler is in some steady state where the probability of uniqueness p among new URLs remains constant (we verify that this holds in practice later in the paper). Further assume that the current size of URLseen is U entries, the size of RAM allocated to URL checks is R, the average number of links per downloaded page is l, the average URL length is b, the URL compression ratio is q, and the crawler expects to visit N pages. It then follows that n = lN links must pass through URL check, np of them are unique, and bq is the average number of bytes in a compressed URL. Finally, denote by H the size of URL hashes used by the crawler and P the size of a memory pointer. Then we have the following result.

Theorem 1: The overhead of URLseen batch disk check is ω(n, R) = α(n, R)bn bytes, where for Mercator-B:

  α(n, R) = 2(2UH + pHn)(H + P)/(bR) + 2 + p    (1)

and for Polybot:

  α(n, R) = 2(2Ubq + pbqn)(b + 4P)/(bR) + p.    (2)

Proof: To prevent locking on URL check, both Mercator-B and Polybot must use two buffers of accumulated URLs (i.e., one for checking the disk and the other for newly arriving data). Assume this half-buffer allows storage of m URLs (i.e., m = R/[2(H + P)] for Mercator-B and m = R/[2(b + 4P)] for Polybot) and the size of the initial disk file is f (i.e., f = UH for Mercator-B and f = Ubq for Polybot).

For Mercator-B, the i-th iteration requires writing/reading of mb bytes of arriving URL strings, reading the current URLseen, writing it back, and appending mp hashes to it, i.e., 2f + 2mb + 2mpH(i − 1) + mpH bytes. This leads to the following after adding the final overhead to store pbn bytes of unique URLs in the queue:

  ω(n) = Σ_{i=1}^{n/m} [2f + 2mb + 2mpHi − mpH] + pbn
       = nb [2(2UH + pHn)(H + P)/(bR) + 2 + p].    (3)

For Polybot, the i-th iteration has overhead 2f + 2mpbq(i − 1) + mpbq, which yields:

  ω(n) = Σ_{i=1}^{n/m} [2f + 2mpbqi − mpbq] + pbn
       = nb [2(2Ubq + pbqn)(b + 4P)/(bR) + p]    (4)

and leads to (2).
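To make Theorem 1 easier to use, the sketch below evaluates (1) and (2) numerically. The constants in the driver (H = 8, P = 4, b = 110, p = 1/9, U = 100M, l = 59) are the ones assumed later in Section IV-E; treat the output as an illustration of how α grows with n rather than an exact reproduction of the tables below.

```python
def alpha_mercator_b(n, R, U=100e6, p=1/9.0, b=110, H=8, P=4):
    """Disk overhead multiplier of Mercator-B batch check, eq. (1)."""
    return 2 * (2 * U * H + p * H * n) * (H + P) / (b * R) + 2 + p

def alpha_polybot(n, R, U=100e6, p=1/9.0, b=110, P=4, bq=5):
    """Disk overhead multiplier of Polybot batch check, eq. (2)."""
    return 2 * (2 * U * bq + p * bq * n) * (b + 4 * P) / (b * R) + p

if __name__ == "__main__":
    l = 59                         # links per page
    R = 2**30                      # 1 GB of RAM for URL checks
    for N in (800e6, 8e9, 80e9):   # crawl sizes used in the comparison section
        n = l * N                  # parsed links
        print(f"N={N:.0e}: Mercator-B alpha={alpha_mercator_b(n, R):9.1f}, "
              f"Polybot alpha={alpha_polybot(n, R):9.1f}")
```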
This result shows that ω(n, R) is a product of two elements: the number of bytes bn in all parsed URLs and how many times α(n, R) they are written to/read from disk. If α(n, R) grows with n, the crawler's overhead will scale super-linearly and may eventually become overwhelming to the point of stalling the crawler. As n → ∞, the quadratic term in ω(n, R) dominates the other terms, which places Mercator-B's asymptotic performance at

  ω(n, R) = 2(H + P)pH n² / R    (5)

and that of Polybot at

  ω(n, R) = 2(b + 4P)pbq n² / R.    (6)

The ratio of these two terms is (H + P)H/[bq(b + 4P)], which for the IRLbot case with H = 8 bytes/hash, P = 4 bytes/pointer, b = 110 bytes/URL, and using very optimistic bq = 5 bytes/URL shows that Mercator-B is roughly 7.2 times faster than Polybot as n → ∞.

The best performance of any method that stores the text of URLs on disk before checking them against URLseen (e.g., Mercator-B) is αmin = 2 + p, which is the overhead needed to write all bn bytes to disk, read them back for processing, and then append bpn bytes to the queue. Methods with memory-kept URLs (e.g., Polybot) have an absolute lower bound of αmin = p, which is the overhead needed to write the unique URLs to disk. Neither bound is achievable in practice, however.

C. DRUM

We now describe the URL-check algorithm used in IRLbot, which belongs to a more general framework we call Disk Repository with Update Management (DRUM). The purpose of DRUM is to allow for efficient storage of large collections of <key,value> pairs, where key is a unique identifier (hash) of some data and value is arbitrary information attached to the key. There are three supported operations on these pairs – check, update, and check+update. In the first case, the incoming set of data contains keys that must be checked against those stored in the disk cache and classified as being duplicate or unique. For duplicate keys, the value associated with each key can be optionally retrieved from disk and used for some processing. In the second case, the incoming list contains <key,value> pairs that need to be merged into the existing disk cache. If a given key exists, its value is updated (e.g., overridden or incremented); if it does not, a new entry is created in the disk file. Finally, the third operation performs both check and update in one pass through the disk cache. Also note that DRUM may be supplied with a mixed list where some entries require just a check, while others need an update.

[Fig. 1. Operation of DRUM: a stream of <key,value,aux> tuples is split across k pairs of RAM arrays (a <key,value> buffer and an aux buffer per bucket), spilled into disk buckets as the arrays fill up, and later merged against the disk cache Z through the bucket buffer.]

A high-level overview of DRUM is shown in Figure 1. In the figure, a continuous stream of tuples <key,value,aux> arrives into DRUM, where aux is some auxiliary data associated with each key. DRUM spreads pairs <key,value> between k disk buckets Q_1^H, ..., Q_k^H based on their key (i.e., all keys in the same bucket have the same bit-prefix). This is accomplished by feeding pairs <key,value> into k memory arrays of size M each and then continuously writing them to disk as the buffers fill up. The aux portion of each key (which usually contains the text of URLs) from the i-th bucket is kept in a separate file Q_i^T in the same FIFO order as pairs <key,value> in Q_i^H. Note that to maintain fast sequential writing/reading, all buckets are pre-allocated on disk before they are used.

Once the largest bucket reaches a certain size r < R, the following process is repeated for i = 1, ..., k: 1) bucket Q_i^H is read into the bucket buffer shown in Figure 1 and sorted; 2) the disk cache Z is sequentially read in chunks of ∆ bytes and compared with the keys in bucket Q_i^H to determine their uniqueness; 3) those <key,value> pairs in Q_i^H that require an update are merged with the contents of the disk cache and written to the updated version of Z; 4) after all unique keys in Q_i^H are found, their original order is restored, Q_i^T is sequentially read into memory in blocks of size ∆, and the corresponding aux portion of each unique key is sent for further processing (see below). An important aspect of this algorithm is that all buckets are checked in one pass through disk cache Z.³

³ Note that disk bucket sort is a well-known technique that exploits uniformity of keys; however, its usage in checking URL uniqueness and the associated performance model of web crawling has not been explored before.

We now explain how DRUM is used for storing crawler data. The most important DRUM object is URLseen, which implements only one operation – check+update. Incoming tuples are <URLhash,-,URLtext>, where the key is an 8-byte hash of each URL, the value is empty, and the auxiliary data is the URL string. After all unique URLs are found, their text strings (aux data) are sent to the next queue for possible crawling. For caching robots.txt, we have another DRUM structure called RobotsCache, which supports asynchronous check and update operations. For checks, it receives tuples <HostHash,-,URLtext> and for updates <HostHash,HostData,->, where HostData contains the robots.txt file, IP address of the host, and optionally other host-related information. The last DRUM object of this section is called RobotsRequested and is used for storing the hashes of sites for which a robots.txt has been requested. Similar to URLseen, it only supports simultaneous check+update and its incoming tuples are <HostHash,-,HostText>.
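The following is a deliberately simplified, RAM-only sketch of DRUM's check+update pass: keys are spread into k buckets, each bucket is sorted and compared against a sorted cache in one pass, unique keys are merged back, and their aux data is released in arrival order. The real structure keeps the buckets, aux files, and cache Z on disk and streams them in ∆-sized blocks; Python lists and bisect are used here only as stand-ins for those sequential files.

```python
import bisect

class MiniDrum:
    """Toy illustration of DRUM's bucketed check+update (RAM-only stand-in)."""
    def __init__(self, k=4):
        self.k = k
        self.buckets = [[] for _ in range(k)]   # per-bucket list of (key, aux, seq)
        self.z = []                             # sorted "disk cache" of known keys
        self.seq = 0                            # preserves arrival order of aux data

    def check_update(self, key, aux=None):
        """Queue a <key,-,aux> tuple; bucket chosen by key prefix (here: modulo)."""
        self.buckets[key % self.k].append((key, aux, self.seq))
        self.seq += 1

    def merge(self):
        """One pass over cache Z for all buckets; returns aux of unique keys in FIFO order."""
        unique = []
        for bucket in self.buckets:
            bucket.sort()                                  # sort by key
            fresh = []
            for key, aux, seq in bucket:
                i = bisect.bisect_left(self.z, key)        # sequential compare in real DRUM
                if (i < len(self.z) and self.z[i] == key) or (fresh and fresh[-1] == key):
                    continue                               # duplicate: discard
                fresh.append(key)
                unique.append((seq, aux))
            self.z = sorted(self.z + fresh)                # write updated version of Z
            bucket.clear()
        unique.sort()                                      # restore original arrival order
        return [aux for _, aux in unique]

if __name__ == "__main__":
    drum = MiniDrum()
    for url in ["a.com/1", "b.com/2", "a.com/1", "c.com/3"]:
        drum.check_update(hash(url), url)
    print(drum.merge())   # the second a.com/1 is reported as a duplicate and dropped
```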
[Fig. 2. High level organization of IRLbot: the flow of new URLs between the crawling threads, the DRUM structures URLseen, RobotsRequested, and RobotsCache, the STAR and BEAST budget modules, and queues QD, QE, QR, and Q.]

Figure 2 shows the flow of new URLs produced by the crawling threads. They are first sent directly to URLseen using check+update. Duplicate URLs are discarded and unique ones are sent for verification of their compliance with the budget (both STAR and BEAST are discussed later in the paper). URLs that pass the budget are queued to be checked against robots.txt using RobotsCache. URLs that have a matching robots.txt file are classified immediately as passing or failing. Passing URLs are queued in Q and later downloaded by the crawling threads. Failing URLs are discarded.

URLs that do not have a matching robots.txt are sent to the back of queue QR and their hostnames are passed through RobotsRequested using check+update. Sites whose hash is not already present in this file are fed through queue QD into a special set of threads that perform DNS lookups and download robots.txt. They subsequently issue a batch update to RobotsCache using DRUM. Since in steady-state (i.e., excluding the initial phase) the time needed to download robots.txt is much smaller than the average delay in QR (i.e., 1-2 days), each URL makes no more than one cycle through this loop. In addition, when RobotsCache detects that certain robots.txt or DNS records have become outdated, it marks all corresponding URLs as "unable to check, outdated records," which forces RobotsRequested to pull a new set of exclusion rules and/or perform another DNS lookup. Old records are automatically expunged during the update when RobotsCache is re-written.

It should be noted that URLs are kept in memory only when they are needed for immediate action and all queues in Figure 2 are stored on disk. We should also note that DRUM data structures can support as many hostnames, URLs, and robots.txt exception rules as disk space allows.

D. DRUM Model

Assume that the crawler maintains a buffer of size M = 256 KB for each open file and that the hash bucket size r must be at least ∆ = 32 MB to support efficient reading during the check-merge phase. Further assume that the crawler can use up to D bytes of disk space for this process. Then we have the following result.

Theorem 2: Assuming that R ≥ 2∆(1 + P/H), DRUM's URLseen overhead is ω(n, R) = α(n, R)bn bytes, where:

  α(n, R) = 8M(H + P)(2UH + pHn)/(bR²) + 2 + p + 2H/b   if R² < Λ,
  α(n, R) = (H + b)(2UH + pHn)/(bD) + 2 + p + 2H/b       if R² ≥ Λ,    (7)

and Λ = 8MD(H + P)/(H + b).

Proof: Memory R needs to support 2k open file buffers and one block of URL hashes that are loaded from Q_i^H. In order to compute block size r, recall that it gets expanded by a factor of (H + P)/H when read into RAM due to the addition of a pointer to each hash value. We thus obtain that r(H + P)/H + 2Mk = R or:

  r = (R − 2Mk)H/(H + P).    (8)

Our disk restriction then gives us that the size of all buckets kr and their text krb/H must be equal to D:

  kr + krb/H = k(H + b)(R − 2Mk)/(H + P) = D.    (9)

It turns out that not all pairs (R, k) are feasible. The reason is that if R is set too small, we are not able to fill all of D with buckets since 2Mk will leave no room for r ≥ ∆. Re-writing (9), we obtain a quadratic equation 2Mk² − Rk + A = 0, where A = (H + P)D/(H + b). If R² < 8MA, we have no solution and thus R is insufficient to support D. In that case, we need to maximize k(R − 2Mk) subject to k ≤ km, where

  km = [R − ∆(H + P)/H] / (2M)    (10)

is the maximum number of buckets that still leave room for r ≥ ∆. Maximizing k(R − 2Mk), we obtain the optimal point k0 = R/(4M). Assuming that R ≥ 2∆(1 + P/H), condition k0 ≤ km is always satisfied. Using k0 buckets brings our disk usage to (H + b)R²/[8M(H + P)], which is always less than D.

In the case R² ≥ 8MA, we can satisfy D and the correct number of buckets k is given by two choices:

  k = [R ± √(R² − 8MA)] / (4M).    (11)

The reason why we have two values is that we can achieve D either by using few buckets (i.e., k is small and r is large) or many buckets (i.e., k is large and r is small). The correct solution is to take the smaller root to minimize the number of open handles and disk fragmentation. Putting things together:

  k1 = [R − √(R² − 8MA)] / (4M).    (12)

Note that we still need to ensure k1 ≤ km, which holds when:

  R ≥ ∆(H + P)/H + 2MAH/[∆(H + P)].    (13)

Given that R ≥ 2∆(1 + P/H) from the statement of the theorem, it is easy to verify that (13) is always satisfied.

Next, for the i-th iteration that fills up all k buckets, we need to write/read Q_i^T once (overhead 2krb/H) and read/write each bucket once as well (overhead 2kr). The remaining overhead is reading/writing URLseen (overhead 2f + 2krp(i − 1)) and appending the new URL hashes (overhead krp). We thus obtain that we need nH/(kr) iterations and:

  ω(n, R) = Σ_{i=1}^{nH/(kr)} [2f + 2krb/H + 2kr + 2krpi − krp] + pbn
          = nb [(2UH + pHn)H/(bkr) + 2 + p + 2H/b].    (14)

Recalling our two conditions, we use k0r = HR²/[8M(H + P)] for R² < 8MA to obtain:

  ω(n, R) = nb [8M(H + P)(2UH + pHn)/(bR²) + 2 + p + 2H/b].    (15)

For the other case R² ≥ 8MA, we have k1r = DH/(H + b) and thus get:

  ω(n, R) = nb [(H + b)(2UH + pHn)/(bD) + 2 + p + 2H/b],    (16)

which leads to the statement of the theorem.
It follows from the proof of Theorem 2 that in order to match D to a given RAM size R and avoid unnecessary allocation of disk space, one should operate at the optimal point given by R² = Λ:

  Dopt = R²(H + b)/[8M(H + P)].    (17)

For example, R = 1 GB produces Dopt = 4.39 TB and R = 2 GB produces Dopt = 17 TB. For D = Dopt, the corresponding number of buckets is kopt = R/(4M), the size of the bucket buffer is ropt = RH/[2(H + P)] ≈ 0.33R, and the leading quadratic term of ω(n, R) in (7) is now R/(4M) times smaller than in Mercator-B. This ratio is 1,000 for R = 1 GB and 8,000 for R = 8 GB. The asymptotic speed-up in either case is significant.

Finally, observe that the best possible performance of any method that stores both hashes and URLs on disk is αmin = 2 + p + 2H/b.

E. Comparison

We next compare disk performance of the studied methods when non-quadratic terms in ω(n, R) are non-negligible. Table II shows α(n, R) of the three studied methods for fixed RAM size R and disk D as N increases from 800 million to 8 trillion (p = 1/9, U = 100M pages, b = 110 bytes, l = 59 links/page). As N reaches into the trillions, both Mercator-B and Polybot exhibit overhead that is thousands of times larger than the optimal and invariably become "bogged down" in re-writing URLseen. On the other hand, DRUM stays within a factor of 50 from the best theoretically possible value (i.e., αmin = 2.256) and does not sacrifice nearly as much performance as the other two methods.

TABLE II
OVERHEAD α(n, R) FOR R = 1 GB, D = 4.39 TB

N       Mercator-B    Polybot    DRUM
800M    11.6          69         2.26
8B      93            663        2.35
80B     917           6,610      3.3
800B    9,156         66,082     12.5
8T      91,541        660,802    104

Since disk size D is likely to be scaled with N in order to support the newly downloaded pages, we assume for the next example that D(n) is the maximum of 1 TB and the size of unique hashes appended to URLseen during the crawl of N pages, i.e., D(n) = max(pHn, 10^12). Table III shows how dynamically scaling disk size allows DRUM to keep the overhead virtually constant as N increases.

TABLE III
OVERHEAD α(n, R) FOR D = D(n)

N       R = 4 GB                 R = 8 GB
        Mercator-B    DRUM       Mercator-B    DRUM
800M    4.48          2.30       3.29          2.30
8B      25            2.7        13.5          2.7
80B     231           3.3        116           3.3
800B    2,290         3.3        1,146         3.3
8T      22,887        8.1        11,444        3.7

To compute the average crawling rate that the above methods support, assume that W is the average disk I/O speed and consider the next result.

Theorem 3: Maximum download rate (in pages/s) supported by the disk portion of URL uniqueness checks is:

  Sdisk = W/(α(n, R)bl).    (18)

Proof: The time needed to perform uniqueness checks for n new URLs is spent in disk I/O involving ω(n, R) = α(n, R)bn = α(n, R)blN bytes. Assuming that W is the average disk I/O speed, it takes N/S seconds to generate n new URLs and ω(n, R)/W seconds to check their uniqueness. Equating the two entities, we have (18).

We use IRLbot's parameters to illustrate the applicability of this theorem. Neglecting the process of appending new URLs to the queue, the crawler's read and write overhead is symmetric. Then, assuming IRLbot's 1-GB/s read speed and 350-MB/s write speed (24-disk RAID-5), we obtain that its average disk read-write speed is equal to 675 MB/s. Allocating 15% of this rate for checking URL uniqueness⁴, the effective disk bandwidth of the server can be estimated at W = 101.25 MB/s. Given the conditions of Table III for R = 8 GB and assuming N = 8 trillion pages, DRUM yields a sustained download rate of Sdisk = 4,192 pages/s (i.e., 711 mb/s using IRLbot's average HTML page size of 21.2 KB). In crawls of the same scale, Mercator-B would be 3,075 times slower and would admit an average rate of only 1.4 pages/s. Since with these parameters Polybot is 7.2 times slower than Mercator-B, its average crawling speed would be 0.2 pages/s.

⁴ Additional disk I/O is needed to verify robots.txt, perform reputation analysis, and enforce budgets.

With just 10 DRUM servers and a 10-gb/s Internet link, one can create a search engine with a download capacity of 100 billion pages per month and scalability to trillions of pages.
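Theorem 3 is straightforward to evaluate once α(n, R) is available. The sketch below plugs DRUM's α from (7) into (18), using W = 101.25 MB/s, l = 59, b = 110 and the dynamic disk size D(n) = max(pHn, 1 TB) from Table III; the printed rates land in the same few-thousand-pages/s range as the estimates quoted above, with small differences due to rounding of the constants.

```python
def alpha_drum(n, R, D, U=100e6, p=1/9.0, b=110, H=8, P=4, M=256 * 2**10):
    """DRUM's overhead multiplier, eq. (7)."""
    lam = 8.0 * M * D * (H + P) / (H + b)
    base = 2 + p + 2.0 * H / b
    if R * R < lam:
        return 8.0 * M * (H + P) * (2 * U * H + p * H * n) / (b * R * R) + base
    return (H + b) * (2 * U * H + p * H * n) / (b * D) + base

def s_disk(N, R, W=101.25e6, l=59, b=110, p=1/9.0, H=8):
    """Sustainable crawl rate in pages/s, eq. (18), with D = D(n) = max(pHn, 1 TB)."""
    n = l * N
    D = max(p * H * n, 1e12)
    return W / (alpha_drum(n, R, D) * b * l)

if __name__ == "__main__":
    for N in (8e9, 8e12):                      # 8 billion and 8 trillion pages
        print(f"N={N:.0e}: S_disk ~ {s_disk(N, R=8 * 2**30):,.0f} pages/s")
```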
V. CACHING

To understand whether caching provides improved performance, one must consider a complex interplay between the available CPU capacity, spare RAM size, disk speed, performance of the caching algorithm, and crawling rate. This is a three-stage process – we first examine how cache size and crawl speed affect the hit rate, then analyze the CPU restrictions of caching, and finally couple them with RAM/disk limitations using analysis in the previous section.

A. Cache Hit Rate

Assume that c bytes of RAM are available to a cache whose entries incur fixed overhead γ bytes. Then E = c/γ is the maximum number of elements stored in the cache at any time. Then define π(c, S) to be the cache miss rate under crawling speed S pages/s and cache size c. The reason why π depends on S is that the faster the crawl, the more pages it produces between visits to the same site, which is where duplicate links are most prevalent. Defining τh to be the per-host visit delay, common sense suggests that π(c, S) should depend not only on c, but also on τh lS.

Table IV shows LRU cache hit rates 1 − π(c, S) during several stages of our crawl. We seek in the trace file to the point where the crawler has downloaded N0 pages and then simulate LRU hit rates by passing the next 10E URLs discovered by the crawler through the cache. As the table shows, a significant jump in hit rates happens between 4M and 8M entries. This is consistent with IRLbot's peak value of τh lS ≈ 7.3 million. Note that before cache size reaches this value, most hits in the cache stem from redundant links within the same page. As E starts to exceed τh lS, popular URLs on each site survive between repeat visits and continue staying in the cache as long as the corresponding site is being crawled. Additional simulations confirming this effect are omitted for brevity.

TABLE IV
LRU HIT RATES STARTING AT N0 CRAWLED PAGES

Cache elements E    N0 = 1B    N0 = 4B
256K                19%        16%
4M                  26%        22%
8M                  68%        59%
16M                 71%        67%
64M                 73%        73%
512M                80%        78%

Unlike [5], which suggests that E be set 100 − 500 times larger than the number of threads, our results show that E must be slightly larger than τh lS to achieve a 60% hit rate and as high as 10τh lS to achieve 73%.

B. Cache Speed

Another aspect of keeping a RAM cache is the speed at which potentially large memory structures must be checked and updated as new URLs keep pouring in. Since searching large trees in RAM usually results in misses in the CPU cache, some of these algorithms can become very slow as the depth of the search increases. Define 0 ≤ φ(S) ≤ 1 to be the average CPU utilization of the server crawling at S pages/s and µ(c) to be the number of URLs/s that a cache of size c can process on an unloaded server. Then, we have the following result.

Theorem 4: Assuming φ(S) is monotonically non-decreasing, the maximum download rate Scache (in pages/s) supported by URL caching is:

  Scache(c) = g⁻¹(µ(c)),    (19)

where g⁻¹ is the inverse of g(x) = lx/(1 − φ(x)).

Proof: We assume that caching performance linearly depends on the available CPU capacity, i.e., if fraction φ(S) of the CPU is allocated to crawling, then the caching speed is µ(c)(1 − φ(S)) URLs/s. Then, the maximum crawling speed would match the rate of URL production to that of the cache, i.e.,

  lS = µ(c)(1 − φ(S)).    (20)

Re-writing (20) using g(x) = lx/(1 − φ(x)), we have g(S) = µ(c), which has a unique solution S = g⁻¹(µ(c)) since g(x) is a strictly increasing function with a proper inverse.

For the common case φ(S) = S/Smax, where Smax is the server's maximum (i.e., CPU-limited) crawling rate in pages/s, (19) yields a very simple expression:

  Scache(c) = µ(c)Smax/(lSmax + µ(c)).    (21)

To show how to use the above result, Table V compares the speed of several memory structures on the IRLbot server using E = 16M elements and displays model (21) for Smax = 4,000 pages/s. As can be seen in the table, insertion of text URLs into a balanced tree (used in Polybot [27]) is the slowest operation that also consumes the most memory. The speed of classical LRU caching (185K/s) and search-trees with 8-byte keys (416K/s) is only slightly better since both use multiple (i.e., log2 E) jumps through memory. CLOCK [5], which is a space and time optimized approximation to LRU, achieves a much better speed (2M/s), requires less RAM, and is suitable for crawling rates up to 3,577 pages/s on this server.

TABLE V
INSERTION RATE AND MAXIMUM CRAWLING SPEED FROM (21)

Method                         µ(c) URLs/s    Size      Scache
Balanced tree (strings)        113K           2.2 GB    1,295
Tree-based LRU (8-byte int)    185K           1.6 GB    1,757
Balanced tree (8-byte int)     416K           768 MB    2,552
CLOCK (8-byte int)             2M             320 MB    3,577

After experimentally determining µ(c) and φ(S), one can easily compute Scache from (19); however, this metric by itself does not determine whether caching should be enabled or even how to select the optimal cache size c. Even though caching reduces the disk overhead by sending πn rather than n URLs to be checked against the disk, it also consumes more memory and leaves less space for the buffer of URLs in RAM, which in turn results in more scans through disk to determine URL uniqueness. Understanding this tradeoff involves careful modeling of hybrid RAM-disk algorithms, which we perform next.
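Since (21) is a closed form, the last column of Table V can be recomputed directly from the measured insertion rates µ(c), assuming the same Smax = 4,000 pages/s and l = 59 links/page:

```python
def s_cache(mu, s_max=4000.0, l=59.0):
    """Maximum crawl rate supported by the cache, eq. (21): mu*Smax/(l*Smax + mu)."""
    return mu * s_max / (l * s_max + mu)

if __name__ == "__main__":
    rates = {                      # measured insertion speeds from Table V (URLs/s)
        "balanced tree (strings)": 113e3,
        "tree-based LRU (8-byte int)": 185e3,
        "balanced tree (8-byte int)": 416e3,
        "CLOCK (8-byte int)": 2e6,
    }
    for name, mu in rates.items():
        print(f"{name:30s} S_cache ~ {s_cache(mu):5.0f} pages/s")
```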
C. Hybrid Performance

We now address the issue of how to assess the performance of disk-based methods with RAM caching. Mercator-A improves performance by a factor of 1/π since only πn URLs are sought from disk. Given common values of π ∈ [0.25, 0.35] in Table IV, this optimization results in a 2.8 − 4 times speed-up, which is clearly insufficient for making this method competitive with the other approaches.

Mercator-B, Polybot, and DRUM all exhibit new overhead α(π(c, S)n, R − c)bπ(c, S)n with α(n, R) taken from the appropriate model. As n → ∞ and assuming c ≪ R, all three methods decrease ω by a factor of π⁻² ∈ [8, 16] for π ∈ [0.25, 0.35]. For n ≪ ∞, however, only the linear factor bπ(c, S)n enjoys an immediate reduction, while α(π(c, S)n, R − c) may or may not change depending on the dominance of the first term in (1), (2), and (7), as well as the effect of reduced RAM size R − c on the overhead.

Table VI shows one example where c = 320 MB (E = 16M elements, γ = 20 bytes/element, π = 0.33) occupies only a small fraction of R. Notice in the table that caching can make Mercator-B's disk overhead close to optimal for small N, which nevertheless does not change its scaling performance as N → ∞.

TABLE VI
OVERHEAD α(πn, R − c) FOR D = D(n), π = 0.33, c = 320 MB

N       R = 4 GB                 R = 8 GB
        Mercator-B    DRUM       Mercator-B    DRUM
800M    3.02          2.27       2.54          2.27
8B      10.4          2.4        6.1           2.4
80B     84            3.3        41            3.3
800B    823           3.3        395           3.3
8T      8,211         4.5        3,935         3.3

Since π(c, S) depends on S, determining the maximum speed a hybrid approach supports is no longer straightforward.

Theorem 5: Assuming π(c, S) is monotonically non-decreasing in S, the maximum download rate Shybrid supported by disk algorithms with RAM caching is:

  Shybrid(c) = h⁻¹(W/(bl)),    (22)

where h⁻¹ is the inverse of h(x) = xα(π(c, x)n, R − c)π(c, x).

Proof: From (18), we have:

  S = W/[α(π(c, S)n, R − c)bπ(c, S)l],    (23)

which can be written as h(S) = W/(bl). The solution to this equation is S = h⁻¹(W/(bl)) where as before the inverse h⁻¹ exists due to the strict monotonicity of h(x).

To better understand (22), we show an example of finding the best cache size c that maximizes Shybrid(c) assuming π(c, S) is a step function of hit rates derived from Table IV. Specifically, π(c, S) = 1 if c = 0, π(c, S) = 0.84 if 0 < c < γτh lS, 0.41 if c < 4γτh lS, 0.27 if c < 10γτh lS, and 0.22 for larger c. Table VII shows the resulting crawling speed in pages/s after maximizing (22) with respect to c. As before, Mercator-B is close to optimal for small N and large R, but for N → ∞ its performance degrades. DRUM, on the other hand, maintains at least 11,000 pages/s over the entire range of N. Since these examples use large R in comparison to the cache size needed to achieve non-trivial hit rates, the values in this table are almost inversely proportional to those in Table VI, which can be used to ballpark the maximum value of (22) without inverting h(x).

TABLE VII
MAXIMUM HYBRID CRAWLING RATE max_c Shybrid(c) FOR D = D(n)

N       R = 4 GB                   R = 8 GB
        Mercator-B    DRUM         Mercator-B    DRUM
800M    18,051        26,433       23,242        26,433
8B      6,438         25,261       10,742        25,261
80B     1,165         18,023       2,262         18,023
800B    136           18,023       274           18,023
8T      13.9          11,641       27.9          18,023

Knowing function Shybrid from (22), one needs to couple it with the performance of the caching algorithm to obtain the true optimal value of c:

  copt = arg max_{c ∈ [0, R]} min(Scache(c), Shybrid(c)),    (24)

which is illustrated in Figure 3. On the left of the figure, we plot some hypothetical functions Scache(c) and Shybrid(c) for c ∈ [0, R]. Assuming that µ(0) = ∞, the former curve always starts at Scache(0) = Smax and is monotonically non-increasing. For π(0, S) = 1, the latter function starts at Shybrid(0) = Sdisk and tends to zero as c → R, but not necessarily monotonically. On the right of the figure, we show the supported crawling rate min(Scache(c), Shybrid(c)) whose maximum point corresponds to the pair (copt, Sopt). If Sopt > Sdisk, then caching should be enabled with c = copt; otherwise, it should be disabled. The most common case when the crawler benefits from disabling the cache is when R is small compared to γτh lS or the CPU is the bottleneck (i.e., Scache < Sdisk).

[Fig. 3. Finding optimal cache size copt and optimal crawling speed Sopt: the left plot shows Scache(c) and Shybrid(c) versus c ∈ [0, R]; the right plot shows min(Shybrid(c), Scache(c)), whose maximum identifies (copt, Sopt) relative to Smax and Sdisk.]
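The optimization in (24) reduces to a one-dimensional search once Scache(c) and Shybrid(c) can be evaluated. The sketch below performs a simple grid search; the two curves supplied in the driver are hypothetical placeholders with the qualitative shapes described above, since the true functions depend on the measured µ(c), π(c, S), and the disk model of Section IV.

```python
def optimal_cache(s_cache, s_hybrid, R, steps=1000):
    """Grid search for eq. (24): c_opt = argmax_c min(S_cache(c), S_hybrid(c))."""
    best_c, best_s = 0.0, 0.0
    for i in range(steps + 1):
        c = R * i / steps
        s = min(s_cache(c), s_hybrid(c))
        if s > best_s:
            best_c, best_s = c, s
    return best_c, best_s

if __name__ == "__main__":
    R = 8 * 2**30
    # hypothetical shapes: caching speed decays slowly with c, disk speed starts at
    # S_disk for c = 0, rises as hits remove disk work, then falls as c eats RAM
    s_cache = lambda c: 4000.0 / (1.0 + c / R)
    s_hybrid = lambda c: 500.0 + 3000.0 * (c / R) * (1.0 - c / R)
    c_opt, s_opt = optimal_cache(s_cache, s_hybrid, R)
    print(f"c_opt ~ {c_opt / 2**20:.0f} MB, supported rate ~ {s_opt:.0f} pages/s")
    # enable the cache only if s_opt exceeds S_disk = s_hybrid(0), as discussed above
```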
VI. SPAM AND REPUTATION

This section explains the necessity for detecting spam during crawls and proposes a simple technique for computing domain reputation in real-time.

A. Problems with BFS

Prior crawlers [7], [14], [23], [27] have no documented spam-avoidance algorithms and are typically assumed to perform BFS traversals of the web graph. Several studies [1], [3] have examined in simulations the effect of changing crawl order by applying bias towards more popular pages. The conclusions are mixed and show that PageRank order [4] can be sometimes marginally better than BFS [1] and sometimes marginally worse [3], where the metric by which they are compared is the rate at which the crawler discovers popular pages.

While BFS works well in simulations, its performance on infinite graphs and/or in the presence of spam farms remains unknown. Our early experiments show that crawlers eventually encounter a quickly branching site that will start to dominate the queue after 3 − 4 levels in the BFS tree. Some of these sites are spam-related with the aim of inflating the page rank of target hosts, while others are created by regular users sometimes for legitimate purposes (e.g., calendars, testing of asp/php engines), sometimes for questionable purposes (e.g., intentional trapping of unwanted robots), and sometimes for no apparent reason at all. What makes these pages similar is the seemingly infinite number of dynamically generated pages and/or hosts within a given domain. Crawling these massive webs or performing DNS lookups on millions of hosts from a given domain not only places a significant burden on the crawler, but also wastes bandwidth on downloading largely useless content.

Simply restricting the branching factor or the maximum number of pages/hosts per domain is not a viable solution since there is a number of legitimate sites that contain over a hundred million pages and over a dozen million virtual hosts (i.e., various blog sites, hosting services, directories, and forums). For example, Yahoo currently reports indexing 1.2 billion objects just within its own domain and blogspot claims over 50 million users, each with a unique hostname. Therefore, differentiating between legitimate and illegitimate web "monsters" becomes a fundamental task of any crawler.

Note that this task does not entail assigning popularity to each potential page as would be the case when returning query results to a user; instead, the crawler needs to decide whether a given domain or host should be allowed to massively branch or not. Indeed, spam-sites and various auto-generated webs with a handful of pages are not a problem as they can be downloaded with very little effort and later classified by data-miners using PageRank or some other appropriate algorithm. The problem only occurs when the crawler assigns to domain x download bandwidth that is disproportionate to the value of x's content.

Another aspect of spam classification is that it must be performed with very little CPU/RAM/disk effort and run in real-time at speed SL links per second, where L is the number of unique URLs per page.

B. Controlling Massive Sites

Before we introduce our algorithm, several definitions are in order. Both host and site refer to Fully Qualified Domain Names (FQDNs) on which valid pages reside (e.g., motors.ebay.com). A server is a physical host that accepts TCP connections and communicates content to the crawler. Note that multiple hosts may be co-located on the same server. A top-level domain (TLD) or a country-code TLD (cc-TLD) is a domain one level below the root in the DNS tree (e.g., .com, .net, .uk). A pay-level domain (PLD) is any domain that requires payment at a TLD or cc-TLD registrar. PLDs are usually one level below the corresponding TLD (e.g., amazon.com), with certain exceptions for cc-TLDs (e.g., ebay.co.uk, det.wa.edu.au). We use a comprehensive list of custom rules for identifying PLDs, which have been compiled as part of our ongoing DNS project.

While computing PageRank [19], BlockRank [18], or SiteRank [10], [30] is a potential solution to the spam problem, these methods become extremely disk intensive in large-scale applications (e.g., 41 billion pages and 641 million hosts found in our crawl) and arguably with enough effort can be manipulated [12] by huge link farms (i.e., millions of pages and sites pointing to a target spam page). In fact, strict page-level rank is not absolutely necessary for controlling massively branching spam. Instead, we found that spam could be "deterred" by budgeting the number of allowed pages per PLD based on domain reputation, which we determine by domain in-degree from resources that spammers must pay for. There are two options for these resources – PLDs and IP addresses. We chose the former since classification based on IPs (first suggested in Lycos [21]) has proven less effective since large subnets inside link farms could be given unnecessarily high priority and multiple independent sites co-hosted on the same IP were improperly discounted.

While it is possible to classify each site and even each subdirectory based on their PLD in-degree, our current implementation uses a coarse-granular approach of only limiting spam at the PLD level. Each PLD x starts with a default budget B0, which is dynamically adjusted using some function F(dx) as x's in-degree dx changes. Budget Bx represents the number of pages that are allowed to pass from x (including all hosts and subdomains in x) to crawling threads every T time units.

[Fig. 4. Operation of STAR: crawling threads send PLD-PLD links to the DRUM structure PLDindegree (check+update), which maintains per-PLD budgets; unique URLs from URLseen are checked against PLDindegree, given their budgets, and passed to the BEAST budget enforcement and the robots-check queue QR.]

Figure 4 shows how our system, which we call Spam Tracking and Avoidance through Reputation (STAR), is organized. In the figure, crawling threads aggregate PLD-PLD link information and send it to a DRUM structure PLDindegree, which uses a batch update to store for each PLD x its hash hx, in-degree dx, current budget Bx, and hashes of all in-degree neighbors in the PLD graph. Unique URLs arriving from URLseen perform a batch check against PLDindegree, and are given Bx on their way to BEAST, which we discuss in the next section.

Note that by varying the budget function F(dx), one can implement a number of policies – crawling of only popular pages (i.e., zero budget for low-ranked domains and maximum budget for high-ranked domains), equal distribution between all domains (i.e., budget Bx = B0 for all x), and crawling with a bias toward popular/unpopular pages (i.e., budget directly/inversely proportional to the PLD in-degree).

VII. POLITENESS AND BUDGETS

This section discusses how to enable polite crawler operation and scalably enforce budgets.
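Since the budget function F(dx) is left open above (any of the listed policies can be plugged in), the sketch below only illustrates the mechanism behind the budgets that Section VII enforces: PLD-PLD links are aggregated, each PLD's budget is recomputed from its current in-degree, and a per-PLD counter admits at most Bx URLs per interval T. The linear form of F and all constants are assumptions of this illustration, not IRLbot's actual parameters, and the in-memory dictionaries stand in for the DRUM structure PLDindegree.

```python
from collections import defaultdict

class StarBudgets:
    """Toy PLD budget allocator in the spirit of STAR (RAM-only stand-in for PLDindegree)."""
    def __init__(self, b0=10, b_max=10000):
        self.b0, self.b_max = b0, b_max
        self.in_links = defaultdict(set)      # PLD -> set of linking PLDs
        self.admitted = defaultdict(int)      # URLs admitted in the current interval T

    def add_pld_link(self, src_pld, dst_pld):
        """Aggregate a PLD-PLD edge reported by the crawling threads."""
        if src_pld != dst_pld:
            self.in_links[dst_pld].add(src_pld)

    def budget(self, pld):
        """Assumed F(d_x): budget grows linearly with PLD in-degree, capped at b_max."""
        d = len(self.in_links[pld])
        return min(self.b0 + d, self.b_max)

    def admit(self, pld):
        """Return True if this PLD may pass one more URL in the current interval."""
        if self.admitted[pld] < self.budget(pld):
            self.admitted[pld] += 1
            return True
        return False

    def new_interval(self):
        """Reset counters every T time units."""
        self.admitted.clear()

if __name__ == "__main__":
    star = StarBudgets()
    for src in ("cnn.com", "bbc.co.uk", "tamu.edu"):
        star.add_pld_link(src, "example.org")
    print(star.budget("example.org"))             # 13 with these assumed constants
    print([star.admit("spam-farm.biz") for _ in range(12)].count(True))  # capped at B0 = 10
```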