1. Web Crawling
Carlos Castillo
Center for Web Research
Computer Science Department
University of Chile
www.cwr.cl
Outline
Motivation
Behavior of a crawler
Selection policy
Re-visit policy
Politeness policy
Parallelization policy
Scheduling
Short-term scheduling
Long-term scheduling
When to stop crawling
Architecture
History
Classification
Implementation
Practical issues
Summary
References
3. An astronomer watching the sky
[Figure]
4. The problem of abundance
5 exabytes of new information a year [Lyman and Varian, 2003] (1 exabyte = 10^18 bytes)
Most directories no longer encourage administrators to submit their Web sites: they have to find the page on their own
Adversarial information retrieval
7. The bandwidth is expensive
“Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained” [Edwards et al., 2001]
The cost of a “complete” Web crawl is estimated at $1.5 million USD [Craswell et al., 2004], considering only network usage
9. Combination of policies
Selection policy
Re-visit policy
Politeness policy
Parallelization policy
13. It is necessary to prioritize
No search engine indexes more than 16% of the Web [Lawrence and Giles, 2000]
Download only the “important” pages
Restrict to only a sub-domain
Avoid spamming
14. Selection based on links
Order by Pagerank [Cho et al., 1998]
Breadth-first search [Najork and Wiener, 2001]
Focused crawling [Chakrabarti et al., 1999], attempting to infer the similarity of pages to a topic before downloading them
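Link-based selection can be sketched as a frontier from which the most promising URL is always taken first. The example below is only an illustration under stated assumptions: the toy link graph, the function name `crawl_order`, and the use of the number of in-links seen so far as a cheap stand-in for Pagerank are all hypothetical and are not the method of the cited papers.

```python
# Toy link graph (made-up data): each page maps to the pages it links to.
TOY_GRAPH = {"a": ["b", "c", "d"], "b": ["d"], "c": [], "d": []}


def crawl_order(graph, seed):
    """Visit pages starting at `seed`, always taking the frontier page with
    the most in-links seen so far (assumed stand-in for Pagerank)."""
    inlinks = {seed: 0}
    discovered = [seed]   # discovery order, used only to break ties
    frontier = {seed}
    visited = []
    while frontier:
        # Highest in-link count wins; ties go to the page discovered first.
        url = max(frontier, key=lambda u: (inlinks[u], -discovered.index(u)))
        frontier.remove(url)
        visited.append(url)
        for target in graph.get(url, []):
            inlinks[target] = inlinks.get(target, 0) + 1
            if target not in discovered:
                discovered.append(target)
                frontier.add(target)
    return visited
```

On the toy graph, page "d" is downloaded before "c" because two of the pages already visited link to it, whereas a plain first-in, first-out frontier would take "c" first.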
15. Events
Creation, which requires a link
Update, which can be either minor or major. Most changes are minor, but this is not easy to exploit
Deletion, which is more damaging to the search engine’s reputation
16. Cost functions
Freshness:
\[ F_p(t) = \begin{cases} 1 & \text{if } p \text{ is not modified at time } t \\ 0 & \text{otherwise} \end{cases} \]
Age:
\[ A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified at time } t \\ t - \mathrm{lastmod}(p) & \text{otherwise} \end{cases} \]
Depending on the cost function used, the behavior can be different
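The two cost functions can be written as small functions of time. This is a minimal sketch: the parameter `modified_at` is a hypothetical name standing for lastmod(p), the time at which the live page changed after the local copy was taken (`None` if it has not changed), and all times are assumed to be in the same unit.

```python
from typing import Optional


def freshness(t: float, modified_at: Optional[float]) -> int:
    """F_p(t): 1 while the local copy still matches the live page, else 0."""
    return 1 if modified_at is None or t < modified_at else 0


def age(t: float, modified_at: Optional[float]) -> float:
    """A_p(t): 0 while the copy is fresh, otherwise the time elapsed since
    the page was modified (t - lastmod(p))."""
    if modified_at is None or t < modified_at:
        return 0.0
    return t - modified_at
```

Freshness only records that a copy is stale, while age keeps growing the longer the stale copy is kept, which is why optimizing one or the other can lead to different re-visit behavior.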
18. Evolution of freshness and age
[Figure]
19. Estimating freshness and age
Page changes can be modeled as a Poisson process [Brewington et al., 2000]
The probability that page p has not been modified t time units after it was downloaded is
\[ P(F_p(t) = 1) = e^{-\lambda_p t} \]
\(\lambda_p\) can be estimated using historical data, especially if the last-modification date is provided by the server [Cho and Garcia-Molina, 2003]
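Under the Poisson model, the probability that a copy is still fresh follows directly from the change rate. The sketch below assumes a deliberately naive rate estimate (observed changes divided by observation time); it is not the estimator of the cited papers, and the function names are hypothetical.

```python
import math


def estimate_rate(num_changes: int, observation_time: float) -> float:
    """Naive estimate of lambda_p: observed changes per unit of time."""
    return num_changes / observation_time


def prob_fresh(rate: float, t: float) -> float:
    """P(F_p(t) = 1) = exp(-lambda_p * t) for a Poisson change process."""
    return math.exp(-rate * t)
```

For example, a page seen to change 3 times in 30 days gets an estimated rate of 0.1 changes per day, so a copy that is 10 days old is still fresh with probability exp(-1), roughly 0.37.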
20. Web robots can be a threat
They consume network resources
They can cause server overload
The robot exclusion protocol should be honored [Koster, 1996]
The re-visiting period should be reasonable (what is reasonable?)
21. Robot exclusion
Server exclusions
    Disallow: /cgi-bin
Page exclusions
    <meta name="robots" content="noindex,nofollow,nocache">
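A crawler can check server exclusions before fetching a URL. The sketch below uses Python's standard `urllib.robotparser` module; the host `www.example.com`, the user agent name `ExampleBot`, and the `User-agent: *` line (added here because a complete robots.txt record needs one) are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the Disallow rule applied to all robots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
"""


def allowed(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the exclusion rules above permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a real crawler the rules would be downloaded from each site's `/robots.txt` and cached per host instead of being parsed from a constant on every call.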
23. Objectives
Distribute the Web crawling
Ideally, no central control point
Reduce overhead due to communications
Reduce overlap, ideally zero
24. Types of policies
Static assignment: typically a hash function on site names
Dynamic assignment: more complicated to handle, usually requires central control
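Static assignment can be sketched as hashing the site name of each URL and taking the result modulo the number of crawling processes. The function name, the choice of SHA-256 and the example host are assumptions; any stable hash function would serve.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index using a hash of its site name."""
    site = urlsplit(url).hostname or ""
    digest = hashlib.sha256(site.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

Because only the site name is hashed, all pages of a site go to the same process, so no central control point is needed to avoid overlap and per-site politeness can be enforced locally.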
25. Problem separation
Indexing, downloading, and distributed crawling are done in batches; this can be exploited to separate the problem
Short-term scheduling: using the network resources efficiently
Long-term scheduling: ordering the crawling process to download important pages first
27. Short-term scheduling
If \(B\) is the available bandwidth and \(S_p\) is the size of page \(p\), then \(B_p\), the downloading speed for page \(p\), is
\[ B_p = \frac{S_p}{T^*} \]
where \(T^*\) is the optimal time to use all of the available bandwidth:
\[ T^* = \frac{\sum_p S_p}{B} \]
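The two formulas can be checked numerically. The sketch below assumes that S_p is the size of page p (the slide does not define the symbol) and uses made-up page sizes and bandwidth in arbitrary but consistent units.

```python
def optimal_time(sizes: dict, bandwidth: float) -> float:
    """T* = (sum of page sizes S_p) / B."""
    return sum(sizes.values()) / bandwidth


def download_speeds(sizes: dict, bandwidth: float) -> dict:
    """B_p = S_p / T* for every page p."""
    t_star = optimal_time(sizes, bandwidth)
    return {page: size / t_star for page, size in sizes.items()}
```

With sizes of 100 and 300 units and a bandwidth of 40 units per second, T* is 10 seconds and the per-page speeds are 10 and 30 units per second. The speeds add up to the full bandwidth, and every download finishes exactly at T*.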
28. Full parallelization
[Figure]
29. Full serialization
[Figure]
30. Realistic scenario
[Figure]
31. Number of active crawlers
[Figure]
32. Objective
Download “important” pages first
Download X% of the top Y% pages
Cumulative Pagerank vs. fraction of the Web downloaded: total Pagerank is 1, so a random strategy should give a straight line
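The evaluation curve described above can be computed for any download order. In this sketch the Pagerank values are made-up numbers that sum to 1, and `cumulative_pagerank` is a hypothetical helper name.

```python
def cumulative_pagerank(order, pagerank):
    """Return (fraction of pages downloaded, cumulative Pagerank) points
    for the given download order."""
    total = 0.0
    points = []
    for i, page in enumerate(order, start=1):
        total += pagerank[page]
        points.append((i / len(order), total))
    return points


# Made-up Pagerank values summing to 1.
TOY_PAGERANK = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
```

A strategy that downloads important pages first gives a curve above the diagonal: here, downloading "a" and "b" first collects 75% of the total Pagerank after half of the pages. A random order would, on average, collect Pagerank in proportion to the fraction downloaded.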
34. Strategies
Oracle with Pagerank
Depth-first search
Bigger sites first
Partial pagerank calculations
38. Comparison of strategies
[Castillo et al., 2004]
[Figure]
39. Distribution of visits per level
[Baeza-Yates and Castillo, 2004]
[Figure]
40. Pagerank and depth
Cumulative Pagerank by levels in the Chilean Web
[Figure]
41. Pagerank and depth
Correlation of Pagerank and depth is low at deeper levels
[Figure]
42. First crawlers
RBSE spider - size of the Web: 100,000 pages
Internet archive crawler - www.archive.org
Webcrawler - first search engine powered by a Web crawler
Pages were a scarce resource
44. Second generation
Mercator, SPHINX - focused crawling
Lycos, Excite, Google - large-scale crawling
Parallel crawlers
Problem of abundance
46. Standard architecture
[Figure]
47. Different crawlers have different focus
Different issues
Quality: having “good resources”
Representation: having complete copies
Freshness: having updated copies
A global-scale crawler tries to balance them all
48. Taxonomy of Web crawlers
[Figure]
49. Key operations
Have I seen this URL?
Have I seen this page (or a very similar one)?
Which pages should I download next?
Store this page
Download this batch of pages
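The first two operations can be sketched with in-memory sets. This is a simplified illustration, not a production design: the class name `SeenTests` is hypothetical, and the page test only catches pages whose text is identical after lowercasing and whitespace normalization. Detecting "very similar" pages would need a technique such as shingling, which is not shown.

```python
import hashlib


class SeenTests:
    """In-memory 'have I seen this URL / this page?' tests."""

    def __init__(self):
        self.urls = set()
        self.fingerprints = set()

    def seen_url(self, url: str) -> bool:
        """Return True if the URL was recorded before; record it otherwise."""
        if url in self.urls:
            return True
        self.urls.add(url)
        return False

    def seen_page(self, text: str) -> bool:
        """Return True if a page with the same normalized text was recorded
        before; record its fingerprint otherwise."""
        normalized = " ".join(text.lower().split())
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if fingerprint in self.fingerprints:
            return True
        self.fingerprints.add(fingerprint)
        return False
```

At the scale of a large crawl these sets do not fit in memory, so the same tests are usually backed by disk-based or distributed structures.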
54. The architecture needs to be highly optimized
“While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability” [Shkapenyuk and Suel, 2002].
55. Problems arise in large crawls
Network and protocol problems
Page contents problems
Server problems
56. Network and protocol problems
Variable quality of service
Misconfigured firewalls
Crashing DNS servers
Wrong DNS servers pointing to good hosts
57. Server problems
Responses lacking headers
Fancy “error” pages
“Deep Web” pages which could be accessible otherwise
Embedded session-ids in URLs
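Session-ids embedded in URLs make the same page appear under many different addresses. One common mitigation is to strip them before the URL is compared or stored. The sketch below is an assumption-laden example: the parameter names in `SESSION_PARAMS` are typical but not exhaustive, and the function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameter names that carry session identifiers.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_ids(url: str) -> str:
    """Remove known session-id parameters from a URL."""
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key.lower() not in SESSION_PARAMS]
    # Some servers append ";jsessionid=..." to the path instead of the query.
    path = parts.path.split(";jsessionid=")[0]
    return urlunsplit((parts.scheme, parts.netloc, path,
                       urlencode(query), parts.fragment))
```

Without this kind of normalization, a "have I seen this URL?" test treats every new session as a new set of pages.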
58. Page contents problems
High prevalence of duplicates
Browsers are very tolerant
Malformed markup
Physical over logical formatting
59. Summary
Web crawling is studied at multiple levels
Long-term scheduling, page selection
Scalability, parallelization
Practical issues, network usage
63. Open problems
Scheduling using historical information
Exploiting the Web’s structure
Adversarial IR: Spam detection before downloading the pages
66. References
Baeza-Yates, R. and Castillo, C. (2004). Crawling the infinite Web: five levels are enough. In Proceedings of the third Workshop on Web Graphs (WAW), volume 3243 of Lecture Notes in Computer Science, pages 156–167, Rome, Italy. Springer.
Brewington, B., Cybenko, G., Stata, R., Bharat, K., and Maghoul, F. (2000). How dynamic is the web? In Proceedings of the Ninth Conference on World Wide Web, pages 257–276, Amsterdam, Netherlands.
Castillo, C., Marin, M., Rodriguez, A., and Baeza-Yates, R. (2004). Scheduling algorithms for Web crawling. In Latin American Web Conference (WebMedia/LA-WEB), Ribeirão Preto, Brazil. IEEE CS Press. (To appear).
Chakrabarti, S., van den Berg, M., and Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11–16):1623–1640.
Cho, J. and Garcia-Molina, H. (2003). Estimating frequency of change. ACM Transactions on Internet Technology, 3(3).
Cho, J., García-Molina, H., and Page, L. (1998). Efficient crawling through URL ordering. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia.
Craswell, N., Crimmins, F., Hawking, D., and Moffat, A. (2004). Performance and cost tradeoffs in web search. In Proceedings of the 15th Australasian Database Conference, pages 161–169, Dunedin, New Zealand.
Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the Tenth Conference on World Wide Web, pages 106–113, Hong Kong. Elsevier Science.
Koster, M. (1996). A standard for robot exclusion. http://www.robotstxt.org/wc/exclusion.html.
Lawrence, S. and Giles, C. L. (2000). Accessibility of information on the web. Intelligence, 11(1):32–39.
Lyman, P. and Varian, H. R. (2003). How much information. http://www.sims.berkeley.edu/how-much-info-2003.
Najork, M. and Wiener, J. L. (2001). Breadth-first crawling yields high-quality pages. In Proceedings of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong. Elsevier Science.
Shkapenyuk, V. and Suel, T. (2002). Design and implementation of a high-performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 357–368, San Jose, California. IEEE CS Press.