Web crawlers (also used for data scraping) are programs that extract information from the World Wide Web. This presentation aims to cover the key topics one should know before building a crawler.
3. Introduction
Crawlers exploit the graph structure of the web in a sequential, automated manner.
Also known as spiders, bots, wanderers, ants, fish, and worms.
A crawler starts from a seed page and then follows the links within it to reach other pages.
Extracted links are stored in a repository to be crawled later.
5. Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
7. Motivation For Crawlers
Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
Collect real-world data at large scale
Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
Business intelligence: keep track of potential competitors and partners
Monitor Web sites of interest
Evil: harvest emails for spamming, phishing…
Can you think of some others?
9. Building A Crawling Infrastructure
This is a sequential crawler.
Seeds can be any list of starting URLs.
The order of page visits is determined by the frontier data structure.
The stop criterion can be anything.
10. Frontier
The to-do list of a crawler, containing the URLs of unvisited pages.
It may be implemented as a FIFO queue.
Stored URLs must be unique; uniqueness can be enforced with a hash table keyed by URL, as in the sketch below.
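A minimal frontier sketch (an illustration, not code from the slides), pairing a FIFO deque with a set of already-seen URLs so each URL is enqueued at most once:

from collections import deque

frontier = deque()        # FIFO queue of unvisited URLs
seen = set()              # hash-based lookup of every URL ever enqueued

def enqueue(url):
    if url not in seen:   # keep frontier entries unique
        seen.add(url)
        frontier.append(url)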
11. Fetching Pages
We need a client that sends a request and receives a response (see the sketch below).
It should enforce timeouts.
It needs exception handling and error checking.
It must follow the Robot Exclusion Protocol.
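A hedged fetch sketch in the Python 2 style of slide 13; the timeout value is an assumption:

import urllib2

def fetch(url, timeout=10):
    try:
        # the timeout avoids hanging on unresponsive servers
        return urllib2.urlopen(url, timeout=timeout).read()
    except (urllib2.URLError, IOError):
        return None       # soft-fail: skip the page, keep crawling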
12. Parsing
Once a page has been fetched, we need to parse its content to extract information.
Parsing may involve extracting URLs, data, or both.
Extracted URLs need to be canonicalized before being added to the frontier so that the same page is not revisited.
13. A Basic Crawler in Python
Breadth-first search; queue: a FIFO list (shift and push)

from collections import deque
import urllib2
from BeautifulSoup import BeautifulSoup

seed = 'http://www.seedurl.com'
queue = deque([seed])            # FIFO frontier seeded with one URL
while len(queue) != 0:
    url = queue.popleft()        # oldest URL first = breadth-first
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for link in soup.findAll('a', href=True):
        queue.append(link['href'])   # enqueue every extracted link
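Note that this minimal loop never checks whether a URL has already been visited and never converts relative links to absolute ones; slides 14 and 18 address both gaps.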
14. URL Extraction and Canonicalization
Relative vs. Absolute URLs
The crawler must translate relative URLs into absolute URLs.
It needs to obtain the base URL from the HTTP header or an HTML meta tag, or else use the current page's path by default.
Examples
Base: http://www.cnn.com/linkto/
Relative URL: intl.html
Absolute URL: http://www.cnn.com/linkto/intl.html
Relative URL: /US/
Absolute URL: http://www.cnn.com/US/
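The standard library already implements this resolution; a short illustration (Python 2, using the slide's example base):

from urlparse import urljoin

base = 'http://www.cnn.com/linkto/'
print urljoin(base, 'intl.html')   # http://www.cnn.com/linkto/intl.html
print urljoin(base, '/US/')        # http://www.cnn.com/US/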
15. More On Canonical URLs
Convert the protocol and hostname to lowercase.
Remove the anchor or reference part of the URL.
Perform URL encoding for some commonly used characters.
Remove '..' and its parent directory from the URL.
16. More On Canonical URLs
Some transformations are trivial, for example:
http://informatics.indiana.edu/index.html#fragment
→ http://informatics.indiana.edu/index.html
http://informatics.indiana.edu/dir1/./../dir2/
→ http://informatics.indiana.edu/dir2/
http://informatics.indiana.edu/~fil/
→ http://informatics.indiana.edu/%7Efil/
http://INFORMATICS.INDIANA.EDU/fil/
→ http://informatics.indiana.edu/fil/
17. URL Extraction and Canonicalization
URL canonicalization: all of these
http://www.cnn.com/TECH/
http://WWW.CNN.COM/TECH/
http://www.cnn.com:80/TECH/
http://www.cnn.com/bogus/../TECH/
are really equivalent to this canonical form:
http://www.cnn.com/TECH/
In order to avoid duplication, the crawler must transform all URLs into canonical form.
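A hedged canonicalizer sketch (Python 2; the exact rule set is an assumption, covering several of the cases from slides 15-17):

from urlparse import urlparse, urlunparse
import posixpath

def canonicalize(url):
    parts = urlparse(url)
    scheme = parts.scheme.lower()                 # lowercase protocol
    host = (parts.hostname or '').lower()         # lowercase hostname
    if parts.port and parts.port != 80:           # drop the default port
        host = '%s:%d' % (host, parts.port)
    path = posixpath.normpath(parts.path or '/')  # collapse '.' and '..'
    if parts.path.endswith('/') and not path.endswith('/'):
        path += '/'                               # keep a trailing slash
    # '' as the last field of urlunparse drops the fragment
    return urlunparse((scheme, host, path, parts.params, parts.query, ''))

For example, canonicalize('http://WWW.CNN.COM:80/bogus/../TECH/') returns 'http://www.cnn.com/TECH/'.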
18. Implementation Issues
Don't want to fetch the same page twice!
Keep a lookup table (hash) of visited pages.
The frontier grows very fast!
May need to prioritize for large crawls.
The fetcher must be robust!
Don't crash if a download fails; use a timeout mechanism.
Determine the file type to skip unwanted files.
You can try using extensions, but they are not reliable (a Content-Type check is sketched below).
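A hedged sketch of checking the server-reported Content-Type instead of trusting the URL extension (Python 2):

import urllib2

url = 'http://www.seedurl.com'      # placeholder from slide 13
resp = urllib2.urlopen(url, timeout=10)
ctype = resp.info().gettype()       # server-reported type, e.g. 'text/html'
if ctype != 'text/html':
    resp.close()                    # skip non-HTML content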
19. More Implementation Issues
Fetching
Get only the first 10-100 KB per page.
Take care to detect and break redirection loops.
Soft-fail for timeouts, server not responding, file not found, and other errors.
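Capping the download size is straightforward, since read() accepts a byte count (a sketch; the 100 KB limit mirrors the slide):

import urllib2

url = 'http://www.seedurl.com'      # placeholder
resp = urllib2.urlopen(url, timeout=10)
head = resp.read(100 * 1024)        # read at most the first 100 KB
resp.close()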
20. More Implementation Issues: Parsing
HTML has the structure of a DOM (Document Object Model) tree.
Unfortunately, actual HTML is often incorrect in a strict syntactic sense.
Crawlers, like browsers, must be robust/forgiving.
Must pay attention to HTML entities and Unicode in text.
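BeautifulSoup (used in slide 13) is forgiving by design; a small illustration of entity conversion on malformed markup (the sample string is made up):

from BeautifulSoup import BeautifulSoup

# note the HTML entity and the unclosed <b> tag
soup = BeautifulSoup('<p>Caf&eacute; <b>unclosed',
                     convertEntities=BeautifulSoup.HTML_ENTITIES)
print u''.join(soup.findAll(text=True))   # -> Café unclosed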
22. Graph Traversal (BFS or DFS?)
Breadth-first search
Implemented with a QUEUE (FIFO)
Finds pages along shortest paths
If we start with "good" pages, this keeps us close
Depth-first search
Implemented with a STACK (LIFO)
Wanders away ("lost in cyberspace")
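With the deque frontier from slide 13, the traversal order is just a matter of which end you pop (a two-line illustration):

url = queue.popleft()   # FIFO -> breadth-first
url = queue.pop()       # LIFO (stack) -> depth-first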
23. Breadth First Search (walkthrough, slides 23-27)
[Diagram series: BFS on a small graph of URL 1 through URL 7, with nodes marked Undiscovered, Discovered, or Finished. URL 1 is the seed; after it is popped, its neighbors URL 2, URL 3, and URL 5 are discovered and enqueued in turn (Queue: URL2 URL3 URL5), each at shortest-path distance 1 from the seed.]
41. Best First Search
It assigns a score to the unvisited URLs on the page based on similarity measures.
URLs are added to the frontier, which is maintained as a priority queue.
The best URL is crawled first.
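A minimal priority-queue frontier sketch using heapq (the scoring function itself is left abstract, as in the slide):

import heapq

frontier = []                             # heap of (-score, url) pairs

def enqueue(url, score):
    # negate the score: heapq is a min-heap, we want the best first
    heapq.heappush(frontier, (-score, url))

def next_url():
    return heapq.heappop(frontier)[1]     # highest-scoring URL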
42. Shark Search
It uses a similarity measure for estimating the relevance of an unvisited URL.
Relevance is measured by:
- Anchor text
- Link context
It assigns an anchor score and a context score to each URL.
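A hedged sketch of the per-URL score in the spirit of Shark Search; sim() is a hypothetical similarity function and the weight beta is an assumption, not a value from the slides:

def sim(query, text):
    # hypothetical similarity: fraction of query terms present in text
    terms = set(query.lower().split())
    return len(terms & set(text.lower().split())) / float(len(terms) or 1)

def url_score(anchor_text, context_text, query, beta=0.8):
    # weighted mix of anchor-text relevance and surrounding-context
    # relevance; beta balances the two components
    return (beta * sim(query, anchor_text)
            + (1 - beta) * sim(query, context_text))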
44. Crawler Ethics And Conflicts
Crawlers can cause trouble, even unwillingly, if not properly designed to be "polite" and "ethical".
For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
The server administrator and users will be upset.
The crawler developer/admin IP address may be blacklisted.
45. Crawler Etiquette (Important!)
Identify yourself
Use the 'User-Agent' HTTP header to identify the crawler and link to a website with a description of the crawler and contact information for its developer.
Use the 'From' HTTP header to specify the crawler developer's email.
Do not disguise the crawler as a browser by using a browser's 'User-Agent' string.
Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem.
Pay attention to anything that may lead to too many requests to any one server, even unwillingly.
A header sketch follows.
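Setting the identifying headers with urllib2 (a sketch; the crawler name, URL, and email are placeholders):

import urllib2

# placeholder identity values; substitute your own
req = urllib2.Request('http://www.seedurl.com', headers={
    'User-Agent': 'MyCrawler/0.1 (+http://example.com/crawler.html)',
    'From': 'crawler-admin@example.com',
})
resp = urllib2.urlopen(req, timeout=10)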
46. Crawler Etiquette (Important!)
Spread the load; do not overwhelm a server
Make sure you send no more than some maximum number of requests to any single server per unit time, say < 1/second (a politeness sketch follows this slide).
Honor the Robot Exclusion Protocol
A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named 'robots.txt' placed in the HTTP root directory, e.g. http://www.google.com/robots.txt.
A crawler should always check, parse, and obey this file before sending any requests to a server.
More info at:
http://www.robotstxt.org/wc/exclusion.html
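A minimal per-host politeness sketch: wait so that no host is hit more than once per second (the delay value is an assumption):

import time
from urlparse import urlparse

last_hit = {}                           # host -> time of the last request

def polite_wait(url, delay=1.0):
    host = urlparse(url).hostname
    elapsed = time.time() - last_hit.get(host, 0)
    if elapsed < delay:
        time.sleep(delay - elapsed)     # keep under 1 request/sec per host
    last_hit[host] = time.time()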
47. More On Robot Exclusion
Make sure URLs are canonical before checking them against robots.txt.
Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler (a sketch follows).
Let's look at some examples to understand the protocol…
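Checking robots.txt with the standard-library robotparser module, caching one parsed policy per host so robots.txt is fetched only once ('MyCrawler' is a placeholder agent name):

import robotparser
from urlparse import urlparse

robots_cache = {}                   # host root -> parsed robots.txt policy

def allowed(url, agent='MyCrawler'):
    parts = urlparse(url)
    root = parts.scheme + '://' + parts.netloc
    if root not in robots_cache:    # fetch robots.txt only once per host
        rp = robotparser.RobotFileParser(root + '/robots.txt')
        rp.read()
        robots_cache[root] = rp
    return robots_cache[root].can_fetch(agent, url)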
49. Example: robots.txt for www.microsoft.com

# Robots.txt file for http://www.microsoft.com
User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
# etc…

All crawlers are not allowed in these paths.
50. Example: robots.txt for www.springer.com

# Robots.txt for http://www.springer.com (fragment)
User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

The Google crawler is allowed everywhere except these paths.
Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down.
AltaVista (scooter) has no limits.
Everyone else: keep off!