Web crawlers (also used for data scraping) are programs that extract information from the World Wide Web. This presentation aims to cover the key topics one should know before building a crawler.
3. Introduction
Crawlers exploit the graph structure of the web in a sequential, automated manner.
Also known as spiders, bots, wanderers, ants, fish, and worms.
A crawler starts from a seed page and then follows the links within it to reach other pages.
Extracted links are stored in a repository to be crawled later.
5. Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
7. Motivation For Crawlers
Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
Collect real-world data at large scale
Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
Business intelligence: keep track of potential competitors and partners
Monitor Web sites of interest
Evil: harvest emails for spamming, phishing…
Can you think of some others?
9. Building A Crawling Infrastructure
This is a sequential crawler.
Seeds can be any list of starting URLs.
The order of page visits is determined by the frontier data structure.
The stop criterion can be anything.
10. Frontier
The to-do list of a crawler, containing the URLs of unvisited pages.
It may be implemented as a FIFO queue.
Stored URLs must be unique; uniqueness can be enforced with a hash table keyed by URL, as in the sketch below.
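A minimal frontier sketch (an illustration, not code from the slides), pairing a FIFO deque with a set of already-seen URLs so each URL is enqueued at most once:

from collections import deque

frontier = deque()        # FIFO queue of unvisited URLs
seen = set()              # hash-based lookup of every URL ever enqueued

def enqueue(url):
    if url not in seen:   # keep frontier entries unique
        seen.add(url)
        frontier.append(url)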
11. Fetching Pages
We need a client that sends a request and receives a response (see the sketch below).
It should enforce timeouts.
It needs exception handling and error checking.
It must follow the Robot Exclusion Protocol.
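A hedged fetch sketch in the Python 2 style of slide 13; the timeout value is an assumption:

import urllib2

def fetch(url, timeout=10):
    try:
        # the timeout avoids hanging on unresponsive servers
        return urllib2.urlopen(url, timeout=timeout).read()
    except (urllib2.URLError, IOError):
        return None       # soft-fail: skip the page, keep crawling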
12. Parsing
Once a page has been fetched, we need to parse its content to extract information.
Parsing may involve extracting URLs, data, or both.
Extracted URLs need to be canonicalized before being added to the frontier so that the same page is not revisited.
13. A Basic Crawler in Python
Breadth-first search; queue: a FIFO list (shift and push)

from collections import deque
import urllib2
from BeautifulSoup import BeautifulSoup

seed = 'http://www.seedurl.com'
queue = deque([seed])            # FIFO frontier seeded with one URL
while len(queue) != 0:
    url = queue.popleft()        # oldest URL first = breadth-first
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for link in soup.findAll('a', href=True):
        queue.append(link['href'])   # enqueue every extracted link
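Note that this minimal loop never checks whether a URL has already been visited and never converts relative links to absolute ones; slides 14 and 18 address both gaps.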
14. URL Extraction and Canonicalization
Relative vs. Absolute URLs
The crawler must translate relative URLs into absolute URLs.
It needs to obtain the base URL from the HTTP header or an HTML meta tag, or else use the current page's path by default.
Examples
Base: http://www.cnn.com/linkto/
Relative URL: intl.html
Absolute URL: http://www.cnn.com/linkto/intl.html
Relative URL: /US/
Absolute URL: http://www.cnn.com/US/
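The standard library already implements this resolution; a short illustration (Python 2, using the slide's example base):

from urlparse import urljoin

base = 'http://www.cnn.com/linkto/'
print urljoin(base, 'intl.html')   # http://www.cnn.com/linkto/intl.html
print urljoin(base, '/US/')        # http://www.cnn.com/US/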
15. More On Canonical URLs
Convert the protocol and hostname to lowercase.
Remove the anchor or reference part of the URL.
Perform URL encoding for some commonly used characters.
Remove '..' and its parent directory from the URL.
16. More On Canonical URLs
Some transformations are trivial, for example:
http://informatics.indiana.edu/index.html#fragment
→ http://informatics.indiana.edu/index.html
http://informatics.indiana.edu/dir1/./../dir2/
→ http://informatics.indiana.edu/dir2/
http://informatics.indiana.edu/~fil/
→ http://informatics.indiana.edu/%7Efil/
http://INFORMATICS.INDIANA.EDU/fil/
→ http://informatics.indiana.edu/fil/
17. URL Extraction and Canonicalization
URL canonicalization: all of these
http://www.cnn.com/TECH/
http://WWW.CNN.COM/TECH/
http://www.cnn.com:80/TECH/
http://www.cnn.com/bogus/../TECH/
are really equivalent to this canonical form:
http://www.cnn.com/TECH/
In order to avoid duplication, the crawler must transform all URLs into canonical form.
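A hedged canonicalizer sketch (Python 2; the exact rule set is an assumption, covering several of the cases from slides 15-17):

from urlparse import urlparse, urlunparse
import posixpath

def canonicalize(url):
    parts = urlparse(url)
    scheme = parts.scheme.lower()                 # lowercase protocol
    host = (parts.hostname or '').lower()         # lowercase hostname
    if parts.port and parts.port != 80:           # drop the default port
        host = '%s:%d' % (host, parts.port)
    path = posixpath.normpath(parts.path or '/')  # collapse '.' and '..'
    if parts.path.endswith('/') and not path.endswith('/'):
        path += '/'                               # keep a trailing slash
    # '' as the last field of urlunparse drops the fragment
    return urlunparse((scheme, host, path, parts.params, parts.query, ''))

For example, canonicalize('http://WWW.CNN.COM:80/bogus/../TECH/') returns 'http://www.cnn.com/TECH/'.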
18. Implementation Issues
Don't want to fetch the same page twice!
Keep a lookup table (hash) of visited pages.
The frontier grows very fast!
May need to prioritize for large crawls.
The fetcher must be robust!
Don't crash if a download fails; use a timeout mechanism.
Determine the file type to skip unwanted files.
You can try using extensions, but they are not reliable (a Content-Type check is sketched below).
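A hedged sketch of checking the server-reported Content-Type instead of trusting the URL extension (Python 2):

import urllib2

url = 'http://www.seedurl.com'      # placeholder from slide 13
resp = urllib2.urlopen(url, timeout=10)
ctype = resp.info().gettype()       # server-reported type, e.g. 'text/html'
if ctype != 'text/html':
    resp.close()                    # skip non-HTML content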
19. More Implementation Issues
Fetching
Get only the first 10-100 KB per page.
Take care to detect and break redirection loops.
Soft-fail for timeouts, server not responding, file not found, and other errors.
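Capping the download size is straightforward, since read() accepts a byte count (a sketch; the 100 KB limit mirrors the slide):

import urllib2

url = 'http://www.seedurl.com'      # placeholder
resp = urllib2.urlopen(url, timeout=10)
head = resp.read(100 * 1024)        # read at most the first 100 KB
resp.close()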
20. More Implementation Issues: Parsing
HTML has the structure of a DOM (Document Object Model) tree.
Unfortunately, actual HTML is often incorrect in a strict syntactic sense.
Crawlers, like browsers, must be robust/forgiving.
Must pay attention to HTML entities and Unicode in text.
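BeautifulSoup (used in slide 13) is forgiving by design; a small illustration of entity conversion on malformed markup (the sample string is made up):

from BeautifulSoup import BeautifulSoup

# note the HTML entity and the unclosed <b> tag
soup = BeautifulSoup('<p>Caf&eacute; <b>unclosed',
                     convertEntities=BeautifulSoup.HTML_ENTITIES)
print u''.join(soup.findAll(text=True))   # -> Café unclosed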
22. Graph Traversal (BFS or DFS?)
Breadth-first search
Implemented with a QUEUE (FIFO)
Finds pages along shortest paths
If we start with "good" pages, this keeps us close
Depth-first search
Implemented with a STACK (LIFO)
Wanders away ("lost in cyberspace")
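With the deque frontier from slide 13, the traversal order is just a matter of which end you pop (a two-line illustration):

url = queue.popleft()   # FIFO -> breadth-first
url = queue.pop()       # LIFO (stack) -> depth-first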
23. Breadth First Search (walkthrough, slides 23-27)
[Diagram series: BFS on a small graph of URL 1 through URL 7, with nodes marked Undiscovered, Discovered, or Finished. URL 1 is the seed; after it is popped, its neighbors URL 2, URL 3, and URL 5 are discovered and enqueued in turn (Queue: URL2 URL3 URL5), each at shortest-path distance 1 from the seed.]
41. Best First Search
It assigns a score to the unvisited URLs on the page based on similarity measures.
URLs are added to the frontier, which is maintained as a priority queue.
The best URL is crawled first.
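A minimal priority-queue frontier sketch using heapq (the scoring function itself is left abstract, as in the slide):

import heapq

frontier = []                             # heap of (-score, url) pairs

def enqueue(url, score):
    # negate the score: heapq is a min-heap, we want the best first
    heapq.heappush(frontier, (-score, url))

def next_url():
    return heapq.heappop(frontier)[1]     # highest-scoring URL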
42. Shark Search
It uses a similarity measure for estimating the relevance of an unvisited URL.
Relevance is measured by:
- Anchor text
- Link context
It assigns an anchor score and a context score to each URL.
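A hedged sketch of the per-URL score in the spirit of Shark Search; sim() is a hypothetical similarity function and the weight beta is an assumption, not a value from the slides:

def sim(query, text):
    # hypothetical similarity: fraction of query terms present in text
    terms = set(query.lower().split())
    return len(terms & set(text.lower().split())) / float(len(terms) or 1)

def url_score(anchor_text, context_text, query, beta=0.8):
    # weighted mix of anchor-text relevance and surrounding-context
    # relevance; beta balances the two components
    return (beta * sim(query, anchor_text)
            + (1 - beta) * sim(query, context_text))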
44. Crawler Ethics And Conflicts
Crawlers can cause trouble, even unwillingly, if not properly designed to be "polite" and "ethical".
For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
The server administrator and users will be upset.
The crawler developer/admin IP address may be blacklisted.
45. Crawler Etiquette (Important!)
Identify yourself
Use the 'User-Agent' HTTP header to identify the crawler and link to a website with a description of the crawler and contact information for its developer.
Use the 'From' HTTP header to specify the crawler developer's email.
Do not disguise the crawler as a browser by using a browser's 'User-Agent' string.
Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem.
Pay attention to anything that may lead to too many requests to any one server, even unwillingly.
A header sketch follows.
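Setting the identifying headers with urllib2 (a sketch; the crawler name, URL, and email are placeholders):

import urllib2

# placeholder identity values; substitute your own
req = urllib2.Request('http://www.seedurl.com', headers={
    'User-Agent': 'MyCrawler/0.1 (+http://example.com/crawler.html)',
    'From': 'crawler-admin@example.com',
})
resp = urllib2.urlopen(req, timeout=10)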
46. Crawler Etiquette (Important!)
Spread the load; do not overwhelm a server
Make sure you send no more than some maximum number of requests to any single server per unit time, say < 1/second (a politeness sketch follows this slide).
Honor the Robot Exclusion Protocol
A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named 'robots.txt' placed in the HTTP root directory, e.g. http://www.google.com/robots.txt.
A crawler should always check, parse, and obey this file before sending any requests to a server.
More info at:
http://www.robotstxt.org/wc/exclusion.html
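A minimal per-host politeness sketch: wait so that no host is hit more than once per second (the delay value is an assumption):

import time
from urlparse import urlparse

last_hit = {}                           # host -> time of the last request

def polite_wait(url, delay=1.0):
    host = urlparse(url).hostname
    elapsed = time.time() - last_hit.get(host, 0)
    if elapsed < delay:
        time.sleep(delay - elapsed)     # keep under 1 request/sec per host
    last_hit[host] = time.time()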
47. More On Robot Exclusion
Make sure URLs are canonical before checking them against robots.txt.
Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler (a sketch follows).
Let's look at some examples to understand the protocol…
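Checking robots.txt with the standard-library robotparser module, caching one parsed policy per host so robots.txt is fetched only once ('MyCrawler' is a placeholder agent name):

import robotparser
from urlparse import urlparse

robots_cache = {}                   # host root -> parsed robots.txt policy

def allowed(url, agent='MyCrawler'):
    parts = urlparse(url)
    root = parts.scheme + '://' + parts.netloc
    if root not in robots_cache:    # fetch robots.txt only once per host
        rp = robotparser.RobotFileParser(root + '/robots.txt')
        rp.read()
        robots_cache[root] = rp
    return robots_cache[root].can_fetch(agent, url)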
49. Example: robots.txt for www.microsoft.com

# Robots.txt file for http://www.microsoft.com
User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
# etc…

All crawlers are not allowed in these paths.
50. Example: robots.txt for www.springer.com

# Robots.txt for http://www.springer.com (fragment)
User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

The Google crawler is allowed everywhere except these paths.
Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down.
AltaVista (scooter) has no limits.
Everyone else: keep off!