Exploring The World Wide Web
---Khusbu Mishra
---Siddhartha Anand
Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration
Introduction
Crawlers exploit the graph structure of the web in a sequential, automated manner.
Also known as spiders, bots, wanderers, ants, fish and worms.
A crawler starts from a seed page and then follows the links within it to reach other pages.
Extracted links are stored in a repository to be crawled later.
Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Motivation For Crawlers
Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
Getting real-world data on a large scale
Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
Business intelligence: keep track of potential competitors, partners
Monitor Web sites of interest
Evil: harvest emails for spamming, phishing…
Can you think of some others?
Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration
Building A Crawling Infrastructure
This is a sequential crawler
Seeds can be any list of starting URLs
Order of page visits is determined by the frontier data structure
Stop criterion can be anything
Frontier
The to-do list of a crawler, containing the URLs of unvisited pages.
It may be implemented as a FIFO queue.
Stored URLs must be unique.
Can be maintained as a hash table with the URL as key.
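
A minimal sketch of such a frontier, assuming Python 2 to match the crawler shown later in the deck; the Frontier class and its method names are illustrative. A deque gives FIFO order, and a set plays the role of the hash table that keeps URLs unique:

from collections import deque

class Frontier(object):
    """FIFO to-do list of unvisited URLs, with duplicate detection."""
    def __init__(self, seeds):
        self.queue = deque()
        self.seen = set()            # hash table keyed by URL
        for url in seeds:
            self.add(url)

    def add(self, url):
        if url not in self.seen:     # stored URLs must be unique
            self.seen.add(url)
            self.queue.append(url)

    def next(self):
        return self.queue.popleft()  # oldest URL first (FIFO)

    def __len__(self):
        return len(self.queue)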
Fetching Pages
We need a client that sends a request and receives a response.
It should have timeouts.
It needs exception handling and error checking.
It must follow the Robot Exclusion Principle.
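
A hedged sketch of such a client with urllib2 (the library used later in the deck); the 10-second timeout and the 100 KB size cap are illustrative choices, not fixed rules:

import urllib2

def fetch(url, timeout=10, max_bytes=100 * 1024):
    """Fetch a page with a timeout, a size cap, and error checking."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)
        return response.read(max_bytes)   # only the first ~100 KB
    except urllib2.HTTPError, e:
        print 'HTTP error %d for %s' % (e.code, url)
    except urllib2.URLError, e:
        print 'Failed to reach %s: %s' % (url, e.reason)
    return None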
Parsing
Once a page has been fetched, we need to parse its content to extract information.
Parsing may mean extracting URLs, or extracting data.
Each extracted URL needs to be canonicalised before being added to the frontier, so that the same page is not revisited.
A Basic Crawler in Python
Breadth First Search: the queue is a FIFO list (shift and push)

import urllib2
from collections import deque
from BeautifulSoup import BeautifulSoup

seed = 'http://www.seedurl.com'      # hypothetical seed URL
queue = deque([seed])                # frontier: FIFO queue
while len(queue) != 0:
    url = queue.popleft()            # shift: take the oldest URL
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for link in soup.findAll('a'):
        href = link.get('href')      # extract the link target
        if href is not None:
            queue.append(href)       # push: add it to the frontier
URL Extraction and Canonicalization
Relative vs. Absolute URLs
The crawler must translate relative URLs into absolute URLs.
It needs to obtain the base URL from the HTTP header, or an HTML meta tag, or else use the current page path by default.
Examples
Base: http://www.cnn.com/linkto/
Relative URL: intl.html
Absolute URL: http://www.cnn.com/linkto/intl.html
Relative URL: /US/
Absolute URL: http://www.cnn.com/US/
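
In Python this translation is one call; a small sketch with the standard urlparse module (Python 2; urllib.parse in Python 3), using the example URLs above:

import urlparse

base = 'http://www.cnn.com/linkto/intl.html'   # the page the links came from
print urlparse.urljoin(base, 'intl.html')      # http://www.cnn.com/linkto/intl.html
print urlparse.urljoin(base, '/US/')           # http://www.cnn.com/US/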
More On Canonical URLs
Convert the protocol and hostname to lowercase.
Remove the anchor or reference part of the URL.
Perform URL encoding for some commonly used characters.
Remove ‘..’ and its parent directory from the URL.
More On Canonical URLs
Some transformations are trivial, for example:
http://informatics.indiana.edu/index.html#fragment
→ http://informatics.indiana.edu/index.html
http://informatics.indiana.edu/dir1/./../dir2/
→ http://informatics.indiana.edu/dir2/
http://informatics.indiana.edu/~fil/
→ http://informatics.indiana.edu/%7Efil/
http://INFORMATICS.INDIANA.EDU/fil/
→ http://informatics.indiana.edu/fil/
URL Extraction and
Canonicalization
URL canonicalization
All of these:
http://www.cnn.com/TECH/
http://WWW.CNN.COM/TECH/
http://www.cnn.com:80/TECH/
http://www.cnn.com/bogus/../TECH/
are really equivalent to this canonical form:
http://www.cnn.com/TECH/
In order to avoid duplication, the crawler must transform all URLs into canonical form.
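
A hedged sketch of such a canonicalizer, assuming Python 2's urlparse and posixpath modules; it covers only the rules listed above, while production crawlers apply many more:

import urlparse, posixpath

def canonicalize(url):
    parts = urlparse.urlparse(url)
    scheme = parts.scheme.lower()                 # lowercase protocol
    netloc = parts.netloc.lower()                 # lowercase hostname
    if netloc.endswith(':80'):                    # drop the default HTTP port
        netloc = netloc[:-3]
    path = posixpath.normpath(parts.path or '/')  # resolve '.' and '..'
    if parts.path.endswith('/') and not path.endswith('/'):
        path += '/'                               # normpath strips trailing '/'
    # empty last field drops the fragment (anchor)
    return urlparse.urlunparse((scheme, netloc, path, parts.params, parts.query, ''))

print canonicalize('http://WWW.CNN.COM:80/bogus/../TECH/#top')
# http://www.cnn.com/TECH/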
Implementation Issues
Don’t want to fetch the same page twice!
Keep a lookup table (hash) of visited pages.
The frontier grows very fast!
May need to prioritize for large crawls.
The fetcher must be robust!
Don’t crash if a download fails.
Use a timeout mechanism.
Determine the file type to skip unwanted files.
Can try using extensions, but they are not reliable.
More Implementation Issues
Fetching
Get only the first 10-100 KB per page.
Take care to detect and break redirection loops.
Soft fail for timeouts, a server not responding, file not found, and other errors.
More Implementation Issues: Parsing
HTML has the structure of a DOM (Document Object Model) tree.
Unfortunately, actual HTML is often incorrect in a strict syntactic sense.
Crawlers, like browsers, must be robust/forgiving.
Must pay attention to HTML entities and unicode in text.
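
For example, with the BeautifulSoup 3 library used earlier in the deck, entity conversion is opt-in; a small sketch on made-up markup, assuming BeautifulSoup 3's convertEntities option:

from BeautifulSoup import BeautifulSoup

html = '<p>Fish &amp; chips &#8212; 5&euro;</p>'  # entities and a non-ASCII symbol
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
print repr(soup.p.string)  # u'Fish & chips \u2014 5\u20ac'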
Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration
Graph Traversal (BFS or DFS?)
Breadth First Search
Implemented with a QUEUE (FIFO)
Finds pages along shortest paths
If we start with “good” pages, this keeps us close
Depth First Search
Implemented with a STACK (LIFO)
Wanders away (“lost in cyberspace”)
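
Both traversals fit one sketch; the get_links(url) helper returning a page's outlinks is an assumption, and the only difference between BFS and DFS is which end of the frontier the next URL comes from:

from collections import deque

def crawl(seeds, get_links, breadth_first=True):
    """Generic traversal: FIFO queue = BFS, LIFO stack = DFS."""
    frontier = deque(seeds)
    seen = set(seeds)
    while frontier:
        # BFS pops the oldest URL, DFS the newest
        url = frontier.popleft() if breadth_first else frontier.pop()
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)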
Breadth First Search: a worked example
(The original slides animate BFS over a small web graph of seven pages, URL1–URL7, with the seed URL1 at distance 0; each node is marked Undiscovered, Discovered, or Finished, and labelled with its shortest-path distance from the seed.)
Step 1: Enqueue the seed. Queue: URL1
Step 2: Pop URL1; discover URL2, URL3, URL5 (distance 1). Queue: URL2 URL3 URL5
Step 3: Pop URL2; discover URL4 (distance 2); URL5 is already discovered, so don't enqueue it again. Queue: URL3 URL5 URL4
Step 4: Pop URL3; discover URL6 (distance 2). Queue: URL5 URL4 URL6
Step 5: Pop URL5; no new links. Queue: URL4 URL6
Step 6: Pop URL4; no new links. Queue: URL6
Step 7: Pop URL6; discover URL7 (distance 3). Queue: URL7
Step 8: Pop URL7; the queue is empty and the crawl is finished.
BFS vs. DFS
Best First Search
It assigns a score to the unvisited URLs on a page, based on similarity measures.
URLs are added to the frontier, which is maintained as a priority queue.
The best URL is crawled first.
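
A minimal sketch of such a priority-queue frontier using heapq; the score passed to add() stands in for whatever similarity measure the crawler chooses, and since heapq is a min-heap the score is negated so the best URL pops first:

import heapq

class BestFirstFrontier(object):
    """Priority-queue frontier: the highest-scoring URL comes out first."""
    def __init__(self):
        self.heap = []

    def add(self, url, score):
        heapq.heappush(self.heap, (-score, url))  # negate: heapq is a min-heap

    def next(self):
        score, url = heapq.heappop(self.heap)
        return url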
Shark Search
It uses a similarity measure to estimate the relevance of an unvisited URL.
Relevance is measured by:
- Anchor text
- The text surrounding the link (its context)
It assigns an anchor score and a context score to each URL.
Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration
Crawler Ethics And Conflicts
Crawlers can cause trouble, even unwillingly, if not properly designed to be “polite” and “ethical”.
For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
Server administrators and users will be upset.
The crawler developer/admin IP address may be blacklisted.
Crawler Etiquette (Important!)
Identify yourself
Use the ‘User-Agent’ HTTP header to identify the crawler, and point to a website with a description of the crawler and contact information for the crawler developer
Use the ‘From’ HTTP header to specify the crawler developer's email
Do not disguise the crawler as a browser by using a browser's ‘User-Agent’ string
Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem
Pay attention to anything that may lead to too many requests to any one server, even unwillingly
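
A hedged sketch of the first two rules with urllib2; the crawler name, info URL, and email address are placeholders, not real values:

import urllib2

request = urllib2.Request('http://www.example.com/')
# Identify the crawler and its operator (placeholder values)
request.add_header('User-Agent', 'ExampleCrawler/1.0 (+http://www.example.com/crawler.html)')
request.add_header('From', 'crawler-admin@example.com')
response = urllib2.urlopen(request, timeout=10)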
Crawler Etiquette (Important!)
Spread the load, do not overwhelm a server
Send no more than some maximum number of requests to any single server per unit time, say < 1/second
Honor the Robot Exclusion Protocol
A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named ‘robots.txt’ placed in the HTTP root directory
A crawler should always check, parse, and obey this file before sending any requests to a server
More info at:
http://www.google.com/robots.txt
http://www.robotstxt.org/wc/exclusion.html
More On Robot Exclusion
Make sure URLs are canonical before checking against robots.txt
Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler
Let’s look at some examples to understand the protocol…
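
First, a hedged sketch of these rules with Python 2's robotparser module: the parsed policy is cached per host, and a minimum per-server delay (an illustrative 1 second) spreads the load as the previous slide asks:

import time
import urlparse
import robotparser

robots_cache = {}      # host -> RobotFileParser, fetched once per server
last_request = {}      # host -> time of last request, for politeness

def allowed(url, agent='ExampleCrawler'):
    host = urlparse.urlparse(url).netloc
    if host not in robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url('http://%s/robots.txt' % host)
        rp.read()                       # fetch robots.txt once, then cache
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(agent, url)

def wait_politely(url, delay=1.0):
    host = urlparse.urlparse(url).netloc
    elapsed = time.time() - last_request.get(host, 0)
    if elapsed < delay:
        time.sleep(delay - elapsed)     # < 1 request/second per server
    last_request[host] = time.time()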
www.apple.com/robots.txt

# robots.txt for http://www.apple.com/
User-agent: *
Disallow:

All crawlers… …can go anywhere!
www.microsoft.com/robots.txt

# Robots.txt file for http://www.microsoft.com
User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
# etc…

All crawlers… …are not allowed in these paths…
www.springer.com/robots.txt

# Robots.txt for http://www.springer.com (fragment)
User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*
User-agent: slurp
Disallow:
Crawl-delay: 2
User-agent: MSNBot
Disallow:
Crawl-delay: 2
User-agent: scooter
Disallow:
# all others
User-agent: *
Disallow: /

The Google crawler is allowed everywhere except these paths
Yahoo and MSN/Windows Live are allowed everywhere but should slow down
AltaVista has no limits
Everyone else keep off!
Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration
Demonstration
Crawler
Data
Visualize Graph (using Gephi)
Thank You!