2. What Is a Crawler?
Crawlers walk the network, search whatever they
find, and do whatever they want...
Search engine
Data finder / collector
Anything else...
3. Concept
A crawler can easily be separated into three
steps...
Download
Data operation
Find the next seed
4. Pseudo Code
Fetch the web page, parse it, extract the useful
information, and repeat.
for url in nextSeed():
    info = fetch(url)
    data, seeds = operate(info)
    pushSeed(seeds)
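The pseudocode above can be turned into a minimal runnable sketch; fetch and operate here are stand-ins (no real network access), and the URLs are made up:

```python
seeds = ["http://example.com/a", "http://example.com/b"]  # hypothetical start seeds
collected = []

def fetch(url):
    # Stand-in for a real HTTP request.
    return "<html>%s</html>" % url

def operate(info):
    # Stand-in: a real version would extract data and discovered links.
    return info.upper(), []

for url in seeds:                    # for url in nextSeed():
    info = fetch(url)
    data, new_seeds = operate(info)
    collected.append(data)
    seeds.extend(new_seeds)          # pushSeed(seeds)
```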
5. Greedy
But the easy-looking steps are always hard to
solve...
Web servers always block the crawler!
The data is almost never structured!
How do we find the next seed?
The crawler is always bounded by network
speed...
6. Operation
When we connect to the target...
Download the web page, parse the HTML
code
Download the database, parse the DB
format
Finally, record everything into our DB
7. Pseudo Code
Parse the HTML code, for example, to search
for what you need...
from BeautifulSoup import *

soup = BeautifulSoup(webpage)
## Print the main body
print soup.html.body
## Print the first tag <a> in body
print soup.html.body.a
## Find the particular tag
tags = soup.findAll('form')
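The snippet above uses the old BeautifulSoup 3 API. If that package is not available, the same kind of tag search can be sketched in Python 3 with the standard library's html.parser (the sample page below is invented):

```python
from html.parser import HTMLParser

class TagFinder(HTMLParser):
    """Collect the attributes of every occurrence of one tag,
    roughly like soup.findAll(name)."""
    def __init__(self, name):
        super().__init__()
        self.name = name
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == self.name:
            self.found.append(dict(attrs))

webpage = '<html><body><a href="/x">x</a><form action="/login"></form></body></html>'
finder = TagFinder("form")
finder.feed(webpage)
# finder.found now holds the attributes of each <form> tag
```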
8. Operation (cont’d)
Moreover, you can also do something else, like a
payload, when operating on the web page...
Post / Get requests based on the HTML
Find the next seed on the web page
Something good / bad
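For the POST / GET bullet, a Python 3 sketch with urllib: the form fields and URL are invented, and nothing is actually sent (urlopen is left commented out):

```python
import urllib.parse
import urllib.request

form = {"user": "guest", "page": "2"}          # hypothetical form fields
query = urllib.parse.urlencode(form)

# GET: parameters travel in the URL.
get_req = urllib.request.Request("http://example.com/search?" + query)

# POST: parameters travel in the request body.
post_req = urllib.request.Request("http://example.com/search",
                                  data=query.encode())

# urllib.request.urlopen(post_req) would actually submit the form.
```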
9. Link to Site
Before we can operate on the web page, we need to...
Connect to the web site
Get the web page
But server admins hate web crawlers, because crawlers...
Provide no functionality
Slow down / burn out the resources
Act like thieves
10. Fetch
If you are not Google,
you must be a human
11. Be a Human
Behave like a human being...
No one can press anything in under 0.11
seconds
No one can read a page in only a few seconds
No one can work all day long
12. Rules
Use a framework / tool to emulate the
browser
Change the default settings
Simulate an existing browser
Cookie support
Timing issues and random variables
13. Pseudo Code
Simple fetch code
import urllib2
from cookielib import CookieJar
import time, random

for n in range(MAX_LOOP):
    ## Cookie
    ck = CookieJar()
    ck = urllib2.HTTPCookieProcessor(ck)
    req = urllib2.build_opener(ck)
    ## User-Agent
    req.addheaders = [('User-Agent', 'crawlercmj')]
    data = req.open(url).read()
    ## Wait
    time.sleep(random.randint(0, 5))
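The code above is Python 2 (urllib2 and cookielib were renamed in Python 3). A sketch of the same setup for Python 3 follows; the opener is only built here, and the actual request is left commented out:

```python
import random
import time
import urllib.request
from http.cookiejar import CookieJar

## Cookie support
ck = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(ck))
## User-Agent
opener.addheaders = [('User-Agent', 'crawlercmj')]
## data = opener.open(url).read()   # would perform the real fetch
## Wait a random interval between requests
time.sleep(random.uniform(0, 0.2))
```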
14. Seed
The last one, but the hardest one...
We never know the
next sheep
15. Find Sheep
Using a well-known search engine
But search engines also block other crawlers
The crawler needs to parse the garbage
code
The result may be JS code...
Using a random / enumeration method
Too hard to find a useful target
Costs a lot of time
Cannot catch the sheep immediately
16. Search-Engine Based
Design another crawler
Give the initial keyword as the seed
Fetch the search engine
Parse the result, and get the next seed if
possible
Repeat until stopped or blocked.
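The loop on this slide can be sketched as follows. search_engine is a stub returning invented results; a real version would fetch and parse an actual results page:

```python
from collections import deque

def search_engine(keyword):
    # Stub: hypothetical result keywords for each query.
    fake_results = {
        "crawler": ["spider", "robots.txt"],
        "spider": ["crawler"],       # loops back; the visited set stops it
        "robots.txt": [],
    }
    return fake_results.get(keyword, [])

def crawl(initial_keyword, max_fetches=100):
    queue = deque([initial_keyword])   # the initial keyword is the seed
    visited = set()
    while queue and len(visited) < max_fetches:   # stop (or get blocked)
        seed = queue.popleft()
        if seed in visited:
            continue
        visited.add(seed)
        for nxt in search_engine(seed):   # parse the result, get next seeds
            if nxt not in visited:
                queue.append(nxt)
    return visited
```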
17. Tricky
Using a distributed model
Separate each part
More volunteers can speed it up
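With only the standard library, the speed-up from splitting the work can be sketched with a thread pool; fetch is again a stub, and real cross-machine distribution would need a tool such as Pyro4:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stub standing in for a slow network request.
    return "page:" + url

seeds = ["http://example.com/%d" % i for i in range(8)]  # hypothetical seeds

# Each worker handles part of the seed list; more workers give more
# speed-up until the network is saturated.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, seeds))
```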
18. Pyro4
Pyro4 can help you remotely control Python
objects...
Expose objects so they can be accessed as if
they were local
Use remote resources to do the processing
Provides an M-n model