Crawler
 @hack-stuff.com
  Anything can be a crawler


November 11, 2012




What Is a Crawler?

 A crawler walks the network, searches whatever it
 finds, and does whatever its author wants with it...
     Search engine
     Data finder / collector
     Anything else...




Concept

 A crawler can easily be separated into three
 steps...
     Download
     Data operation
     Find the next seed




Pseudo Code
 Fetch the web page, parse it, extract the useful
 information, and repeat.


 for url in nextSeed():
     info = fetch(url)
     data, seeds = operate(info)
     pushSeed(seeds)
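
 The loop above can be made runnable roughly as follows. This is only a sketch: FAKE_WEB is a hypothetical in-memory stand-in for real downloads, the function bodies are illustrative, and a queue plus a seen-set plays the role of the slide's nextSeed() / pushSeed().

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["c"]),
    "c": ("page C", ["a"]),
}

def fetch(url):
    """Download step (stubbed: looks the page up in FAKE_WEB)."""
    return FAKE_WEB.get(url, ("", []))

def operate(info):
    """Data-operation step: return (data, seeds) for one page."""
    text, links = info
    return text, links

def crawl(start, limit=100):
    """Visit pages breadth-first, never queueing the same seed twice."""
    queue = deque([start])   # pending seeds
    seen = {start}
    results = []
    while queue and len(results) < limit:
        url = queue.popleft()
        data, seeds = operate(fetch(url))
        results.append((url, data))
        for seed in seeds:
            if seed not in seen:
                seen.add(seed)
                queue.append(seed)
    return results
```

 The seen-set matters: without it, the cycle a -> c -> a would keep the loop running until the limit.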




Greedy
 But simple things are often the hardest to get
 right...
     Web servers often block crawlers!
     Data is rarely structured!
     How do we find the next seed?
     A crawler is always bounded by network
     speed...



Operation
 Once we are connected to the target...
     Download the web page, parse the HTML
     code
     Download the database, parse the DB
     format
     Finally, record everything in our DB
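
 The last step, recording into our own DB, might look like the sketch below. It assumes SQLite from the standard library and a hypothetical one-table schema (pages); the slides do not say which database was used.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) pages table if missing."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    return db

def record(db, url, body):
    """Insert one fetched page, overwriting any earlier copy of the same URL."""
    db.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body)
    )
    db.commit()

db = open_store()
record(db, "http://example.com/", "<html>...</html>")
record(db, "http://example.com/", "<html>updated</html>")
```

 Keying on the URL means a re-crawled page replaces its old copy instead of piling up duplicates.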




Pseudo Code
 Parse the HTML code, for example to search for
 what you need...


 # Python 2 with BeautifulSoup 3 (module name BeautifulSoup)
 from BeautifulSoup import *

 # webpage: the HTML text returned by the fetch step
 soup = BeautifulSoup(webpage)
 ## Print the main body
 print soup.html.body
 ## Print the first <a> tag in the body
 print soup.html.body.a
 ## Find a particular tag
 tags = soup.findAll('form')




Operation (cont’d)

 Beyond that, you can also do other things while
 operating on the web page, such as sending a payload...
     Send a POST / GET request, using the method given in the HTML
     Find the next seed on the web page
     Something good / bad
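
 Finding the next seeds and the form methods on a page can be sketched with the standard library alone. SeedFinder and the sample page are hypothetical names for illustration; the original slides use BeautifulSoup instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class SeedFinder(HTMLParser):
    """Collect link targets (next seeds) and form (method, action) pairs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "form":
            # HTML forms default to GET when no method is given.
            method = (attrs.get("method") or "get").upper()
            action = urljoin(self.base_url, attrs.get("action") or "")
            self.forms.append((method, action))

page = '<a href="/next">n</a><form method="post" action="/login"></form>'
finder = SeedFinder("http://example.com/index.html")
finder.feed(page)
```

 finder.links then holds the candidate seeds, and finder.forms tells the crawler whether to POST or GET when it submits something.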




Link to Site
 Before we can operate on the web page, we need to...
     Connect to the web site
     Get the web page
 But server admins hate web crawlers, because they
     Serve no purpose for the site
     Slow down / exhaust its resources
     Act like thieves



Fetch


   If you are not Google,
  you must look human



Be a Human

 Behave the way a human being would...
     No one can click anything in under 0.11
     seconds
     No one can read a page in just a few seconds
     No one can work all day
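
 These three limits can be turned into a small pacing policy. Only the 0.11-second floor comes from the slide; the reading-time range and working hours below are assumptions chosen for illustration, as are the function names.

```python
import random

MIN_CLICK = 0.11           # seconds; from the slide: no one clicks faster
READ_RANGE = (5, 30)       # assumed reading time per page, in seconds
WORK_HOURS = range(9, 18)  # assumed "working day", in local hours

def next_delay(rng=random):
    """Seconds to wait before the next request, never below MIN_CLICK."""
    return max(MIN_CLICK, rng.uniform(*READ_RANGE))

def is_awake(hour):
    """True if a human would plausibly be browsing at this hour."""
    return hour in WORK_HOURS
```

 A crawler would call time.sleep(next_delay()) between requests and pause entirely while is_awake(time.localtime().tm_hour) is false.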




Rules
 Use a framework / tool to emulate the
 browser
     Change the default settings
     Simulate an existing browser
     Cookie support
     Timing and random variation




Pseudo Code
 Simple fetch code


 # Python 2 (urllib2 and cookielib), as on the original slide
 import urllib2
 from cookielib import CookieJar
 import time, random

 MAX_LOOP = 10                  # placeholder: number of requests
 url = 'http://example.com/'    # placeholder: target page

 ## Cookie: build the opener once so cookies persist across requests
 ck = CookieJar()
 ck = urllib2.HTTPCookieProcessor(ck)
 req = urllib2.build_opener(ck)
 ## User-Agent
 req.addheaders = [('User-Agent', 'crawler cmj')]

 for n in range(MAX_LOOP):
     data = req.open(url).read()
     ## Wait a random interval between requests
     time.sleep(random.randint(0, 5))

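
 On Python 3 the same idea can be sketched with urllib.request and http.cookiejar, which replace urllib2 and cookielib. build_browser, random_wait, the User-Agent string, and the wait bounds are illustrative assumptions, not taken from the slides.

```python
import random
import urllib.request
from http.cookiejar import CookieJar

def build_browser(user_agent):
    """Return (opener, jar): an opener with cookie support and a custom User-Agent."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replaces the default "Python-urllib/x.y" header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar

def random_wait(low=1.0, high=5.0):
    """Pick a random pause in seconds; the bounds are assumptions."""
    return random.uniform(low, high)

opener, jar = build_browser("Mozilla/5.0 (X11; Linux x86_64)")
# Usage (needs network): data = opener.open(url).read(); time.sleep(random_wait())
```

 Building the opener once and reusing it keeps the cookie jar alive across requests, which is the point of the "Cookie support" rule.
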
Seed

 The last step, but the hardest one...


 We never know where the
 next sheep is


Find Sheep
 Using a well-known search engine
     Search engines also block other crawlers
     The crawler has to parse garbage
     code
     The result may be JS code...
 Using random / enumeration methods
     Too hard to find a useful target
     Costs lots of time
     Cannot catch a sheep right away

Search Engine Based
 Design another crawler
     Take an initial keyword as the seed
     Query the search engine
     Parse the results, and get the next seed if
     possible
     Repeat until stopped or blocked.
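
 The steps above can be sketched as a keyword-driven loop. Everything here is hypothetical scaffolding: FAKE_INDEX and search() stand in for fetching and parsing a real results page, and Blocked models the engine refusing to answer.

```python
class Blocked(Exception):
    """Raised by the (stubbed) search engine when it refuses to answer."""

# Hypothetical search index: keyword -> list of (url, related keyword).
FAKE_INDEX = {
    "crawler": [("http://a.example/", "spider"), ("http://b.example/", "bot")],
    "spider": [("http://c.example/", "crawler")],
}

def search(keyword):
    """Stand-in for querying the engine and parsing its results."""
    if keyword == "bot":
        raise Blocked(keyword)
    return FAKE_INDEX.get(keyword, [])

def find_sheep(keyword, limit=50):
    """Collect target URLs, following related keywords as new seeds."""
    pending, seen, targets = [keyword], {keyword}, []
    while pending and len(targets) < limit:
        try:
            results = search(pending.pop(0))
        except Blocked:
            break                      # stop once the engine blocks us
        for url, related in results:
            targets.append(url)
            if related not in seen:    # the next seed, if there is one
                seen.add(related)
                pending.append(related)
    return targets
```

 Starting from "crawler", the loop harvests three URLs and stops when the query for "bot" is blocked.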




Tricky

 Use a distributed model
     Separate each part
     More volunteers speed things up
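
 On a single machine the same split can be approximated with a thread pool, as a rough sketch. fetch and operate are stubs so the example runs offline, and worker threads stand in for the volunteers; a real deployment would place the stages on different hosts.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Download stage (stubbed so the sketch runs offline)."""
    return "<html>%s</html>" % url

def operate(page):
    """Data-operation stage: here, just measure the page."""
    return len(page)

def crawl_parallel(urls, workers=4):
    # Each stage is a separate function, so either could move to another
    # machine; threads stand in for the "volunteers".
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch, urls))
        return list(pool.map(operate, pages))
```

 Because crawling is bounded by network speed, adding workers mostly helps the download stage.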




Pyro4
 Pyro4 lets you control Python objects
 remotely...
     An exposed object can be accessed as if it were on the
     local side
     Use remote resources for processing
     Provides the M-N model




Thanks for participating
        Q&A



