2. What Is a Crawler?
Crawlers walk the network, search whatever they
find, and do whatever they want...
Search engine
Data finder / collector
Anything else...
3. Concept
A crawler can easily be separated into three
steps...
Download
Data operation
Find the next seed
4. Pseudo Code
Fetch the web page, parse it, extract the useful
information, and repeat.
for url in nextSeed():
    info = fetch(url)
    data, seeds = operate(info)
    pushSeed(seeds)
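The pseudocode above can be turned into a minimal runnable sketch; fetch and operate here are stand-ins (no real network access), and the URLs are made up:

```python
seeds = ["http://example.com/a", "http://example.com/b"]  # hypothetical start seeds
collected = []

def fetch(url):
    # Stand-in for a real HTTP request.
    return "<html>%s</html>" % url

def operate(info):
    # Stand-in: a real version would extract data and discovered links.
    return info.upper(), []

for url in seeds:                    # for url in nextSeed():
    info = fetch(url)
    data, new_seeds = operate(info)
    collected.append(data)
    seeds.extend(new_seeds)          # pushSeed(seeds)
```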
5. Greedy
But the easy-looking steps are always hard to
solve...
Web servers always block the crawler!
The data is almost never structured!
How do we find the next seed?
The crawler is always bounded by network
speed...
6. Operation
When we connect to the target...
Download the web page, parse the HTML
code
Download the database, parse the DB
format
Finally, record everything into our DB
7. Pseudo Code
Parse the HTML code, for example, to search
for what you need...
from BeautifulSoup import *

soup = BeautifulSoup(webpage)
## Print the main body
print soup.html.body
## Print the first tag <a> in body
print soup.html.body.a
## Find the particular tag
tags = soup.findAll('form')
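The snippet above uses the old BeautifulSoup 3 API. If that package is not available, the same kind of tag search can be sketched in Python 3 with the standard library's html.parser (the sample page below is invented):

```python
from html.parser import HTMLParser

class TagFinder(HTMLParser):
    """Collect the attributes of every occurrence of one tag,
    roughly like soup.findAll(name)."""
    def __init__(self, name):
        super().__init__()
        self.name = name
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == self.name:
            self.found.append(dict(attrs))

webpage = '<html><body><a href="/x">x</a><form action="/login"></form></body></html>'
finder = TagFinder("form")
finder.feed(webpage)
# finder.found now holds the attributes of each <form> tag
```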
8. Operation (cont’d)
Moreover, you can also do something else, like a
payload, when operating on the web page...
Post / Get requests based on the HTML
Find the next seed on the web page
Something good / bad
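For the POST / GET bullet, a Python 3 sketch with urllib: the form fields and URL are invented, and nothing is actually sent (urlopen is left commented out):

```python
import urllib.parse
import urllib.request

form = {"user": "guest", "page": "2"}          # hypothetical form fields
query = urllib.parse.urlencode(form)

# GET: parameters travel in the URL.
get_req = urllib.request.Request("http://example.com/search?" + query)

# POST: parameters travel in the request body.
post_req = urllib.request.Request("http://example.com/search",
                                  data=query.encode())

# urllib.request.urlopen(post_req) would actually submit the form.
```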
9. Link to Site
Before we can operate on the web page, we need to...
Connect to the web site
Get the web page
But server admins hate web crawlers, because crawlers...
Provide no functionality
Slow down / burn out the resources
Act like thieves
10. Fetch
If you are not Google,
you must be a human
11. Be a Human
Behave like a human being...
No one can press anything in under 0.11
seconds
No one can read a page in only a few seconds
No one can work all day long
12. Rules
Use a framework / tool to emulate the
browser
Change the default settings
Simulate an existing browser
Cookie support
Timing issues and random variables
13. Pseudo Code
Simple fetch code
import urllib2
from cookielib import CookieJar
import time, random

for n in range(MAX_LOOP):
    ## Cookie
    ck = CookieJar()
    ck = urllib2.HTTPCookieProcessor(ck)
    req = urllib2.build_opener(ck)
    ## User-Agent
    req.addheaders = [('User-Agent', 'crawlercmj')]
    data = req.open(url).read()
    ## Wait
    time.sleep(random.randint(0, 5))
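The code above is Python 2 (urllib2 and cookielib were renamed in Python 3). A sketch of the same setup for Python 3 follows; the opener is only built here, and the actual request is left commented out:

```python
import random
import time
import urllib.request
from http.cookiejar import CookieJar

## Cookie support
ck = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(ck))
## User-Agent
opener.addheaders = [('User-Agent', 'crawlercmj')]
## data = opener.open(url).read()   # would perform the real fetch
## Wait a random interval between requests
time.sleep(random.uniform(0, 0.2))
```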
14. Seed
The last one, but the hardest one...
We never know the
next sheep
15. Find Sheep
Using a well-known search engine
But search engines also block other crawlers
The crawler needs to parse the garbage
code
The result may be JS code...
Using a random / enumeration method
Too hard to find a useful target
Costs a lot of time
Cannot catch the sheep immediately
16. Search-Engine Based
Design another crawler
Give the initial keyword as the seed
Fetch the search engine
Parse the result, and get the next seed if
possible
Repeat until stopped or blocked.
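The loop on this slide can be sketched as follows. search_engine is a stub returning invented results; a real version would fetch and parse an actual results page:

```python
from collections import deque

def search_engine(keyword):
    # Stub: hypothetical result keywords for each query.
    fake_results = {
        "crawler": ["spider", "robots.txt"],
        "spider": ["crawler"],       # loops back; the visited set stops it
        "robots.txt": [],
    }
    return fake_results.get(keyword, [])

def crawl(initial_keyword, max_fetches=100):
    queue = deque([initial_keyword])   # the initial keyword is the seed
    visited = set()
    while queue and len(visited) < max_fetches:   # stop (or get blocked)
        seed = queue.popleft()
        if seed in visited:
            continue
        visited.add(seed)
        for nxt in search_engine(seed):   # parse the result, get next seeds
            if nxt not in visited:
                queue.append(nxt)
    return visited
```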
17. Tricky
Using a distributed model
Separate each part
More volunteers can speed it up
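With only the standard library, the speed-up from splitting the work can be sketched with a thread pool; fetch is again a stub, and real cross-machine distribution would need a tool such as Pyro4:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stub standing in for a slow network request.
    return "page:" + url

seeds = ["http://example.com/%d" % i for i in range(8)]  # hypothetical seeds

# Each worker handles part of the seed list; more workers give more
# speed-up until the network is saturated.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, seeds))
```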
18. Pyro4
Pyro4 can help you remotely control Python
objects...
Expose objects so they can be accessed as if
they were local
Use remote resources to do the processing
Provides an M-n model