Scraping with Python
     for Fun & Profit



@PyCon India 2010          Abhishek Mishra
http://in.pycon.org   hello@ideamonk.com
Now... I want you to
put your _data_ on
the web!




                photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
There is still a
                                                             HUGE
                                                       UNLOCKED
                                                      POTENTIAL!!

search, analyze, visualize, correlate

            photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
Haven’t got DATA
on the web as
DATA!




               photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
FRUSTRATION !!!
1. will take a lot of time

2. the data isn’t consumable

3. you want it right now!

4. missing source




                    photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
Solution?

        just scrape it out!
         1. analyze the request
         2. find patterns
         3. extract the data
        Use cases: the good, the bad, the ugly


                             photo credit - Angie Pangie
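The three steps above can be sketched with nothing but the standard library: take the fetched HTML (here a hard-coded sample standing in for a real response, with hypothetical markup and values), find patterns, extract the data:

```python
import re

# hypothetical sample of the HTML a results page might return
html = """<div id="results">
<h1>Lohit Narayan</h1>
Registration Number : 1234<br />
<table>
<tr><td><b>CGPA</b></td><td><b>9.3</b></td></tr>
</table>
</div>"""

# 2. find patterns: the name sits in <h1>, the CGPA in the <b> after <b>CGPA</b>
name = re.search(r"<h1>(.*?)</h1>", html).group(1)
regno = re.search(r"Registration Number\s*:\s*(\d+)", html).group(1)
cgpa = re.search(r"<b>CGPA</b></td><td><b>([\d.]+)</b>", html).group(1)

# 3. extract the data
print(name, regno, cgpa)  # Lohit Narayan 1234 9.3
```

Regexes work for a quick one-off like this; the parser-based tools below are far more forgiving once the markup gets messy.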
Tools of Trade

• urllib2 - passing your data, getting back results

• BeautifulSoup - parsing it out of the entangled web

• Mechanize - programmatic web browsing

• Scrapy - a web scraping framework
Our objective

Turn this HTML:

<body>
<div id="results">
<img src="shield.png" style="height:140px;"/>
<h1>Lohit Narayan</h1>
<div>
Registration Number : 1234<br />
<table>
<tr>
<td>Object Oriented Programming</td>
<td> B+</td>
</tr>
<tr>
<td>Probability and Statistics</td>
<td> B+</td>
</tr>
<tr>
<td><b>CGPA</b></td>
<td><b>9.3</b></td>
</tr>
</table>

into this:

Name             Reg     CGPA
====             ===     ====
Lohit Narayan,   1234,   9.3
...              ...     ...
urllib2


• Where to send?
• What to send?
• Answered by -
  firebug,
  chrome developer tools,
  a local debugging proxy
urllib2
import urllib
import urllib2

url = "http://localhost/results/results.php"

data = { 'regno' : '1000' }
data_encoded = urllib.urlencode(data)

request = urllib2.urlopen( url, data_encoded )
html = request.read()
urllib2

html now holds the whole results page as one string:

'<html>\n<head>\n\t<title>Foo University results</title>\n\t<style type="text/css">...
</style>\n</head>\n<body>\n\t<div id="results">\n\t<img src="shield.png" style="height:140px;"/>
\n\t<h1>Rahim Kumar</h1>\n\t<div>\n\t\tRegistration Number : 1000\t\t<br/>\n\t\t<table>
...<tr>\n\t\t\t\t\t<td>Object Oriented Programming</td>\n\t\t\t\t\t<td> B</td>\n\t\t\t\t</tr>
...<tr>\n\t\t\t\t<td><b>CGPA</b></td>\n\t\t\t\t<td><b>9.1</b></td>\n\t\t\t</tr>\n\t\t</table>
\n\t</div>\n\t</div>\n</body>\n</html>\n'
urllib2
import urllib    # for urlencode
import urllib2   # main urllib2 module

url = "http://localhost/results/results.php"

data = { 'regno' : '1000' }
# ^ data as a dict

data_encoded = urllib.urlencode(data)
# ^ a string, key=val pairs separated by '&'

request = urllib2.urlopen( url, data_encoded )
# ^ returns a network object

html = request.read()
# ^ reads out its contents to a string
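An aside for anyone trying this on Python 3: urllib2 and urllib.urlencode were reorganised there, so the same encoding step (shown with the slide's sample field name, plus a made-up second key) looks like:

```python
from urllib.parse import urlencode  # Python 3 home of urlencode

data = {'regno': '1000'}
print(urlencode(data))  # regno=1000

# multiple keys are joined with '&'
print(urlencode({'regno': '1000', 'sem': '4'}))  # regno=1000&sem=4
```

The urllib2.urlopen call itself becomes urllib.request.urlopen, but the POST-body idea is identical.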
urllib2
import urllib
import urllib2

url = "http://localhost/results/results.php"
data = {}
scraped_results = []

for i in xrange(1000, 3000):
    data['regno'] = str(i)
    data_encoded = urllib.urlencode(data)
    request = urllib2.urlopen(url, data_encoded)
    html = request.read()
    scraped_results.append(html)
'<html>\n<head>\n\t<title>Foo University results</title> ... <h1>Rahim Kumar</h1>
... <td><b>CGPA</b></td>\n\t\t\t\t<td><b>9.1</b></td> ... </table> ... </body>\n</html>\n'




                           That's alright,
                    but this is still meaningless.


                 Enter BeautifulSoup
BeautifulSoup




            http://www.flickr.com/photos/29233640@N07/3292261940/
BeautifulSoup

• Python HTML/XML parser
• Won't choke on bad markup
• auto-detects encoding
• easy tree traversal
• find all links whose URL contains 'foo.com'

                                  http://www.flickr.com/photos/29233640@N07/3292261940/
BeautifulSoup
>>> import re
>>> from BeautifulSoup import BeautifulSoup

>>> html = """<a href="http://foo.com/foobar">link1</a> <a
href="http://foo.com/beep">link2</a> <a href="http://
foo.co.in/beep">link3</a>"""

>>> soup = BeautifulSoup(html)
>>> links = soup.findAll('a', {'href': re.compile
(".*foo.com.*")})
>>> links
[<a href="http://foo.com/foobar">link1</a>, <a href="http://
foo.com/beep">link2</a>]

>>> links[0]['href']
u'http://foo.com/foobar'

>>> links[0].contents
[u'link1']
BeautifulSoup
>>> html = """<html> <head> <title>Foo University results</title> ... </head> <body>
... <div id="results"> <img src="shield.png" style="height:140px;" />
... <h1>Bipin Narang</h1> <div> Registration Number : 1000 <br />
... <table> <tr> <td>Humanities Elective II</td> <td>C-</td> </tr>
... <tr> <td><b>CGPA</b></td> <td><b>8</b></td> </tr> </table>
... </div> </div> </body> </html>"""
>>> soup = BeautifulSoup(html)
BeautifulSoup




 >>> soup.h1
 <h1>Bipin Narang</h1>
 >>> soup.h1.contents
 [u'Bipin Narang']
 >>> soup.h1.contents[0]
 u'Bipin Narang'
BeautifulSoup




 >>> soup.div.div.contents[0]
 u'\n\t\tRegistration Number : 1000\t\t'
 >>> soup.div.div.contents[0].split(':')
 [u'\n\t\tRegistration Number ', u' 1000\t\t']
 >>> soup.div.div.contents[0].split(':')[1].strip()
 u'1000'
BeautifulSoup




 >>> soup.find('b')
 <b>CGPA</b>
 >>> soup.find('b').findNext('b')
 <b>8</b>
 >>> soup.find('b').findNext('b').contents[0]
 u'8'
urllib2 + BeautifulSoup

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://localhost/results/results.php"
data={}
for i in xrange(1000,1100):
    data['regno'] = str(i)
    html = urllib2.urlopen( url, urllib.urlencode(data) ).read()
    soup = BeautifulSoup(html)
    name = soup.h1.contents[0]
    regno = soup.div.div.contents[0].split(':')[1].strip()
    # ^ though we already know the current regno ;) == str(i)
    cgpa = soup.find('b').findNext('b').contents[0]
    print "%s,%s,%s" % (regno, name, cgpa)
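Joining fields with commas by hand breaks the moment a scraped name itself contains a comma; Python's csv module quotes such fields automatically. A small sketch with made-up rows (Python 3 syntax):

```python
import csv
import io

# made-up (regno, name, cgpa) triples standing in for scraped results
rows = [
    ('1000', 'Rahim Kumar', '9.1'),
    ('1234', 'Narayan, Lohit', '9.3'),  # note the comma inside the name
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['regno', 'name', 'cgpa'])
writer.writerows(rows)
print(buf.getvalue())
# the comma-bearing name comes out safely quoted: "Narayan, Lohit"
```

In a real run you would pass an open file instead of the StringIO buffer.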
Mechanize




            http://www.flickr.com/photos/flod/3092381868/
Mechanize

   Stateful web browsing in Python
   after Andy Lester's WWW::Mechanize

              fill forms
             follow links
           handle cookies
           browser history



                    http://www.flickr.com/photos/flod/3092381868/
Mechanize
>>> import mechanize
>>> br = mechanize.Browser()
>>> response = br.open("http://google.com")
>>> br.title()
'Google'
>>> br.geturl()
'http://www.google.co.in/'
>>> print response.info()
Date: Fri, 10 Sep 2010 04:06:57 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Server: gws
X-XSS-Protection: 1; mode=block
Connection: close
content-type: text/html; charset=ISO-8859-1

>>> br.open("http://yahoo.com")
>>> br.title()
'Yahoo! India'
>>> br.back()
>>> br.title()
'Google'
Mechanize
filling forms
  >>> import mechanize
  >>> br = mechanize.Browser()
  >>> response = br.open("http://in.pycon.org/2010/account/
  login")
  >>> html = response.read()
  >>> html.find('href="/2010/user/ideamonk">Abhishek |')
  -1

  >>> br.select_form(name="login")
  >>> br.form["username"] = "ideamonk"
  >>> br.form["password"] = "***********"
  >>> response2 = br.submit()
  >>> html = response2.read()
  >>> html.find('href="/2010/user/ideamonk">Abhishek |')
  10186
Mechanize
logging into gmail




                     the form has no name!
                       can't select_form it
Mechanize
logging into gmail

  >>> br.open("http://mail.google.com")
  >>> login_form = br.forms().next()
  >>> login_form["Email"] = "ideamonk@gmail.com"
  >>> login_form["Passwd"] = "*************"
  >>> response = br.open( login_form.click() )
  >>> br.geturl()
  'https://mail.google.com/mail/?shva=1'
  >>> response2 = br.open("http://mail.google.com/mail/h/")
  >>> br.title()
  'Gmail - Inbox'
  >>> html = response2.read()
Mechanize
reading my inbox
  >>> soup = BeautifulSoup(html)
  >>> trs = soup.findAll('tr', {'bgcolor':'#ffffff'})
  >>> for tr in trs:
  ...     print tr.b
  ...
  <b>madhukargunjan</b>
  <b>NIHI</b>
  <b>Chamindra</b>
  <b>KR2 blr</b>
  <b>Imran</b>
  <b>Ratnadeep Debnath</b>
  <b>yasir</b>
  <b>Glenn</b>
  <b>Joyent, Carly Guarcello</b>
  <b>Zack</b>
  <b>SahanaEden</b>
  ...
  >>>
Mechanize




Can scrape my gmail chats !?
                          http://www.flickr.com/photos/ktylerconk/1465018115/
Scrapy




a Framework for web scraping
                           http://www.flickr.com/photos/lwr/3191194761
Scrapy




        uses XPath to
  navigate through elements
                            http://www.flickr.com/photos/apc33/15012679/
Scrapy
XPath and Scrapy
  <h1> My important chunk of text </h1>
  //h1/text()

  <div id="description"> Foo bar beep &gt; 42 </div>
  //div[@id='description']

  <div id="report"> <p>Text1</p> <p>Text2</p> </div>
  //div[@id='report']/p[2]/text()

  //  select from the document, wherever they are
  /   select from the root node
  @   select attributes

                scrapy.selector.XmlXPathSelector
                scrapy.selector.HtmlXPathSelector
          http://www.w3schools.com/xpath/default.asp
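For experimenting with these selectors without installing Scrapy, the standard library's xml.etree.ElementTree understands a useful subset of XPath (it needs well-formed XML, and text() becomes the element's .text attribute). The three patterns above, run against the same snippets:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<h1> My important chunk of text </h1>"
    "<div id='description'> Foo bar beep &gt; 42 </div>"
    "<div id='report'><p>Text1</p><p>Text2</p></div>"
    "</body></html>"
)

# //h1/text()
print(doc.find(".//h1").text.strip())

# //div[@id='description']
print(doc.find(".//div[@id='description']").text.strip())

# //div[@id='report']/p[2]/text()  (positions are 1-based, as in XPath)
print(doc.find(".//div[@id='report']/p[2]").text)
```

Scrapy's selectors accept the full XPath language and tolerate real-world HTML, which ElementTree does not.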
Scrapy




       Interactive Scraping Shell!
$ scrapy-ctl.py shell http://my/target/website

                                     http://www.flickr.com/photos/span112/2650364158/
Scrapy




           Step I
define a model to store Items

                           http://www.flickr.com/photos/bunchofpants/106465356
Scrapy




             Step II
create your Spider to extract Items

                            http://www.flickr.com/photos/thewhitestdogalive/4187373600
Scrapy




          Step III
write a Pipeline to store them

                                 http://www.flickr.com/photos/zen/413575764/
Scrapy



Let's write a Scrapy spider
SpojBackup

• Spoj? Programming contest website
• 4021039 code submissions
• 81386 users
• web scraping == backup tool
• uses Mechanize
• http://github.com/ideamonk/spojbackup
SpojBackup

•   https://www.spoj.pl/status/ideamonk/signedlist/




•   https://www.spoj.pl/files/src/save/3198919

•   Code walkthrough
Conclusions




              http://www.flickr.com/photos/bunchofpants/106465356
Conclusions
Python - large set of tools for scraping
Choose them wisely
Do not steal




                                           http://www.flickr.com/photos/bunchofpants/106465356
Use the cloud - http://docs.picloud.com/advanced_examples.html
Share your scrapers - http://scraperwiki.com/
Thanks to @amatix, flickr, you, etc.

        @PyCon India 2010                                     Abhishek Mishra
        http://in.pycon.org                              hello@ideamonk.com

                                                                 http://www.flickr.com/photos/bunchofpants/106465356

Our application got popular and now it breaksdevObjective
 
GDD Japan 2009 - Designing OpenSocial Apps For Speed and Scale
GDD Japan 2009 - Designing OpenSocial Apps For Speed and ScaleGDD Japan 2009 - Designing OpenSocial Apps For Speed and Scale
GDD Japan 2009 - Designing OpenSocial Apps For Speed and ScalePatrick Chanezon
 
Web Performance Optimisation
Web Performance OptimisationWeb Performance Optimisation
Web Performance OptimisationChris Burgess
 
Http/2 - What's it all about?
Http/2  - What's it all about?Http/2  - What's it all about?
Http/2 - What's it all about?Andy Davies
 
Performance and UX
Performance and UXPerformance and UX
Performance and UXPeter Rozek
 
Microdata semantic-extend
Microdata semantic-extendMicrodata semantic-extend
Microdata semantic-extendSeek Tan
 
Extended edition: How to speed up .NET and SQL Server web apps (2 x 45 mins w...
Extended edition: How to speed up .NET and SQL Server web apps (2 x 45 mins w...Extended edition: How to speed up .NET and SQL Server web apps (2 x 45 mins w...
Extended edition: How to speed up .NET and SQL Server web apps (2 x 45 mins w...Bart Read
 
Building a Single Page Application using Ember.js ... for fun and profit
Building a Single Page Application using Ember.js ... for fun and profitBuilding a Single Page Application using Ember.js ... for fun and profit
Building a Single Page Application using Ember.js ... for fun and profitBen Limmer
 
PrettyFaces: SEO, Dynamic, Parameters, Bookmarks, Navigation for JSF / JSF2 (...
PrettyFaces: SEO, Dynamic, Parameters, Bookmarks, Navigation for JSF / JSF2 (...PrettyFaces: SEO, Dynamic, Parameters, Bookmarks, Navigation for JSF / JSF2 (...
PrettyFaces: SEO, Dynamic, Parameters, Bookmarks, Navigation for JSF / JSF2 (...Lincoln III
 
TOSSUG HTML5 讀書會 新標籤與表單
TOSSUG HTML5 讀書會 新標籤與表單TOSSUG HTML5 讀書會 新標籤與表單
TOSSUG HTML5 讀書會 新標籤與表單偉格 高
 

Ähnlich wie Scraping with Python for Fun and Profit - PyCon India 2010 (20)

Hardcode SEO
Hardcode SEOHardcode SEO
Hardcode SEO
 
SearchMonkey
SearchMonkeySearchMonkey
SearchMonkey
 
Ruby on Rails Performance Tuning. Make it faster, make it better (WindyCityRa...
Ruby on Rails Performance Tuning. Make it faster, make it better (WindyCityRa...Ruby on Rails Performance Tuning. Make it faster, make it better (WindyCityRa...
Ruby on Rails Performance Tuning. Make it faster, make it better (WindyCityRa...
 
Windy cityrails performance_tuning
Windy cityrails performance_tuningWindy cityrails performance_tuning
Windy cityrails performance_tuning
 
Progressive downloads and rendering (Stoyan Stefanov)
Progressive downloads and rendering (Stoyan Stefanov)Progressive downloads and rendering (Stoyan Stefanov)
Progressive downloads and rendering (Stoyan Stefanov)
 
Tell your story better with great data visualization
Tell your story better with great data visualizationTell your story better with great data visualization
Tell your story better with great data visualization
 
Progressive Downloads and Rendering
Progressive Downloads and RenderingProgressive Downloads and Rendering
Progressive Downloads and Rendering
 
Stefan Judis "Did we(b development) lose the right direction?"
Stefan Judis "Did we(b development) lose the right direction?"Stefan Judis "Did we(b development) lose the right direction?"
Stefan Judis "Did we(b development) lose the right direction?"
 
Our application got popular and now it breaks
Our application got popular and now it breaksOur application got popular and now it breaks
Our application got popular and now it breaks
 
Our application got popular and now it breaks
Our application got popular and now it breaksOur application got popular and now it breaks
Our application got popular and now it breaks
 
Speed Matters!
Speed Matters!Speed Matters!
Speed Matters!
 
GDD Japan 2009 - Designing OpenSocial Apps For Speed and Scale
GDD Japan 2009 - Designing OpenSocial Apps For Speed and ScaleGDD Japan 2009 - Designing OpenSocial Apps For Speed and Scale
GDD Japan 2009 - Designing OpenSocial Apps For Speed and Scale
 
Web Performance Optimisation
Web Performance OptimisationWeb Performance Optimisation
Web Performance Optimisation
 
Http/2 - What's it all about?
Http/2  - What's it all about?Http/2  - What's it all about?
Http/2 - What's it all about?
 
Performance and UX
Performance and UXPerformance and UX
Performance and UX
 
Microdata semantic-extend
Microdata semantic-extendMicrodata semantic-extend
Microdata semantic-extend
 
Extended edition: How to speed up .NET and SQL Server web apps (2 x 45 mins w...
Extended edition: How to speed up .NET and SQL Server web apps (2 x 45 mins w...Extended edition: How to speed up .NET and SQL Server web apps (2 x 45 mins w...
Extended edition: How to speed up .NET and SQL Server web apps (2 x 45 mins w...
 
Building a Single Page Application using Ember.js ... for fun and profit
Building a Single Page Application using Ember.js ... for fun and profitBuilding a Single Page Application using Ember.js ... for fun and profit
Building a Single Page Application using Ember.js ... for fun and profit
 
PrettyFaces: SEO, Dynamic, Parameters, Bookmarks, Navigation for JSF / JSF2 (...
PrettyFaces: SEO, Dynamic, Parameters, Bookmarks, Navigation for JSF / JSF2 (...PrettyFaces: SEO, Dynamic, Parameters, Bookmarks, Navigation for JSF / JSF2 (...
PrettyFaces: SEO, Dynamic, Parameters, Bookmarks, Navigation for JSF / JSF2 (...
 
TOSSUG HTML5 讀書會 新標籤與表單
TOSSUG HTML5 讀書會 新標籤與表單TOSSUG HTML5 讀書會 新標籤與表單
TOSSUG HTML5 讀書會 新標籤與表單
 

Mehr von Abhishek Mishra

Introduction to Game programming with PyGame Part 1
Introduction to Game programming with PyGame Part 1Introduction to Game programming with PyGame Part 1
Introduction to Game programming with PyGame Part 1Abhishek Mishra
 
Amritadhara - Issues, Challenges, and a Solution
Amritadhara - Issues, Challenges, and a SolutionAmritadhara - Issues, Challenges, and a Solution
Amritadhara - Issues, Challenges, and a SolutionAbhishek Mishra
 
Project SpaceLock - Architecture & Design
Project SpaceLock - Architecture & DesignProject SpaceLock - Architecture & Design
Project SpaceLock - Architecture & DesignAbhishek Mishra
 
The Beginning - Jan 20 2009
The Beginning - Jan 20 2009The Beginning - Jan 20 2009
The Beginning - Jan 20 2009Abhishek Mishra
 
SpaceLock Meetup - Plan 25 Jan 09
SpaceLock Meetup - Plan 25 Jan 09SpaceLock Meetup - Plan 25 Jan 09
SpaceLock Meetup - Plan 25 Jan 09Abhishek Mishra
 

Mehr von Abhishek Mishra (10)

Paddles at pelham
Paddles at pelhamPaddles at pelham
Paddles at pelham
 
All in a day
All in a dayAll in a day
All in a day
 
Introduction to Game programming with PyGame Part 1
Introduction to Game programming with PyGame Part 1Introduction to Game programming with PyGame Part 1
Introduction to Game programming with PyGame Part 1
 
Introducing BugBase 1.0
Introducing BugBase 1.0Introducing BugBase 1.0
Introducing BugBase 1.0
 
Amritadhara - Issues, Challenges, and a Solution
Amritadhara - Issues, Challenges, and a SolutionAmritadhara - Issues, Challenges, and a Solution
Amritadhara - Issues, Challenges, and a Solution
 
Gibson Guitar Robot
Gibson Guitar RobotGibson Guitar Robot
Gibson Guitar Robot
 
Space Lock Web UI
Space Lock Web UISpace Lock Web UI
Space Lock Web UI
 
Project SpaceLock - Architecture & Design
Project SpaceLock - Architecture & DesignProject SpaceLock - Architecture & Design
Project SpaceLock - Architecture & Design
 
The Beginning - Jan 20 2009
The Beginning - Jan 20 2009The Beginning - Jan 20 2009
The Beginning - Jan 20 2009
 
SpaceLock Meetup - Plan 25 Jan 09
SpaceLock Meetup - Plan 25 Jan 09SpaceLock Meetup - Plan 25 Jan 09
SpaceLock Meetup - Plan 25 Jan 09
 

Kürzlich hochgeladen

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Kürzlich hochgeladen (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Scraping with Python for Fun and Profit - PyCon India 2010

  • 1. Scraping with Python for Fun & Profit @PyCon India 2010 Abhishek Mishra http://in.pycon.org hello@ideamonk.com
  • 2. Now... I want you to put your _data_ on the web! photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
  • 3. There is still a HUGE UNLOCKED POTENTIAL!! photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
  • 4. There is still a HUGE UNLOCKED POTENTIAL!! search analyze visualize co-relate photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
  • 5. Haven’t got DATA on the web as DATA! photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
  • 6. Haven’t got DATA on the web as DATA! photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
  • 7. Haven’t got DATA on the web as DATA! photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
  • 8. FRUSTRATION !!! 1. will take a lot of time 2. the data isn’t consumable 3. you want it right now! 4. missing source photo credit - TED - the next web http://www.ted.com/talks/lang/eng/tim_berners_lee_on_the_next_web.html
  • 9. Solution ? photo credit - Angie Pangie
  • 10. Solution ? just scrape it out! photo credit - Angie Pangie
  • 11. Solution ? just scrape it out! 1. analyze request 2. find patterns 3. extract data photo credit - Angie Pangie
  • 12. Solution ? just scrape it out! 1. analyze request 2. find patterns 3. extract data Use cases - good bad ugly photo credit - Angie Pangie
  • 13. Tools of Trade • urllib2: passing your data, getting back results • BeautifulSoup: parsing it out of the entangled web • Mechanize: programmatic web browsing • Scrapy: a web scraping framework
  • 15. Our objective: turn the results markup <body> <div id="results"> <img src="shield.png" style="height:140px;"/> <h1>Lohit Narayan</h1> <div> Registration Number : 1234<br /> <table> <tr> <td>Object Oriented Programming</td> <td> B+</td> </tr> <tr> <td>Probability and Statistics</td> <td> B+</td> </tr> <tr> <td><b>CGPA</b></td> <td><b>9.3</b></td> </tr> </table> into plain records: Name Reg CGPA, ==== === ====, Lohit Narayan, 1234, 9.3, ... ... ...
  • 16. urllib2 • Where to send? • What to send? • Answered by: Firebug, Chrome developer tools, a local debugging proxy
  • 17. urllib2 import urllib import urllib2 url = "http://localhost/results/results.php" data = { 'regno' : '1000' } data_encoded = urllib.urlencode(data) request = urllib2.urlopen( url, data_encoded ) html = request.read()
  • 18. urllib2 import urllib import urllib2 url = "http://localhost/results/results.php" data = { 'regno' : '1000' } data_encoded = urllib.urlencode(data) request = urllib2.urlopen( url, data_encoded ) html = request.read() '<html>\n<head>\n\t<title>Foo University results</title>\n\t<style type="text/css">\n\t\t#results {\n\t\t\tmargin:0px 60px;\n\t\t\tbackground: #efefff;\n\t\t\tpadding:10px;\n\t\t\tborder:2px solid #ddddff;\n\t\t\ttext-align:center;\n\t\t}\n\t</style>\n</head>\n<body>\n\t<div id="results">\n\t<img src="shield.png" style="height:140px;"/>\n\t<h1>Rahim Kumar</h1>\n\t<div>\n\t\tRegistration Number : 1000\t\t<br/>\n\t\t<table>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Humanities Elective II</td>\n\t\t\t\t\t<td> B-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Humanities Elective III</td>\n\t\t\t\t\t<td> A</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Discrete Mathematics</td>\n\t\t\t\t\t<td> C-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Environmental Science and Engineering</td>\n\t\t\t\t\t<td> B</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Data Structures</td>\n\t\t\t\t\t<td> D</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Science Elective I</td>\n\t\t\t\t\t<td> D-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Object Oriented Programming</td>\n\t\t\t\t\t<td> B</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Probability and Statistics</td>\n\t\t\t\t\t<td> A-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t<tr>\n\t\t\t\t<td><b>CGPA</b></td>\n\t\t\t\t<td><b>9.1</b></td>\n\t\t\t</tr>\n\t\t</table>\n\t</div>\n\t</div>\n</body>\n</html\n'
  • 19. urllib2 import urllib # for urlencode import urllib2 # main urllib2 module url = "http://localhost/results/results.php" data = { 'regno' : '1000' } # ^ data as an object data_encoded = urllib.urlencode(data) # a string, key=val pairs separated by '&' request = urllib2.urlopen( url, data_encoded ) # ^ returns a network object html = request.read() # ^ reads out its contents to a string
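The slide's flow translates almost one-to-one to current Python 3, where urlencode and urlopen moved into urllib.parse and urllib.request. A minimal sketch, assuming the talk's hypothetical localhost results page and its regno field:

```python
# Python 3 sketch of the slide's POST request; the URL and the 'regno'
# field are the talk's hypothetical results page, not a real service.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def fetch_result(url, regno):
    """POST a registration number and return the response HTML."""
    body = urlencode({'regno': str(regno)}).encode('ascii')
    with urlopen(Request(url, data=body)) as resp:  # network round-trip
        return resp.read().decode('utf-8', 'replace')

# The encoding step alone needs no server:
print(urlencode({'regno': '1000'}))  # regno=1000
```

Passing data= makes the request a POST, matching what urllib2.urlopen(url, data_encoded) did on Python 2.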
  • 20. urllib2 import urllib import urllib2 url = "http://localhost/results/results.php" data = {} scraped_results = [] for i in xrange(1000, 3000): data['regno'] = str(i) data_encoded = urllib.urlencode(data) request = urllib2.urlopen(url, data_encoded) html = request.read() scraped_results.append(html)
  • 21. '<html>\n<head>\n\t<title>Foo University results</title>\n\t<style type="text/css">\n\t\t#results {\n\t\t\tmargin:0px 60px;\n\t\t\tbackground: #efefff;\n\t\t\tpadding:10px;\n\t\t\tborder:2px solid #ddddff;\n\t\t\ttext-align:center;\n\t\t}\n\t</style>\n</head>\n<body>\n\t<div id="results">\n\t<img src="shield.png" style="height:140px;"/>\n\t<h1>Rahim Kumar</h1>\n\t<div>\n\t\tRegistration Number : 1000\t\t<br/>\n\t\t<table>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Humanities Elective II</td>\n\t\t\t\t\t<td> B-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Humanities Elective III</td>\n\t\t\t\t\t<td> A</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Discrete Mathematics</td>\n\t\t\t\t\t<td> C-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Environmental Science and Engineering</td>\n\t\t\t\t\t<td> B</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Data Structures</td>\n\t\t\t\t\t<td> D</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Science Elective I</td>\n\t\t\t\t\t<td> D-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Object Oriented Programming</td>\n\t\t\t\t\t<td> B</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Probability and Statistics</td>\n\t\t\t\t\t<td> A-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t<tr>\n\t\t\t\t<td><b>CGPA</b></td>\n\t\t\t\t<td><b>9.1</b></td>\n\t\t\t</tr>\n\t\t</table>\n\t</div>\n\t</div>\n</body>\n</html\n' That's alright, but this is still meaningless.
  • 22. '<html>\n<head>\n\t<title>Foo University results</title>\n\t<style type="text/css">\n\t\t#results {\n\t\t\tmargin:0px 60px;\n\t\t\tbackground: #efefff;\n\t\t\tpadding:10px;\n\t\t\tborder:2px solid #ddddff;\n\t\t\ttext-align:center;\n\t\t}\n\t</style>\n</head>\n<body>\n\t<div id="results">\n\t<img src="shield.png" style="height:140px;"/>\n\t<h1>Rahim Kumar</h1>\n\t<div>\n\t\tRegistration Number : 1000\t\t<br/>\n\t\t<table>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Humanities Elective II</td>\n\t\t\t\t\t<td> B-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Humanities Elective III</td>\n\t\t\t\t\t<td> A</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Discrete Mathematics</td>\n\t\t\t\t\t<td> C-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Environmental Science and Engineering</td>\n\t\t\t\t\t<td> B</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Data Structures</td>\n\t\t\t\t\t<td> D</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Science Elective I</td>\n\t\t\t\t\t<td> D-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Object Oriented Programming</td>\n\t\t\t\t\t<td> B</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t\t<tr>\n\t\t\t\t\t<td>Probability and Statistics</td>\n\t\t\t\t\t<td> A-</td>\n\t\t\t\t</tr>\n\t\t\t\t\t\t<tr>\n\t\t\t\t<td><b>CGPA</b></td>\n\t\t\t\t<td><b>9.1</b></td>\n\t\t\t</tr>\n\t\t</table>\n\t</div>\n\t</div>\n</body>\n</html\n' That's alright, but this is still meaningless. Enter BeautifulSoup
  • 23. BeautifulSoup http://www.flickr.com/photos/29233640@N07/3292261940/
  • 24. BeautifulSoup • Python HTML/XML Parser • Won't choke you on bad markup • auto-detects encoding • easy tree traversal • find all links whose url contains 'foo.com' http://www.flickr.com/photos/29233640@N07/3292261940/
  • 25. BeautifulSoup >>> import re >>> from BeautifulSoup import BeautifulSoup >>> html = """<a href="http://foo.com/foobar">link1</a> <a href="http://foo.com/beep">link2</a> <a href="http://foo.co.in/beep">link3</a>""" >>> soup = BeautifulSoup(html) >>> links = soup.findAll('a', {'href': re.compile(".*foo.com.*")}) >>> links [<a href="http://foo.com/foobar">link1</a>, <a href="http://foo.com/beep">link2</a>] >>> links[0]['href'] u'http://foo.com/foobar' >>> links[0].contents [u'link1']
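One detail worth hedging in the slide above: the dots in re.compile(".*foo.com.*") are unescaped, so the pattern would also match hosts like "fooXcom". A small standalone sketch with the dots escaped, using plain re so it runs without BeautifulSoup; the sample HTML mirrors the slide's three anchors:

```python
# Filter hrefs by host with an escaped pattern (the slide's regex leaves
# the dots unescaped, which is looser than intended).
import re

pat = re.compile(r'href="([^"]*foo\.com[^"]*)"')
html = ('<a href="http://foo.com/foobar">link1</a> '
        '<a href="http://foo.com/beep">link2</a> '
        '<a href="http://foo.co.in/beep">link3</a>')
print(pat.findall(html))  # ['http://foo.com/foobar', 'http://foo.com/beep']
```

As on the slide, the foo.co.in link is correctly left out.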
  • 26. BeautifulSoup >>> html = """<html>\n <head>\n <title>\n Foo University results\n </title>\n <style type="text/css">\n #results {\n\t\t\tmargin:0px 60px;\n\t\t\tbackground: #efefff;\n\t\t\tpadding:10px;\n\t\t\tborder:2px solid #ddddff;\n\t\t\ttext-align:center;\n\t\t}\n </style>\n </head>\n <body>\n <div id="results">\n <img src="shield.png" style="height:140px;" />\n <h1>\n Bipin Narang\n </h1>\n <div>\n Registration Number : 1000\n <br />\n <table>\n <tr>\n <td>\n Humanities Elective II\n </td>\n <td>\n C-\n </td>\n </tr>\n <tr>\n <td>\n Humanities Elective III\n </td>\n <td>\n B-\n </td>\n </tr>\n <tr>\n <td>\n Discrete Mathematics\n </td>\n <td>\n C\n </td>\n </tr>\n <tr>\n <td>\n Environmental Science and Engineering\n </td>\n <td>\n C\n </td>\n </tr>\n <tr>\n <td>\n Data Structures\n </td>\n <td>\n A\n </td>\n </tr>\n <tr>\n <td>\n Science Elective I\n </td>\n <td>\n B-\n </td>\n </tr>\n <tr>\n <td>\n Object Oriented Programming\n </td>\n <td>\n B+\n </td>\n </tr>\n <tr>\n <td>\n Probability and Statistics\n </td>\n <td>\n A\n </td>\n </tr>\n <tr>\n <td>\n <b>\n CGPA\n </b>\n </td>\n <td>\n <b>\n 8\n </b>\n </td>\n </tr>\n </table>\n </div>\n </div>\n </body>\n</html>"""
  • 27. BeautifulSoup >>> html = """<html>\n <head>\n <title>\n Foo University results\n </title>\n <style type="text/css">\n #results {\n\t\t\tmargin:0px 60px;\n\t\t\tbackground: #efefff;\n\t\t\tpadding:10px;\n\t\t\tborder:2px solid #ddddff;\n\t\t\ttext-align:center;\n\t\t}\n </style>\n </head>\n <body>\n <div id="results">\n <img src="shield.png" style="height:140px;" />\n <h1>\n Bipin Narang\n </h1>\n <div>\n Registration Number : 1000\n <br />\n <table>\n <tr>\n <td>\n Humanities Elective II\n </td>\n <td>\n C-\n </td>\n </tr>\n <tr>\n <td>\n Humanities Elective III\n </td>\n <td>\n B-\n </td>\n </tr>\n <tr>\n <td>\n Discrete Mathematics\n </td>\n <td>\n C\n </td>\n </tr>\n <tr>\n <td>\n Environmental Science and Engineering\n </td>\n <td>\n C\n </td>\n </tr>\n <tr>\n <td>\n Data Structures\n </td>\n <td>\n A\n </td>\n </tr>\n <tr>\n <td>\n Science Elective I\n </td>\n <td>\n B-\n </td>\n </tr>\n <tr>\n <td>\n Object Oriented Programming\n </td>\n <td>\n B+\n </td>\n </tr>\n <tr>\n <td>\n Probability and Statistics\n </td>\n <td>\n A\n </td>\n </tr>\n <tr>\n <td>\n <b>\n CGPA\n </b>\n </td>\n <td>\n <b>\n 8\n </b>\n </td>\n </tr>\n </table>\n </div>\n </div>\n </body>\n</html>""" >>> soup = BeautifulSoup(html) >>> soup.h1 <h1>Bipin Narang</h1> >>> soup.h1.contents [u'Bipin Narang'] >>> soup.h1.contents[0] u'Bipin Narang'
  • 28. BeautifulSoup >>> html = """<html>\n <head>\n <title>\n Foo University results\n </title>\n <style type="text/css">\n #results {\n\t\t\tmargin:0px 60px;\n\t\t\tbackground: #efefff;\n\t\t\tpadding:10px;\n\t\t\tborder:2px solid #ddddff;\n\t\t\ttext-align:center;\n\t\t}\n </style>\n </head>\n <body>\n <div id="results">\n <img src="shield.png" style="height:140px;" />\n <h1>\n Bipin Narang\n </h1>\n <div>\n Registration Number : 1000\n <br />\n <table>\n <tr>\n <td>\n Humanities Elective II\n </td>\n <td>\n C-\n </td>\n </tr>\n <tr>\n <td>\n Humanities Elective III\n </td>\n <td>\n B-\n </td>\n </tr>\n <tr>\n <td>\n Discrete Mathematics\n </td>\n <td>\n C\n </td>\n </tr>\n <tr>\n <td>\n Environmental Science and Engineering\n </td>\n <td>\n C\n </td>\n </tr>\n <tr>\n <td>\n Data Structures\n </td>\n <td>\n A\n </td>\n </tr>\n <tr>\n <td>\n Science Elective I\n </td>\n <td>\n B-\n </td>\n </tr>\n <tr>\n <td>\n Object Oriented Programming\n </td>\n <td>\n B+\n </td>\n </tr>\n <tr>\n <td>\n Probability and Statistics\n </td>\n <td>\n A\n </td>\n </tr>\n <tr>\n <td>\n <b>\n CGPA\n </b>\n </td>\n <td>\n <b>\n 8\n </b>\n </td>\n </tr>\n </table>\n </div>\n </div>\n </body>\n</html>""" >>> soup.div.div.contents[0] u'\n\t\tRegistration Number : 1000\t\t' >>> soup.div.div.contents[0].split(':') [u'\n\t\tRegistration Number ', u' 1000\t\t'] >>> soup.div.div.contents[0].split(':')[1].strip() u'1000'
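The registration-number extraction on the slide above is ordinary string work once BeautifulSoup hands back the text node; in isolation:

```python
# The slide's split/strip steps on the raw text node, shown standalone.
text = '\n\t\tRegistration Number : 1000\t\t'
regno = text.split(':')[1].strip()
print(regno)  # 1000
```

split(':') keeps everything after the colon, and strip() drops the surrounding whitespace.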
  • 29. BeautifulSoup >>> html = """<html>\n <head>\n <title>\n Foo University results\n </title>\n <style type="text/css">\n #results {\n\t\t\tmargin:0px 60px;\n\t\t\tbackground: #efefff;\n\t\t\tpadding:10px;\n\t\t\tborder:2px solid #ddddff;\n\t\t\ttext-align:center;\n\t\t}\n </style>\n </head>\n <body>\n <div id="results">\n <img src="shield.png" style="height:140px;" />\n <h1>\n Bipin Narang\n </h1>\n <div>\n Registration Number : 1000\n <br />\n <table>\n <tr>\n <td>\n Humanities Elective II\n </td>\n <td>\n C-\n </td>\n </tr>\n <tr>\n <td>\n Humanities Elective III\n </td>\n <td>\n B-\n </td>\n </tr>\n <tr>\n <td>\n Discrete Mathematics\n </td>\n <td>\n C\n </td>\n </tr>\n <tr>\n <td>\n Environmental Science and Engineering\n </td>\n <td>\n C\n </td>\n </tr>\n <tr>\n <td>\n Data Structures\n </td>\n <td>\n A\n </td>\n </tr>\n <tr>\n <td>\n Science Elective I\n </td>\n <td>\n B-\n </td>\n </tr>\n <tr>\n <td>\n Object Oriented Programming\n </td>\n <td>\n B+\n </td>\n </tr>\n <tr>\n <td>\n Probability and Statistics\n </td>\n <td>\n A\n </td>\n </tr>\n <tr>\n <td>\n <b>\n CGPA\n </b>\n </td>\n <td>\n <b>\n 8\n </b>\n </td>\n </tr>\n </table>\n </div>\n </div>\n </body>\n</html>""" >>> soup.find('b') <b>CGPA</b> >>> soup.find('b').findNext('b') <b>8</b> >>> soup.find('b').findNext('b').contents[0] u'8'
  • 30. urllib2 + BeautifulSoup import urllib import urllib2 from BeautifulSoup import BeautifulSoup url = "http://localhost/results/results.php" data = {} for i in xrange(1000, 1100): data['regno'] = str(i) html = urllib2.urlopen( url, urllib.urlencode(data) ).read() soup = BeautifulSoup(html) name = soup.h1.contents[0] regno = soup.div.div.contents[0].split(':')[1].strip() # ^ though we already know current regno ;) == str(i) cgpa = soup.find('b').findNext('b').contents[0] print "%s,%s,%s" % (regno, name, cgpa)
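The slides standardize on BeautifulSoup; as a fallback sketch, Python's built-in html.parser can pull the same <h1> name out of the results page when no third-party parser is installed. The sample markup here just mirrors the talk's hypothetical results page:

```python
# Stdlib-only extraction of the <h1> text (a fallback sketch; the talk
# itself uses BeautifulSoup for this).
from html.parser import HTMLParser

class NameParser(HTMLParser):
    """Collects the text of the first <h1> tag."""
    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.name = None

    def handle_starttag(self, tag, attrs):
        if tag == 'h1' and self.name is None:
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1:
            self.name = data.strip()
            self._in_h1 = False

p = NameParser()
p.feed('<div id="results"><h1>Lohit Narayan</h1><div>'
       'Registration Number : 1234</div></div>')
print(p.name)  # Lohit Narayan
```

BeautifulSoup stays the better choice for messy real-world markup; html.parser only tolerates reasonably well-formed pages.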
  • 31. Mechanize http://www.flickr.com/photos/flod/3092381868/
• 32. Mechanize

Stateful web browsing in Python, after Andy Lester's WWW::Mechanize
 • fill forms
 • follow links
 • handle cookies
 • browser history

http://www.flickr.com/photos/flod/3092381868/
• 33. Mechanize

>>> import mechanize
>>> br = mechanize.Browser()
>>> response = br.open("http://google.com")
>>> br.title()
'Google'
>>> br.geturl()
'http://www.google.co.in/'
>>> print response.info()
Date: Fri, 10 Sep 2010 04:06:57 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Server: gws
X-XSS-Protection: 1; mode=block
Connection: close
content-type: text/html; charset=ISO-8859-1
>>> br.open("http://yahoo.com")
>>> br.title()
'Yahoo! India'
>>> br.back()
>>> br.title()
'Google'
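The "stateful" part of mechanize is largely a cookie jar shared across requests, which the standard library can approximate on its own. A minimal sketch using the Python 3 module names (`http.cookiejar`, `urllib.request`); no network request is made here:

```python
import http.cookiejar
import urllib.request

# One jar shared by every request made through this opener --
# the statefulness a bare urlopen() call lacks
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar)
)

# opener.open(url) would now send back any cookies the server set
# on earlier responses; the jar starts out empty
print(len(jar))  # 0
```

mechanize layers form handling, link following, and history on top of exactly this kind of handler stack.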
• 34. Mechanize filling forms

>>> import mechanize
>>> br = mechanize.Browser()
>>> response = br.open("http://in.pycon.org/2010/account/login")
>>> html = response.read()
>>> html.find('href="/2010/user/ideamonk">Abhishek |')
-1
>>> br.select_form(name="login")
>>> br.form["username"] = "ideamonk"
>>> br.form["password"] = "***********"
>>> response2 = br.submit()
>>> html = response2.read()
>>> html.find('href="/2010/user/ideamonk">Abhishek |')
10186
• 37. Mechanize logging into gmail

the form has no name!
can't select_form it
• 38. Mechanize logging into gmail

>>> br.open("http://mail.google.com")
>>> login_form = br.forms().next()
>>> login_form["Email"] = "ideamonk@gmail.com"
>>> login_form["Passwd"] = "*************"
>>> response = br.open(login_form.click())
>>> br.geturl()
'https://mail.google.com/mail/?shva=1'
>>> response2 = br.open("http://mail.google.com/mail/h/")
>>> br.title()
'Gmail - Inbox'
>>> html = response2.read()
• 39. Mechanize reading my inbox

>>> soup = BeautifulSoup(html)
>>> trs = soup.findAll('tr', {'bgcolor': '#ffffff'})
>>> for tr in trs:
...     print tr.b
...
<b>madhukargunjan</b>
<b>NIHI</b>
<b>Chamindra</b>
<b>KR2 blr</b>
<b>Imran</b>
<b>Ratnadeep Debnath</b>
<b>yasir</b>
<b>Glenn</b>
<b>Joyent, Carly Guarcello</b>
<b>Zack</b>
<b>SahanaEden</b>
...
>>>
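The findAll-rows-then-grab-the-bold walk can also be mimicked with the standard library's `html.parser`. A self-contained Python 3 sketch; the sample markup below is hypothetical inbox-shaped rows, not real Gmail HTML:

```python
from html.parser import HTMLParser

class BoldCollector(HTMLParser):
    """Collects the text inside every <b> tag, in document order."""
    def __init__(self):
        super().__init__()
        self.in_b = False
        self.senders = []

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.in_b = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_b = False

    def handle_data(self, data):
        if self.in_b:
            self.senders.append(data)

# Hypothetical rows shaped like the inbox table on the slide
html = ('<tr bgcolor="#ffffff"><td><b>madhukargunjan</b></td></tr>'
        '<tr bgcolor="#ffffff"><td><b>Chamindra</b></td></tr>')

p = BoldCollector()
p.feed(html)
print(p.senders)
```

BeautifulSoup does the same event-driven work under the hood, then lets you query the resulting tree instead of tracking state by hand.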
• 40. Mechanize Can scrape my gmail chats!? http://www.flickr.com/photos/ktylerconk/1465018115/
  • 41. Scrapy a Framework for web scraping http://www.flickr.com/photos/lwr/3191194761
• 42. Scrapy uses XPath to navigate through elements http://www.flickr.com/photos/apc33/15012679/
• 43. Scrapy XPath and Scrapy

<h1> My important chunk of text </h1>
    //h1/text()

<div id="description"> Foo bar beep &gt; 42 </div>
    //div[@id='description']

<div id="report">
  <p>Text1</p>
  <p>Text2</p>
</div>
    //div[@id='report']/p[2]/text()

//  select from the document, wherever they are
/   select from the root node
@   select attributes

scrapy.selector.XmlXPathSelector
scrapy.selector.HtmlXPathSelector

http://www.w3schools.com/xpath/default.asp
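You can try the core of these expressions without installing Scrapy: the standard library's `xml.etree.ElementTree` understands a small XPath subset, including the attribute and position predicates used above (it lacks `text()`, so you read `.text` instead). A minimal sketch on XML shaped like the slide's examples:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root>"
    "<div id='description'>Foo bar beep &gt; 42</div>"
    "<div id='report'><p>Text1</p><p>Text2</p></div>"
    "</root>"
)

# [@id='...'] -- attribute predicate
desc = doc.find(".//div[@id='description']")
print(desc.text)

# p[2] -- positional predicate (1-based, as in XPath)
second = doc.find(".//div[@id='report']/p[2]")
print(second.text)  # Text2
```

Scrapy's selectors accept the full XPath language (via libxml2), so everything on the slide works there unchanged.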
• 44. Scrapy Interactive Scraping Shell!

$ scrapy-ctl.py shell http://my/target/website

http://www.flickr.com/photos/span112/2650364158/
• 45. Scrapy Step I define a model to store Items http://www.flickr.com/photos/bunchofpants/106465356
  • 46. Scrapy Step II create your Spider to extract Items http://www.flickr.com/photos/thewhitestdogalive/4187373600
  • 47. Scrapy Step III write a Pipeline to store them http://www.flickr.com/photos/zen/413575764/
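The three steps above (Item model, Spider extraction, Pipeline storage) can be caricatured in plain Python. This is a conceptual sketch of the data flow only, not the real Scrapy API (Scrapy uses `Item`/`Field` classes, `Spider` subclasses, and pipeline objects with a `process_item` method):

```python
# Step I -- a model for scraped items (stand-in for scrapy.Item)
def make_item(name, cgpa):
    return {'name': name, 'cgpa': cgpa}

# Step II -- a "spider": turns raw pages into items, one at a time
def spider(pages):
    for page in pages:
        # pretend each page has already been parsed into (name, cgpa)
        yield make_item(*page)

# Step III -- a "pipeline": validates and stores each item
def pipeline(items):
    store = []
    for item in items:
        if item['name']:          # drop items that failed extraction
            store.append(item)
    return store

pages = [('Bipin Narayan', '8'), ('', '7')]
print(pipeline(spider(pages)))
```

The point of the separation is the same as in Scrapy: the spider knows about page structure, the pipeline knows about storage, and the item model is the contract between them.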
• 48. Scrapy Let's write a Scrapy spider
• 49. SpojBackup
 • Spoj? Programming contest website
 • 4021039 code submissions
 • 81386 users
 • web scraping == backup tool
 • uses Mechanize
 • http://github.com/ideamonk/spojbackup
• 50. SpojBackup
 • https://www.spoj.pl/status/ideamonk/signedlist/
 • https://www.spoj.pl/files/src/save/3198919
 • Code walkthrough
  • 51. Conclusions http://www.flickr.com/photos/bunchofpants/106465356
• 52. Conclusions

Python - large set of tools for scraping
Choose them wisely
Do not steal

http://www.flickr.com/photos/bunchofpants/106465356
• 54. Conclusions

Python - large set of tools for scraping
Choose them wisely
Do not steal
Use the cloud - http://docs.picloud.com/advanced_examples.html
Share your scrapers - http://scraperwiki.com/

Thanks to @amatix, flickr, you, etc

@PyCon India 2010          Abhishek Mishra
http://in.pycon.org   hello@ideamonk.com
http://www.flickr.com/photos/bunchofpants/106465356