/me wants it
Scraping Sites to get Data

Rob Coup
robert@coup.net.nz
Who am I?

• Koordinates
• Open data
  open.org.nz

• Geek
• Pythonista
Datasets as websites
But I want to mix it up!

         http://www.flickr.com/photos/bowbrick/2365377635
DATA
       http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
And when do I want it?




                 http://www.flickr.com/photos/davidmaddison/102584440
Just Scrape It
First Example


• Wanganui District Council Food Gradings
• http://j.mp/i4yNZ
Review
• POST to URLs for each Grade
• Parse HTML response for:
 • Business Name
 • Address
 • Grading
• Output as CSV
What to POST?
• Tools: Firebug, Charles
  http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp

  txtGrading=A
    [ B, C, D, E, “Exempt”, “Currently Not Graded” ]
  Submit=Go
POSTing in Python
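import urllib
import urllib2

# Search form endpoint, found by watching the browser's request
url = 'http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp'
post_data = {
    'txtGrading': 'A',
    'Submit': 'Go',
}

# Passing a data argument makes urlopen send a POST rather than a GET
post_encoded = urllib.urlencode(post_data)
html = urllib2.urlopen(url, post_encoded).read()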

print html
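The slide's code targets Python 2. A rough Python 3 equivalent using only the standard library might look like the sketch below. The helper names `build_request` and `fetch` are made up for illustration, and the URL is the one from the slide, which may no longer resolve.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# URL taken from the slide; it may no longer be live.
URL = "http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp"


def build_request(grading):
    """Build (but do not send) the POST request for one grading value."""
    body = urlencode({"txtGrading": grading, "Submit": "Go"}).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(URL, data=body)


def fetch(grading, timeout=30):
    """Send the request and return the raw response body as bytes."""
    with urlopen(build_request(grading), timeout=timeout) as response:
        return response.read()
```

Splitting the request construction from the network call keeps the form encoding easy to check without touching the site.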
Results
…
<TD class="bodytext">
  <h2>Search results...</h2>
  <B>39 South</B><br />
  159 Victoria Ave<br />
  Wanganui<br />
  Grading: <B>A</b>
  <hr />
  <B>Alma Junction Dairy</B><br />
  1 Alma Rd<br />
  Wanganui<br />
  Grading: <B>A</b>
  <hr />
  …
Getting Data Out
• Tools: BeautifulSoup

• Parses HTML-ish documents
• Easy navigation & searching of tree
Our Parser
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
container = soup.find('td', {'class':'bodytext'})

for hr_el in container.findAll('hr'):
    # <b>NAME</b><br/>ADDRESS_0<br/>ADDRESS_1<br/>Grading:<b>GRADE</b><hr/>
    text_parts = hr_el.findPreviousSiblings(text=True, limit=3)
    # ['Grading:', 'ADDRESS_1', 'ADDRESS_0']
    address = (text_parts[2], text_parts[1])
    el_parts = hr_el.findPreviousSiblings('b', limit=2)
    # [<b>GRADE</b>, <b>NAME</b>]
    grade = el_parts[0].string
    name = el_parts[1].string
    print name, address, grade
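Where BeautifulSoup is not available, the same record layout can be walked with the standard library's `html.parser`. This is a different technique from the slide's sibling navigation, sketched in Python 3 under the assumption that the page looks exactly like the Results sample. `GradingParser` and `parse_gradings` are hypothetical names.

```python
from html.parser import HTMLParser


class GradingParser(HTMLParser):
    """Collect (name, address, grade) records; each record ends at an <hr>."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_container = False
        self._open_tag = None   # "b" or "h2" while inside that element
        self._bold = []         # text found inside <b>: name, then grade
        self._plain = []        # text found outside <b>: address lines

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "bodytext":
            self._in_container = True
        elif not self._in_container:
            return
        elif tag in ("b", "h2"):
            self._open_tag = tag
        elif tag == "hr":
            self._finish_record()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_container = False
        elif tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._in_container or self._open_tag == "h2":
            return
        if self._open_tag == "b":
            self._bold.append(text)
        elif not text.startswith("Grading"):
            # Skip the "Grading:" label; everything else is an address line.
            self._plain.append(text)

    def _finish_record(self):
        if len(self._bold) >= 2:
            name, grade = self._bold[0], self._bold[-1]
            self.records.append((name, tuple(self._plain), grade))
        self._bold, self._plain = [], []


def parse_gradings(html):
    parser = GradingParser()
    parser.feed(html)
    parser.close()
    return parser.records


# The two records shown on the Results slide.
SAMPLE = """<TD class="bodytext">
  <h2>Search results...</h2>
  <B>39 South</B><br />
  159 Victoria Ave<br />
  Wanganui<br />
  Grading: <B>A</b>
  <hr />
  <B>Alma Junction Dairy</B><br />
  1 Alma Rd<br />
  Wanganui<br />
  Grading: <B>A</b>
  <hr />
</TD>"""

for record in parse_gradings(SAMPLE):
    print(record)
```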
Putting it all together


• loop over the grading values
• write CSV output
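The two steps above could be combined roughly as follows, in Python 3. This is a sketch rather than the talk's actual script: `fetch_records` is a hypothetical callable standing in for the POST and parse steps, `fake_fetch` returns the sample rows from the Results slide, and the CSV column names are an assumption.

```python
import csv
import io

# Grading values listed on the "What to POST?" slide.
GRADINGS = ["A", "B", "C", "D", "E", "Exempt", "Currently Not Graded"]


def write_gradings_csv(out, fetch_records, gradings=GRADINGS):
    """Write one CSV row per business, looping over every grading value.

    fetch_records(grading) is a hypothetical callable that POSTs the form
    and parses the response into (name, address_lines, grade) tuples.
    """
    writer = csv.writer(out)
    writer.writerow(["name", "address", "grading"])
    count = 0
    for grading in gradings:
        for name, address_lines, grade in fetch_records(grading):
            writer.writerow([name, ", ".join(address_lines), grade])
            count += 1
    return count


def fake_fetch(grading):
    """Stand-in for the real POST + parse step, using the slide's sample rows."""
    if grading == "A":
        return [
            ("39 South", ("159 Victoria Ave", "Wanganui"), "A"),
            ("Alma Junction Dairy", ("1 Alma Rd", "Wanganui"), "A"),
        ]
    return []


buffer = io.StringIO()
rows_written = write_gradings_csv(buffer, fake_fetch)
print(buffer.getvalue())
```

Using `csv.writer` rather than joining strings by hand means addresses that contain commas are quoted correctly.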
Advanced Crawlers


• Form filling
• Authentication & cookies
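For sites that only need a login form and a session cookie, the standard library can cover the basics. A minimal Python 3 sketch follows; the `username` and `password` field names and the helper names are assumptions, since real forms differ.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode


def make_session():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def login(opener, login_url, username, password):
    """POST a login form; the field names are assumptions for illustration."""
    body = urlencode({"username": username, "password": password})
    with opener.open(login_url, body.encode("ascii"), timeout=30) as response:
        return response.read()
```

Any later `opener.open(...)` call on the same opener sends back whatever cookies the login response set.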
Mechanize


•   http://wwwsearch.sourceforge.net/mechanize/

•   programmable browser in Python

•   fills forms, navigates links & pages, eats cookies
Data Parsing

• JSON: SimpleJSON (pre-Py2.6)
• XML: ElementTree
• HTML: BeautifulSoup
• Nasties: Adobe PDF, Microsoft Excel
      “PDF files are where data goes to die”
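From Python 2.6 onwards the `json` module ships with the standard library, as does ElementTree. A small Python 3 sketch of both, using made-up payloads that describe the same record in each format:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads, for illustration only.
JSON_TEXT = '{"name": "39 South", "grading": "A"}'
XML_TEXT = '<business grading="A"><name>39 South</name></business>'

record = json.loads(JSON_TEXT)       # JSON object -> dict
element = ET.fromstring(XML_TEXT)    # XML text -> root Element

from_json = (record["name"], record["grading"])
from_xml = (element.findtext("name"), element.get("grading"))
print(from_json, from_xml)
```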
Reading nasties in Python

• Adobe PDF: PDFMiner, pdftable
• MS Excel: xlrd
Example Two


• Palmerston North City Food Gradings
• http://j.mp/31YuRH
Review
• Get HTML page
• Find current PDF link
• Download PDF
• Parse table
 • Name
 • Grading
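The "find current PDF link" step might be sketched like this in Python 3 with the standard library. The page fragment, the example.org URLs and the `PdfLinkFinder` name are all hypothetical; the real council page will need its own rule for picking the current file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collect the href of every <a> whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore any query string when checking the extension.
        if href.lower().split("?")[0].endswith(".pdf"):
            # Resolve relative links against the page they came from.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    finder = PdfLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.links


# Hypothetical page fragment and URL.
PAGE_URL = "http://example.org/food/gradings.html"
PAGE_HTML = (
    '<p><a href="/files/gradings-current.pdf">Gradings</a> '
    '<a href="about.html">About</a></p>'
)

print(find_pdf_links(PAGE_HTML, PAGE_URL))
```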
Parsing PDF
import urllib2
from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.layout import LAParams

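# pdf_url: link to the current PDF, taken from the HTML page
pdf_file = StringIO(urllib2.urlopen(pdf_url).read())

# TextConverter writes the extracted text into the `text` buffer
text = StringIO()
rsrc = PDFResourceManager()
device = TextConverter(rsrc, text, laparams=LAParams())
process_pdf(rsrc, device, pdf_file)
device.close()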

print text.getvalue()
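Turning the extracted text into name and grading pairs depends entirely on how the PDF lays out its table. The Python 3 sketch below assumes, without having checked the real file, that each row comes out as one line ending in a single-letter grade. The sample text is made up.

```python
import re

# Assumption: each table row is one line, with the grade as a final single
# letter A-E separated from the business name by whitespace.
ROW = re.compile(r"^(?P<name>.+?)\s+(?P<grade>[A-E])$")


def rows_from_text(text):
    """Return (name, grade) pairs for lines that look like table rows."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((match.group("name"), match.group("grade")))
    return rows


# Made-up text, standing in for what text.getvalue() might return.
SAMPLE_TEXT = """Food Premises Gradings
Example Cafe          A
Another Takeaway   B
"""

print(rows_from_text(SAMPLE_TEXT))
```

Lines that do not fit the pattern, such as headings, are skipped rather than guessed at.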
Summary

• Python has some great tools for:
 • querying websites
 • parsing HTML & other formats

• Open data as data, not websites

Editor's Notes

  1. We’ve ended up with this datasets-as-websites problem.
  2. I might want to create an alternative presentation. Use it for something different, that the creator would never have conceived of. Or maybe just compare or combine it with other data. http://www.flickr.com/photos/bowbrick/2365377635
  3. So, I need the raw data. Not some pretty webpages. http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
  4. At 3am on a Sunday morning of course. When my interest is up. No use having some mail-in-take-21-working-days option. http://www.flickr.com/photos/davidmaddison/102584440
  5. Usually it’s easier to ask forgiveness than permission.