9. Review
• POST to URLs for each Grade
• Parse HTML response for:
• Business Name
• Address
• Grading
• Output as CSV
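The parse-and-output steps above can be sketched as follows. This is a minimal illustration, not the talk's own code: it assumes the results come back as a plain HTML table with one row per business, and the sample markup and business names are made up. It uses Python 3's standard-library `html.parser` so it runs without extra installs; BeautifulSoup would do the same job with less code.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for one results page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Business Name</th><th>Address</th><th>Grading</th></tr>
  <tr><td>Example Cafe</td><td>1 Example St</td><td>A</td></tr>
  <tr><td>Sample Bakery</td><td>2 Sample Rd</td><td>B</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None


def html_to_csv(html):
    """Return the table rows found in html as CSV text with a header line."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Business Name", "Address", "Grading"])
    writer.writerows(parser.rows)
    return out.getvalue()


print(html_to_csv(SAMPLE_HTML))
```

In a real run, `html_to_csv` would be fed the response from each grade's POST rather than `SAMPLE_HTML`.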
10. What to POST?
• Tools: Firebug, Charles
http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp
txtGrading=A
[ B, C, D, E, “Exempt”, “Currently Not Graded” ]
Submit=Go
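As a rough sketch, the form bodies for all seven gradings can be generated in one loop. The field names and values are the ones listed on this slide; the `GRADES` list and `form_body` helper are names invented for the illustration, and the code uses Python 3's `urllib.parse`.

```python
from urllib.parse import urlencode

# The grading values listed on the slide.
GRADES = ["A", "B", "C", "D", "E", "Exempt", "Currently Not Graded"]


def form_body(grade):
    """Return the URL-encoded form body for one grading value."""
    return urlencode({"txtGrading": grade, "Submit": "Go"})


for grade in GRADES:
    print(form_body(grade))
```

Note that `urlencode` turns the spaces in "Currently Not Graded" into `+`, which is what a browser sends for a form field.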
11. POSTing in Python
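# Python 2 (urllib2 became urllib.request in Python 3)
import urllib
import urllib2
url = 'http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp'
# Form fields captured from the browser
post_data = {
    'txtGrading': 'A',
    'Submit': 'Go',
}
# Encode the fields as an application/x-www-form-urlencoded body
post_encoded = urllib.urlencode(post_data)
# Passing a data argument makes urlopen send a POST instead of a GET
html = urllib2.urlopen(url, post_encoded).read()
print html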
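For readers on Python 3, a rough equivalent is sketched below. It builds the request without sending it, so it runs offline; the `build_request` helper is a name invented here, and the commented-out last line assumes the council's page is still reachable at that address.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

URL = "http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp"


def build_request(grade):
    """Build (but do not send) the POST request for one grading value."""
    # In Python 3 the request body must be bytes, not str.
    body = urlencode({"txtGrading": grade, "Submit": "Go"}).encode("ascii")
    return Request(URL, data=body)


req = build_request("A")
print(req.get_method(), req.full_url)
print(req.data)

# Sending it needs network access and assumes the page is still live:
# html = urlopen(req).read().decode("utf-8", errors="replace")
```

Supplying `data` is what switches the request from GET to POST, the same behaviour the Python 2 code relies on.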
17. Mechanize
• http://wwwsearch.sourceforge.net/mechanize/
• programmable browser in Python
• fills forms, navigates links & pages, eats cookies
18. Data Parsing
• JSON: SimpleJSON (pre-Py2.6)
• XML: ElementTree
• HTML: BeautifulSoup
• Nasties: Adobe PDF, Microsoft Excel
“PDF files are where data goes to die”
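For the friendlier formats, a small sketch of reading the same kind of record from JSON and from XML. The sample records and field names are invented for the illustration. From Python 2.6 onward the standard library's `json` module covers what SimpleJSON did, and `ElementTree` lives in `xml.etree`; the code below is written for Python 3.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records; real feeds will have their own field names.
JSON_TEXT = '[{"name": "Example Cafe", "grading": "A"}]'
XML_TEXT = (
    "<premises>"
    '<business grading="B"><name>Sample Bakery</name></business>'
    "</premises>"
)


def from_json(text):
    """Return (name, grading) pairs from a JSON list of objects."""
    return [(item["name"], item["grading"]) for item in json.loads(text)]


def from_xml(text):
    """Return (name, grading) pairs from <business> elements."""
    root = ET.fromstring(text)
    return [(b.findtext("name"), b.get("grading")) for b in root.iter("business")]


print(from_json(JSON_TEXT))
print(from_xml(XML_TEXT))
```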
19. Reading nasties in Python
• Adobe PDF: PDFMiner, pdftable
• MS Excel: xlrd
21. Review
• Get HTML page
• Find current PDF link
• Download PDF
• Parse table
• Name
• Grading
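The "find current PDF link" step might look like the sketch below. It assumes the page links to the PDF with an ordinary `<a href="...pdf">` tag; the page URL, the sample markup, and the `PdfLinkFinder` class are all hypothetical. It uses Python 3's standard library only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page address and markup, for illustration only.
PAGE_URL = "http://example.org/food/grades.html"
SAMPLE_PAGE = (
    '<p><a href="/about.html">About</a> '
    '<a href="docs/grades-current.pdf">Current food grades (PDF)</a></p>'
)


class PdfLinkFinder(HTMLParser):
    """Collect absolute URLs of every link whose href ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    finder = PdfLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.links


print(find_pdf_links(SAMPLE_PAGE, PAGE_URL))
```

The first entry of the returned list would then serve as the address to download.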
22. Parsing PDF
import urllib2
from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.layout import LAParams
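# pdf_url: address of the current PDF, taken from the link found on the HTML page
pdf_file = StringIO(urllib2.urlopen(pdf_url).read())
text = StringIO()  # buffer that receives the extracted text
rsrc = PDFResourceManager()
device = TextConverter(rsrc, text, laparams=LAParams())
# process_pdf is available in older PDFMiner releases; later versions may not ship it
process_pdf(rsrc, device, pdf_file)
device.close()
print text.getvalue()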
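Once the text is out of the PDF, the table still has to be rebuilt from it. The sketch below shows one possible approach under a strong assumption: that each table row comes out as a single line ending in a one-letter grade from A to E. Real PDFMiner output often is not this tidy, and the sample text and `parse_rows` helper are invented for the illustration (Python 3).

```python
import re

# Hypothetical extracted text; the layout of a real PDF may differ a lot.
SAMPLE_TEXT = """Name Grading
Example Cafe A
Sample Bakery B

Page 1
"""

# Assumption: a row is "<name><whitespace><single letter A-E>" on one line.
ROW = re.compile(r"^(?P<name>.+?)\s+(?P<grade>[A-E])$")


def parse_rows(text):
    """Return (name, grade) pairs for every line that looks like a table row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((match.group("name"), match.group("grade")))
    return rows


print(parse_rows(SAMPLE_TEXT))
```

Lines that do not fit the pattern, such as the header and the page footer, are skipped silently, so it is worth checking the row count against the PDF by eye.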
23. Summary
• Python has some great tools for:
• querying websites
• parsing HTML & other formats
• Open data as data, not websites
Editor's Notes
We’ve ended up with this datasets-as-websites problem.
I might want to create an alternative presentation, or use the data for something its creator would never have conceived of. Or maybe just compare or combine it with other data.
http://www.flickr.com/photos/bowbrick/2365377635
So, I need the raw data. Not some pretty webpages.
http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
At 3am on a Sunday morning of course. When my interest is up. No use having some mail-in-take-21-working-days option.
http://www.flickr.com/photos/davidmaddison/102584440
Usually it’s easier to ask forgiveness than permission.