This document discusses open data and tools for scraping, cleaning, and visualizing data from the web. It introduces ScraperWiki for scraping and refining data using scripts. Google Refine and Fusion Tables are presented as tools for cleaning and visualizing data, followed by Drupal modules for the same tasks. It closes with contact details and an invitation for questions.
1. May 2011
MAKING THE GOV DATA OPEN
MAREK SOTAK | ATOMIC ANT
www.atomicant.co.uk
2. OH HAI!
ABOUT ME & ATOMIC ANT
Marek Sotak
• Web designer, developer
• From Prague, Czech Republic
• Over 5 years with Drupal - since v4.6
• Rootcandy admin theme
• Organising events - Drupal Design Camp, Local Meet-ups
• @sotak on twitter
• http://sotak.co.uk - personal blog/experiments
atomicant.co.uk #justsaying ;)
3. OH HAI!
ABOUT ME & ATOMIC ANT
• Based in London & Prague
• Human interface design, training, branding, development
• Clients all over the world
• http://atomicant.co.uk
8. DATA MINING - SCRAPING
LET'S GET DIRTY
BigClean.org – Prague
9. DATA MINING - SCRAPING
LET'S GET DIRTY
There's a lot of data lying around on the internet that can be
useful → Crime reports, government reports, statistics,
missing pets register, current affairs
However, it is often published in a format such as PDF or HTML,
something you can't readily take and use for
calculations, visualizations, filtering, etc.
Is it really that hard to publish something as CSV or XML?
10. DATA MINING - SCRAPING
LET'S GET DIRTY
Ministry of the Interior – Czech Republic
Public Collections
- open what?
15. DATA MINING - SCRAPING
LET'S GET DIRTY
Request a site/content
Walk through the HTML using the DOM and selectors
Do whatever you want with the data
Save the data
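The four steps above can be sketched in Python, one of the languages ScraperWiki supports. This is only a minimal illustration using the standard library: the sample HTML, the column names, and the helper names (fetch, TableRowParser, scrape_rows, to_csv) are all made up for this example, and a real scraper would more likely use a selector library instead of a hand-written parser.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: request a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class TableRowParser(HTMLParser):
    """Step 2: walk the HTML and collect the cell text of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def scrape_rows(html):
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Step 4: serialise the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample page; in practice this would come from fetch(url).
SAMPLE_HTML = """
<table>
  <tr><th>Collection</th><th>Amount</th></tr>
  <tr><td>Example A</td><td> 1200 </td></tr>
  <tr><td>Example B</td><td>350</td></tr>
</table>
"""

rows = scrape_rows(SAMPLE_HTML)
# Step 3: do whatever you want with the data, here: convert amounts to integers.
header = rows[0]
records = [[name, int(amount)] for name, amount in rows[1:]]
csv_text = to_csv([header] + records)
```

The network call is kept in its own function so the parsing and saving steps can be tried on a saved copy of the page before pointing the scraper at a live site.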
17. SCRAPERWIKI
WHAT IS IT? HOW TO USE IT
Scrape and link data using Ruby, Python and PHP scripts
that run maintenance-free in the cloud. Request data for
scoops and better decisions.
20. SCRAPERWIKI
WHAT IS IT? HOW TO USE IT
Why would you want to use SCRAPERWIKI rather than
other scraping tools or custom code?
21. SCRAPERWIKI
WHAT IS IT? HOW TO USE IT
• The dataset is available to everyone
• Anyone can access the data through an API
• If the source changes and the scraper breaks, anyone can
fix the scraper
• Anyone can fork the scraper
24. GOOGLE REFINE
WHAT IS IT? HOW TO USE IT
Google Refine is a power tool for working with messy data,
cleaning it up, transforming it from one format into another,
extending it with web services,...
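To give a feel for what "cleaning up messy data" involves, here is a small Python sketch that groups spellings which differ only in case, accents, punctuation, or word order. It is loosely modelled on the idea of a normalised "fingerprint" key and is not Google Refine's actual implementation; the function names and the sample values are assumptions made for this illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalised key so that near-identical spellings collide."""
    # Strip accents so accented and plain spellings produce the same key.
    decomposed = unicodedata.normalize("NFKD", value)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and replace punctuation with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", plain.lower())
    # Sort and de-duplicate the tokens so word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Return groups of values that share a fingerprint (size 2 or more)."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical messy column, as it might come out of a scraper.
messy = ["Praha 1", "praha 1 ", "PRAHA, 1", "Brno", "London", "london."]
clusters = cluster(messy)
```

Each cluster is a candidate for merging into one canonical value; a person still has to decide which spelling wins, which is the part an interactive tool makes comfortable.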
25. VISUALISE
TELL THE STORY
There is more to it
It's not just data with values in a spreadsheet or database
Data can tell a story!
26. GOOGLE FUSION TABLES
WHAT IS IT? HOW TO USE IT
Easy visualisation http://tables.googlelabs.com/
27. SCRAPING WITH DRUPAL
AND NOW FOR SOMETHING COMPLETELY DIFFERENT
Feeds – http://drupal.org/project/feeds
Scraping
Feeds QueryPath Parser – project/feeds_querypath_parser
Feeds XPath Parser – project/feeds_xpathparser
Cleaning up data
Feeds tamper - http://drupal.org/project/feeds_tamper
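The modules above are configured through the Drupal interface rather than written as code, but the underlying idea (select nodes with an XPath expression, then tidy each value before saving) can be illustrated outside Drupal. The Python sketch below uses the standard library's limited XPath support; the XML sample, element names, and attribute values are invented for this example and have nothing to do with the modules' own internals.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; a real feed would be fetched from a URL.
SAMPLE_XML = """
<reports>
  <report type="crime"><title> Burglary </title><district>Praha 1</district></report>
  <report type="pets"><title>Missing cat</title><district>Praha 2</district></report>
  <report type="crime"><title>Theft</title><district>Brno</district></report>
</reports>
"""

root = ET.fromstring(SAMPLE_XML)

# Scraping: an XPath expression picks only the reports of type "crime".
# Cleaning up: each extracted value is trimmed before it is stored.
crime_reports = [
    {
        "title": report.findtext("title").strip(),
        "district": report.findtext("district").strip(),
    }
    for report in root.findall(".//report[@type='crime']")
]
```

Each resulting dictionary corresponds roughly to one imported item, with the XPath expressions playing the role of the field mapping.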
28. VISUALISE WITH DRUPAL
AND NOW FOR SOMETHING COMPLETELY DIFFERENT
Mapping
- Location – http://drupal.org/project/location
- Openlayers – http://drupal.org/project/openlayers
- Gmap – http://drupal.org/project/gmap
Graphs/Charts
- Graphs
- Graphs Charts
- Open Flash Chart
- Views
29. GO! SCRAPE IT!
CHALLENGE
EU Open Data Challenge
- €20,000 to win
- 28 days left to enter
http://opendatachallenge.org/
30. TOOLS
SCRAPING DATA
ScraperWiki – http://scraperwiki.com
PHP Simple HTML DOM – http://bit.ly/phphtmldom
PHPQuery - http://code.google.com/p/phpquery/
Open Data Kit - http://opendatakit.org/
32. TOOLS
VISUALIZING DATA
Google Fusion Tables - http://tables.googlelabs.com/
The Best Tools for Visualization - http://rww.to/toolsforvis
34. THANK YOU
Q&A | LET'S CONNECT
QUESTIONS?
@sotak - twitter
http://sotak.co.uk - personal blog
http://atomicant.co.uk - company website