Data wrangling with open
source tools
Tony Hirst
Dept of Communication & Systems
The Open University, UK
Premises
“I take data
from wherever I
can get it”
1
“Appropriate
everything”
2
Conversations
with data
3
Visual
Conversations
with data
3
(Accession Plot)
@mediaczar
If a picture’s worth a
thousand words,
maybe it should take
as long to read?
Most learning
analytics won’t be
performed by
learning analytics
researchers
How can we help
people fashion
their own tools to
support data
conversations?
Recipes
site:open.ac.uk
Have a
conversation
with the data…
Ask the right
questions…
xkcd.com/1138
Sometimes a question
makes most sense in
the context of
questions previously
asked and answers
previously received
DATA
USERS
Educators
Learners
Planners
Marketers
Policymakers
Researchers
Press
NGOs
“DEVELOPERS”
Have
dashboard,
so what?
A tools and
issues
based view
DATA
TOOLS
USERS
PROBLEMS
Example – Google Fusion Tables
Fusion Table
https://www.google.com/fusiontables/DataSource?docid=1VKG7iCbFlsEYJzTuQppf4xoIqq1ABxWTdW6O_7o#rows:id=1
http://is.gd/qhuaoA
Walkthrough
http://blog.ouseful.info/2012/11/16/a-quick-look-at-gcsealevel-certificate-awards-market-share-by-examination-board/
http://is.gd/f9YAbG
DATA
TOOLS
USERS
PROBLEMS
Access/obtain data
Make sense of data
Ask specific questions of data
Communicate in a data-centric way
Load data
Clean data
Merge/enrich data
DATA
Issues
TOOLS
DATA
Other
TOOLS
Issues
TOOLS
“Tool based
programming”
A barrier to access
(for the tool user) is
data format
JSON XML CSV XLS
TSV
.db
HTML
PDF DOC TXT
GLUE LOGIC (Glue code)
=importHTML(URL, "table", N)
HTML
QUERYABLE
DATA
Try it…
Example Page
http://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_the_United_States_by_endowment
http://is.gd/7Vbg6n
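The same "HTML table in, queryable rows out" glue logic can be sketched outside a spreadsheet. The following is a minimal illustration in Python using only the standard library; the helper name `import_html_table`, the class `TableExtractor`, and the sample markup are assumptions made for this sketch (the figures are made up), and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table, in page order
        self._row = None   # row currently being filled, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, n):
    """Return the n-th table (1-indexed, like the spreadsheet formula)."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[n - 1]


# Made-up sample page; the names and numbers are placeholders, not real data.
SAMPLE_HTML = """
<table>
  <tr><th>Institution</th><th>Endowment</th></tr>
  <tr><td>Example College</td><td>100</td></tr>
  <tr><td>Sample University</td><td>250</td></tr>
</table>
"""

rows = import_html_table(SAMPLE_HTML, 1)
header, *records = rows
# Once the table is a list of rows, it can be filtered like any other data.
large = [r for r in records if int(r[1]) > 150]
```

Once the table is held as rows, sorting, filtering and joining become ordinary list operations rather than page scraping.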
Google Spreadsheets as a database
Explorer
https://views.scraperwiki.com/run/google_spreadsheet_query/
http://is.gd/jiMJoh
Walkthrough
http://schoolofdata.org/2013/05/24/asking-questions-of-data-garment-factories-data-expedition/
http://is.gd/qJHihu
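To give a flavour of what "asking questions" of tabular data as if it were a database looks like, here is a small sketch that swaps in SQLite (via Python's standard `sqlite3` module) for the spreadsheet query interface. The table name `awards`, its columns, and the rows are all made up for illustration.

```python
import sqlite3

# Made-up rows standing in for a sheet of data: (board, subject, entries).
ROWS = [
    ("Board A", "Maths", 120),
    ("Board A", "Physics", 80),
    ("Board B", "Maths", 200),
]


def query_rows(rows, sql):
    """Load rows into an in-memory table called 'awards' and run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE awards (board TEXT, subject TEXT, entries INTEGER)"
        )
        conn.executemany("INSERT INTO awards VALUES (?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# A question phrased as a query: how many entries did each board receive?
TOTALS = query_rows(
    ROWS,
    "SELECT board, SUM(entries) FROM awards GROUP BY board ORDER BY board",
)
```

Each follow-up question is just another query string against the same rows, which is what makes the exchange feel like a conversation.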
=importCSV(URL, N)
HTML
INTERACTIVE
DASHBOARD
Google Charts
Google Chart
Visualization API
https://code.google.com/apis/ajax/playground/
http://is.gd/TTHIUh
Google
Visualisation
API
googleVis
(R)
https://developers.facebook.com/docs/reference/api/examples/
http://is.gd/7cRnvS
A barrier to access
(for the tool user) is
data shape
A barrier to access
(for the tool user) is
data cleanliness
Questions of
identity
The Open University
Open University
OU
Open Uni
Open University, UK
NORMALISATION/RECONCILIATION
Reconciliation to
a canonical name
and/or to a
unique identifier
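Reconciling the name variants listed above to one canonical name and identifier can be sketched as a lookup over normalised strings. This is only an illustration: the identifier `inst-0001` is hypothetical, and the normalisation rule (lower-case, drop commas, collapse whitespace) is an assumption chosen for the example.

```python
CANONICAL = "The Open University"

# The variants shown on the slide, all referring to the same institution.
VARIANTS = [
    "The Open University",
    "Open University",
    "OU",
    "Open Uni",
    "Open University, UK",
]

# Hypothetical unique identifier, invented for this sketch.
IDS = {CANONICAL: "inst-0001"}


def normalise(name):
    """Lower-case, drop commas and collapse runs of whitespace."""
    return " ".join(name.lower().replace(",", " ").split())


LOOKUP = {normalise(variant): CANONICAL for variant in VARIANTS}


def reconcile(name):
    """Return (canonical name, identifier), or None if the name is unknown."""
    canonical = LOOKUP.get(normalise(name))
    if canonical is None:
        return None
    return canonical, IDS[canonical]


matched = reconcile("  open university,  UK ")
unmatched = reconcile("Some Other Institution")
```

Names that fall through as `None` are the ones that need a human decision or a fuzzier matching step.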
A stumbling block
(for the data user)
is data enrichment
A stumbling block
(for the data user)
is joining datasets
A stumbling block
(for the data user)
is joining partially
matched data
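One way to approach a join on partially matching keys is approximate string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function name `fuzzy_join`, the 0.7 cutoff, and the sample records are assumptions for illustration, and any real join of this kind would need its matches checked by eye.

```python
import difflib


def fuzzy_join(left, right, cutoff=0.7):
    """Join two dicts on approximately matching keys.

    Returns rows of (left key, matched right key or None,
    left value, right value or None).
    """
    lowered = {key.lower(): key for key in right}
    rows = []
    for name, left_value in left.items():
        hits = difflib.get_close_matches(
            name.lower(), list(lowered), n=1, cutoff=cutoff
        )
        matched = lowered[hits[0]] if hits else None
        rows.append((name, matched, left_value, right.get(matched)))
    return rows


# Made-up records from two sources that spell institution names differently.
LEFT = {"Open University": 1, "Univ of Oxford": 2, "Cardiff University": 3}
RIGHT = {"The Open University": "A", "University of Oxford": "B"}

JOINED = fuzzy_join(LEFT, RIGHT)
```

Rows whose second element is `None` had no sufficiently close partner; lowering the cutoff finds more partners at the cost of more false matches.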
Rolling your own
interactive data
exploration tools
R Shiny
Apps
ui.R server.R
RCharts
Many chart tools
do the work for
you if the data is
in the right shape
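Getting data into "the right shape" often means moving between a wide layout (one column per year, say) and a long layout (one row per observation), which is the form many charting tools expect. A minimal sketch of that reshaping, with a hypothetical helper `wide_to_long` and made-up numbers:

```python
def wide_to_long(rows, id_field, var_name="variable", value_name="value"):
    """Turn one-column-per-variable rows into one-row-per-observation rows."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return long_rows


# Made-up wide table: one row per board, one column per year.
WIDE = [
    {"board": "Board A", "2011": 10, "2012": 12},
    {"board": "Board B", "2011": 7, "2012": 9},
]

LONG = wide_to_long(WIDE, "board", var_name="year", value_name="awards")
```

The long form has one dictionary per (board, year) pair, so a charting tool can map `year` to an axis and `board` to a colour without further work.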
DATA
TOOLS
USERS
PROBLEMS
Just ask…
ask.SchoolOfData.org
blog.ouseful.info
@psychemedia

Editor's Notes

  1. I am not a journalist, but it seems to me that a large part of your work, and indeed a large part of the work of a scientist or an analyst, is in asking the right questions of a source, and knowing how to frame those questions. The data journalist knows how to ask questions of data.
  2. Also – high incidence of crime around police stations (no location, so police station used as default location); Russell Square as a murder hotspot.
  3. Another nice example of this, and one used by many advocates of data visualisation, is the famous example of Anscombe’s quartet, four sets of two-dimensional data with some interesting properties.
  4. For example, many of the “classic” summary statistics for the corresponding columns in these data sets are to all intents and purposes the same.
  5. But when we look at the datasets as a set of scatterplots, we see how the data tells very different stories.
  6. People learn the skills they need, as they need them.
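
The point made in notes 3 to 5 about Anscombe’s quartet can be checked numerically. The sketch below uses the commonly published values of the quartet and computes a few of the “classic” summary statistics for each set; the helper names `pearson` and `summarise` are made up for this illustration, and rounding to two decimal places is an arbitrary choice.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet as commonly published; sets I to III share their x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return cov / spread


def summarise(xs, ys):
    """A few summary statistics, rounded to two decimal places."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "var_x": round(variance(xs), 2),
        "r": round(pearson(xs, ys), 2),
    }


SUMMARIES = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, stats)
```

The printed summaries are close to identical across the four sets, which is why the scatterplots, rather than the statistics, reveal how different the datasets are.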