2013-06-26
Knowledge Workers Toronto
Methods
1
We are surrounded by data
... by MESSY data
- Multiple standards and formats
  - Structured vs. unstructured
  - Field names and formats vary ...
- Human error (misspellings, typos, etc.)
- Non-normalized inputs (free-text entries)
- Incomplete data (laziness)
....
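A minimal sketch of how misspelled or free-text entries like these can be grouped for review. It reduces each value to a normalized key (lowercased, accents and punctuation stripped, tokens sorted) and groups values that share a key. The function names and sample values are illustrative assumptions, not something shown in the slides, and the keying is a simplified version of the general "fingerprint" idea rather than any tool's exact algorithm.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a free-text value to a normalized key (simplified sketch)."""
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and replace ASCII punctuation with spaces.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = unaccented.lower().translate(table)
    # Split on whitespace, drop duplicate tokens, sort, and rejoin.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by key; keep only keys shared by more than one value."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return {key: vs for key, vs in groups.items() if len(vs) > 1}


# Made-up sample entries for illustration.
samples = ["Toronto, ON", "toronto on", "ON Toronto ", "Montréal", "Montreal", "Ottawa"]
clusters = cluster(samples)
for key, variants in clusters.items():
    print(key, "<-", variants)
```

Each cluster is a candidate for merging into one spelling; a person still decides which variant is the correct one.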
OpenRefine, the
- Swiss Army knife for data manipulation!
- glue step between your IT systems
What's OpenRefine?
(formerly Google Refine, formerly Gridworks)
- A cross-platform web application that runs locally
- A community-based project hosted on GitHub
- Available in two distributions, with multiple extensions
- Something between a spreadsheet and SQL
Four use cases
1. Data Profiling *
2. Data Cleaning *
3. Data Extension *
4. ETL (Extract, Transform, Load) Prototyping
File 1: Data Profiling & Cleaning
City Subject Thesaurus: an XML file with 5,431 concepts.
What we will do:
- Explore the file
- Fix inconsistencies
- Transpose / Merge fields
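The exploration step can be sketched outside the tool as well. The example below counts how often each field appears and tallies the distinct values per field, which is usually enough to surface inconsistencies. The XML layout is an assumption: the slides do not show the file's structure, so the `THESAURUS` and `CONCEPT` element names and the sample records are hypothetical; only the field codes (DESCRIPTOR, BT, NT, STA) come from the thesaurus legend.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the real file's element names and nesting are not
# given in the slides. Field codes follow the thesaurus legend.
SAMPLE = """
<THESAURUS>
  <CONCEPT><DESCRIPTOR>Parks</DESCRIPTOR><BT>Recreation</BT><STA>Approved</STA></CONCEPT>
  <CONCEPT><DESCRIPTOR>Playgrounds</DESCRIPTOR><BT>Parks</BT><STA>approved</STA></CONCEPT>
  <CONCEPT><DESCRIPTOR>Recreation</DESCRIPTOR><NT>Parks</NT></CONCEPT>
</THESAURUS>
"""


def profile(xml_text, record_tag="CONCEPT"):
    """Return record count, per-field occurrence counts, and per-field value tallies."""
    root = ET.fromstring(xml_text)
    records = root.findall(record_tag)
    field_counts = Counter()
    value_counts = {}
    for rec in records:
        for child in rec:
            field_counts[child.tag] += 1
            text = (child.text or "").strip()
            value_counts.setdefault(child.tag, Counter())[text] += 1
    return len(records), field_counts, value_counts


total, fields, values = profile(SAMPLE)
for tag, n in fields.most_common():
    print(f"{tag}: {n} occurrence(s) across {total} records")
# Distinct values of one field; differing capitalization is an inconsistency to fix.
print("STA values:", dict(values["STA"]))
```

In this made-up sample the status field holds both "Approved" and "approved", the kind of inconsistency the cleaning step is meant to fix.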
File 2: The Economist Best City Contest 2012
Prepare data for an application to the Economist Intelligence Unit's (EIU) Best City Contest 2012, using G. Hofstede's Values Survey Module 2008
What we will do:
- Clean duplicates
- Create new data
- Add data from a different project
- Use Project History
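The first three steps can be sketched in plain Python: drop duplicate rows, derive a new column, and pull in values from a second table by a shared key. Everything here is an illustrative assumption: the column names, the `dedupe` and `enrich` helpers, and all values are made-up placeholders, not contest or survey data.

```python
# All rows and values below are made-up placeholders.
cities = [
    {"city": "City A", "country": "X", "population": 1000, "area_km2": 50},
    {"city": "city a ", "country": "X", "population": 1000, "area_km2": 50},
    {"city": "City B", "country": "Y", "population": 900, "area_km2": 30},
]
# Stand-in for the "different project": one row of extra data per country.
country_scores = {"X": {"score": 10}, "Y": {"score": 20}}


def dedupe(rows, key_field):
    """Keep the first row for each key, ignoring case and surrounding spaces."""
    seen = set()
    unique = []
    for row in rows:
        key = row[key_field].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def enrich(rows, lookup, join_field):
    """Add a derived column and merge in fields from another table."""
    out = []
    for row in rows:
        new_row = dict(row)
        # Create new data: a column computed from existing columns.
        new_row["density"] = row["population"] / row["area_km2"]
        # Add data from the other table; rows without a match are left as-is.
        new_row.update(lookup.get(row[join_field], {}))
        out.append(new_row)
    return out


result = enrich(dedupe(cities, "city"), country_scores, "country")
for row in result:
    print(row)
```

Because each helper returns new rows instead of modifying its input, every intermediate state stays available, which loosely mirrors what a step-by-step project history provides.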
OpenRefine
http://openrefine.org
@OpenRefine
Martin Magdinier
martin.magdinier@gmail.com
@magdmartin
Thanks!
City Subject Thesaurus: Legend
DESCRIPTOR  The preferred term
FAC         Facet
SC          Subject category of the term
SN          Scope note
SRC         Source of term
UF          Used for
BT          Broader term
NT          Narrower term
RT          Related term
STA         Term status
INP         Input date
APP         Approval date
UPD         Modified date
TNR         Term number

20130626 OpenRefine Introduction