Intro to Open Refine
An overview & walkthrough to get you started.
• intro/overview (15 min)
• walkthrough (45 min)
• intro to advanced (10 min)
• q&a (20 min)
http://www.txdhc.org/txdhc-training-webcast-materials/
Jennifer Hecker, Liz Grumbach
“a tool for working
with messy data”
Cleaning up data that:
• is in a simple tabular format
• is inconsistently formatted
• uses inconsistent terminology
• get an overview of a data set
• resolve inconsistencies
• split data up into more granular parts
• match local data up to other data sets
• enhance a data set with data from other sources
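
To make "get an overview of a data set" concrete: a text facet is essentially a count of how often each distinct value appears in a column. The sketch below imitates that idea in plain Python. It is not how Refine is implemented, and the sample column and the function name `text_facet` are made up for illustration.

```python
from collections import Counter

def text_facet(values):
    """Count how often each distinct value appears, most common first."""
    return Counter(values).most_common()

# Hypothetical sample column with the kind of inconsistencies a facet exposes:
# different capitalization and a stray trailing space.
publishers = ["Penguin", "penguin", "Penguin ", "Oxford UP", "Oxford UP", "Penguin"]

for value, count in text_facet(publishers):
    # repr() makes trailing whitespace visible in the printed overview.
    print(f"{count:>3}  {value!r}")
```

Four distinct values show up for what are really two publishers, which is exactly the kind of inconsistency a facet helps you spot.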
https://cms-assets.tutsplus.com/uploads/users/199/posts/20843/image/text-facet-openrefine.png
https://cms-assets.tutsplus.com/uploads/users/199/posts/20843/image/clustering-openrefine.png
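
Clustering can be sketched in a few lines. OpenRefine's default clustering method is "key collision" with a "fingerprint" key: each value is reduced to a normalized key, and values that share a key are offered as a cluster. The version below is a simplified approximation (the real keying function does more, such as normalizing accented characters), and the sample names are only illustrative.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Simplified key: trim, lowercase, drop punctuation, sort the unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values whose fingerprints collide; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Illustrative sample data, not taken from a real data set.
names = [
    "Hecker, Jennifer", "Jennifer Hecker", "jennifer hecker ",
    "Grumbach, Liz", "Liz Grumbach", "Owen Stephens",
]

for group in cluster(names):
    print(group)
```

Each printed group is a set of variants you would review and then merge into one preferred form, which is the review-and-batch-edit step Refine offers in its clustering dialog.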
Freebase Gridworks
=
Google Refine
=
OpenRefine
=
Refine
…ask some questions about your data set:
• What type of data is it & what format is it in?
• What’s the size of your data set?
• What question do you want to ask your data?
• What do you need to do to find the answer?
Excel: familiarity, better for data entry, cut-and-paste operation, no paging to navigate
Google Spreadsheets: similar to Excel, can get external data relatively easily, easy to collaborate and share
Google Fusion Tables: good if you just want to filter, easy to share
Text editor: a powerful text editor can do many things
Unix tools: more challenging to use, but quick, and some things (finding things, sorting) are easy
Writing code: most sophisticated, and the most to learn!
<And now Liz attempts the
dangerous LIVE DEMO!>
Regular expressions
• “wildcards on steroids” that allow for more granular data manipulation
(http://www.regular-expressions.info)
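
As a small taste of what "wildcards on steroids" means, here is a sketch using Python's `re` module. The two patterns (a four-digit year, and a US-style date) and the helper names are assumptions chosen for illustration; the regular expression syntax you type into Refine may differ in small details from Python's.

```python
import re

# A four-digit year from 1000 to 2099, standing alone as a whole word.
YEAR = re.compile(r"\b(1\d{3}|20\d{2})\b")

def extract_years(text):
    """Return every standalone four-digit year found in the text."""
    return [int(year) for year in YEAR.findall(text)]

def normalize_date(text):
    """Rewrite M/D/YYYY or M-D-YYYY as YYYY-MM-DD; leave anything else unchanged."""
    match = re.fullmatch(r"\s*(\d{1,2})[/-](\d{1,2})[/-](\d{4})\s*", text)
    if not match:
        return text
    month, day, year = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"

print(extract_years("Austen, Jane, 1775-1817"))  # [1775, 1817]
print(normalize_date("3/7/2015"))                # 2015-03-07
print(normalize_date("circa 1900"))              # circa 1900
```

Values that do not match the date pattern pass through untouched, which is usually the safer default when batch editing.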
Transformations using General Refine
Expression Language (GREL)
• kind of like a formula in Excel
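
For a feel of what such a formula does: a GREL expression such as `value.trim().toTitlecase()` is evaluated once per cell, with `value` standing for the cell's current content. The Python below is only a rough analogue of that per-cell model, not GREL itself, and the helper names and sample rows are hypothetical.

```python
def clean_name(value):
    # Rough analogue of the GREL expression value.trim().toTitlecase()
    return value.strip().title()

def split_multivalued(value, separator=";"):
    # Rough analogue of splitting a multi-valued cell on a separator,
    # dropping empty pieces and surrounding whitespace.
    return [part.strip() for part in value.split(separator) if part.strip()]

def transform_column(rows, column, func):
    """Apply func to one column of every row, returning new rows."""
    return [{**row, column: func(row[column])} for row in rows]

# Hypothetical sample rows.
rows = [
    {"author": "  jane AUSTEN ", "genres": "fiction; romance"},
    {"author": "mary shelley", "genres": "fiction ; gothic;"},
]

cleaned = transform_column(rows, "author", clean_name)
cleaned = transform_column(cleaned, "genres", split_multivalued)
for row in cleaned:
    print(row)
```

The point to take away is the model: one short expression, applied uniformly to every cell in a column.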
Retrieve data from online sources
• example: use names to retrieve birth/death dates from Virtual International Authority File (VIAF)
Match data to external data sources using
• Extensions for RDF, DBpedia, Named-Entity Recognition (NER), etc…
• And ‘reconciliation’ services
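
In OpenRefine, retrieving data is done from the column menu with "Add column by fetching URLs": you build one URL per row and parse the response into a new column. The sketch below mimics that pattern in Python. The endpoint `https://example.org/lookup` and the `birth`/`death` field names are placeholders, not the real VIAF API, and the network call is replaced by a fake so the example runs offline.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint used only for illustration; NOT the real VIAF API.
BASE_URL = "https://example.org/lookup"

def build_url(name):
    """Build one lookup URL per name, with the name safely URL-encoded."""
    return f"{BASE_URL}?{urlencode({'query': name})}"

def add_dates(names, fetch):
    """Fetch a JSON record per name and pull out (birth, death).

    `fetch` is any function that takes a URL and returns JSON text,
    so a real HTTP client or a test double can be plugged in.
    """
    results = {}
    for name in names:
        record = json.loads(fetch(build_url(name)))
        # The field names "birth" and "death" are assumptions.
        results[name] = (record.get("birth"), record.get("death"))
    return results

def fake_fetch(url):
    # Stand-in for a network call; always returns the same made-up record.
    return json.dumps({"birth": "1775", "death": "1817"})

print(build_url("Austen, Jane"))
print(add_dates(["Austen, Jane"], fake_fetch))
```

Passing the fetch function in as a parameter keeps the URL-building and parsing logic testable without touching the network.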
Use the ‘cross’ function to compare the
contents of two Refine projects, or to
share data between them.
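
Conceptually, ‘cross’ is a lookup: take the value in the current cell, find the rows in another project whose key column holds the same value, and pull data from those rows. Here is a rough Python analogue, assuming two small made-up "projects" represented as lists of dictionaries; the function names mirror the idea, not Refine's internals.

```python
from collections import defaultdict

def build_index(rows, key_column):
    """Index the rows of the 'other project' by the value in key_column."""
    index = defaultdict(list)
    for row in rows:
        index[row[key_column]].append(row)
    return index

def cross(value, index):
    """Return all rows in the other project whose key equals value."""
    return index.get(value, [])

# Two hypothetical projects sharing an author name as the common key.
people = [
    {"name": "Jane Austen", "born": "1775"},
    {"name": "Mary Shelley", "born": "1797"},
]
books = [
    {"title": "Emma", "author": "Jane Austen"},
    {"title": "Frankenstein", "author": "Mary Shelley"},
    {"title": "Beowulf", "author": "Unknown"},
]

people_by_name = build_index(people, "name")
for book in books:
    matches = cross(book["author"], people_by_name)
    # Take the first match if there is one; leave the cell empty otherwise.
    book["author_born"] = matches[0]["born"] if matches else None
    print(book)
```

The lookup only works when the key values match exactly, which is one reason to facet and cluster the key columns in both projects first.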
• TxDHC blog post on this webinar http://www.txdhc.org/txdhc-training-webcast-materials/
• The OpenRefine Wiki https://github.com/OpenRefine/OpenRefine/wiki
• OpenRefine User Documentation https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users
• The ‘Free your metadata’ site http://freeyourmetadata.org...
• …and book http://book.freeyourmetadata.org
• The OpenRefine mailing list and forum http://groups.google.com/d/forum/openrefine
http://bit.ly/1uGPd0f
Please email us if you have any questions:
Jennifer = jenniferraehecker@gmail.com
Liz = egrumbac@tamu.edu
credits * acknowledgements * citations
These slides were developed by Jennifer Hecker (j.hecker@Austin.utexas.edu) and Liz Grumbach (egrumbac@tamu.edu)
on behalf of University of Texas Libraries, Texas A&M’s Initiative for Digital Humanities, Media and Culture, and the Texas
Digital Humanities Consortium using many resources including the wonderful course material developed by Owen
Stephens on behalf of the British Library (http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/).
Unless otherwise stated, all images, audio or video content are separate works with their own license, and should not be
assumed to be CC-BY in their own right. This work is licensed under a Creative Commons Attribution 4.0 International
License http://creativecommons.org/licenses/by/4.0/. It is suggested when crediting this work, you include the phrase
“Developed by Liz Grumbach and Jennifer Hecker on behalf of the University of Texas, Texas A&M, and the TXDHC.”
Thanks to University of Texas Libraries, Texas A&M’s Initiative for Digital Humanities, and the Texas Digital Humanities
Consortium for facilitating this presentation.

TXDHC OpenRefine Training


Editor's Notes

  1. Howdy there everybody! Thanks for joining this inaugural webinar from the Texas Digital Humanities Consortium. We are testing out this format for ongoing consortial training use.
  2. This session is being recorded, and you may follow along with these slides or access the recording, slides, and supplementary materials on the Texas Digital Humanities Consortium website. Also on the website is a link to a three-question survey, and we would very much appreciate any feedback you are willing to provide. During the webinar Liz and I are going to trade off presenting and chat-window-monitoring duties. Please be patient and cross your fingers for us!
  3. I’m going to introduce today’s presenters very quickly. I’m Jennifer Hecker and I work at the University of Texas Libraries. I specialize in bringing my years of experience as an archivist to bear on our digital access challenges. I also work in the digital humanities space, coordinating collaborations and projects with students, faculty and staff all over UT. I also direct the Austin Fanzine Project and do a lot of outreach and mentoring work. Liz Grumbach works for the Initiative for Digital Humanities, Media, and Culture at Texas A&M as a Research Associate, where she supports faculty, staff, and student Digital Humanities projects and endeavors. She's also the Project Manager for the Advanced Research Consortium and 18thConnect.org, where she organizes peer review, supports the creation of digital editions, and maintains the digital records for all ARC research nodes. She's involved in the management of the Early Modern OCR Project (eMOP), which aims to teach machines how to read early modern fonts and make open source software packages available to other institutions seeking to automatically generate transcriptions of large page image data sets.
  4. An open-source tool for working with messy data. Runs in a browser, but locally – your data don’t leave your machine. Active development community – people creating extensions – and discussion list.
  5. This is some of the basic stuff you might use Refine for. In a little bit, Liz is going to walk you through these functions. Refine does a lot more, too, but today we’re just going to get your feet wet. I’ll come back after the demo and talk a little bit about some of the more advanced possibilities that you can explore…
  6. Refine lets you
  7. Here’s a slide from a webinar I attended a couple of weeks ago. It’s an example of OpenRefine in action – here being used to normalize data as one step in the workflow of a larger metadata aggregation project. So what does it look like?
  8. Refine lets you split out data that is in one cell into multiple cells – and vice versa.
  9. Here are some simple examples of what we mean when we talk about “normalizing metadata”. Refine lets you easily batch edit data so that it uniformly adheres to your standards.
  10. Here’s what text faceting looks like. It’s useful for getting an overview of your data. Here you can quickly see some inconsistencies you might want to address.
  11. Refine also lets you do something called clustering. – change slide – This is my personal favorite part!
  12. Here’s a little bigger view… Liz will go into more detail during the demo, but basically, Refine groups together values that it thinks are similar, according to a number of factors that you can adjust, so that you can review, modify and batch edit them. Faceting and clustering are by far the two functions I tend to use most in Refine.
  13. A little background: In conversation, you’ll probably hear several of these names for this tool. Nobody calls it Freebase Gridworks any more, but the other three are all common. The tool started out at Metaweb as Freebase Gridworks; Google acquired Metaweb and renamed it Google Refine, then stopped supporting the project and handed it over to the community, hence the name OpenRefine. Lots of folks – myself included – take the lazy approach and just call it Refine.
  14. There are a number of tools out there that can help you manipulate data sets in a variety of ways. How do you know which is right for you? First, ask yourself some questions about your data.
  15. Here’s a matrix that can help guide your tool selection. It’s not comprehensive, there are more tools out there for sure (and all these tools do more than the brief description above would imply – for example Google Fusion Tables can be used to geocode location information and automatically generate maps, stuff like that), but these are the most common tools and this gives you an idea of what to expect from each of them… Ok, now I’m going to attempt to hand over the presentation to Liz, a couple hundred miles to my east.
  16. Ok, so now that you’re all excited about what you can do with Refine, I’m going to quickly run through some of the more advanced functions. By using regular expressions, which I’ve seen described as “wildcards on steroids”, you can more finely filter and manipulate your data.
  17. Refine also has its own expression language, GREL (the General Refine Expression Language), which can use those same regular expressions to perform transformations on your data.
  18. Using various community-developed extensions which you can easily select and install, you can retrieve data from online sources such as VIAF, and you can match data to external sources such as DBpedia.
  19. Thanks for tuning in y’all! We hope this was helpful and we welcome any questions or feedback y’all might have!
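
For anyone who wants to see the idea behind the faceting and clustering described in notes 10 to 12, here is a minimal Python sketch. It is not OpenRefine's own code: the sample values are made up, the function names are hypothetical, and the fingerprint step is a simplified version of the key-collision approach (lowercase, strip punctuation, sort the unique words), assumed here only for illustration.

```python
import string
from collections import Counter, defaultdict

# Made-up sample column with inconsistent spellings of the same places.
values = [
    "Austin, TX",
    "austin tx",
    "TX Austin",
    "College Station",
    "college station ",
    "Houston",
]


def text_facet(column):
    """Count how often each distinct value occurs, like a text facet."""
    return Counter(column)


def fingerprint(value):
    """Simplified key: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(column):
    """Group distinct values whose fingerprints collide."""
    groups = defaultdict(set)
    for value in column:
        groups[fingerprint(value)].add(value)
    # Only keys shared by more than one distinct value are worth reviewing.
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


facet = text_facet(values)
clusters = cluster(values)

if __name__ == "__main__":
    for value, count in facet.most_common():
        print(repr(value), count)
    for key, members in clusters.items():
        print(key, "<-", members)
```

The facet shows every distinct spelling with its count, which is the overview step. The cluster step then proposes groups of spellings that probably mean the same thing, and a person still decides which spelling to keep before batch editing.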