Presented by Jennifer Hecker and Elizabeth Grumbach and hosted by the Texas Digital Humanities Consortium, these are the slides for the TxDHC training webcast on OpenRefine, February 12th, 2015.
5. Cleaning up data that:
is in a simple tabular format
is inconsistently formatted
has inconsistent terminology
6. get an overview of a data set
resolve inconsistencies
split data up into more granular parts
match local data up to other data sets
enhance a data set with data from other sources
14. …ask some questions about your data set:
What type of data is it & what format is it in?
What’s the size of your data set?
What question do you want to ask your data?
What do you need to do to find the answer?
15. Excel
familiar; better for data entry; cut-and-paste operations; no paging to navigate
Google Spreadsheets
similar to Excel; can get external data relatively easily; easy to collaborate and share
Google Fusion Tables
good if you just want to filter; easy to share
Text editor
a powerful text editor can do many things
Unix tools
more challenging to use, but quick, and some things (finding things, sorting) are easy
Writing code
the most sophisticated option, and the most to learn!
19. Retrieve data from online sources
example: use names to retrieve birth/death dates
from Virtual International Authority File (VIAF)
Match data to external data sources using
Extensions for RDF, DBpedia, Named-Entity
Recognition (NER), etc…
And ‘reconciliation’ services
20. Use ‘cross’ function to compare
contents of two Refine projects, or
share data between the two projects.
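The idea behind cross is a key-based lookup: match a value in one project against a column in another project and pull back data from the matching row. A minimal Python sketch of that idea (the names and the stand-in "second project" dictionary here are invented for illustration; this is not Refine itself):

```python
# Sketch of the idea behind Refine's cross(): look up a key value
# in a second data set and pull back another column's value.
authors = ["Austen, Jane", "Twain, Mark"]

# Stands in for the second Refine project, keyed by the shared column.
dates_project = {
    "Austen, Jane": "1775-1817",
    "Twain, Mark": "1835-1910",
}

# For each author, fetch the matching dates (empty string if no match).
enriched = [(name, dates_project.get(name, "")) for name in authors]
print(enriched)
```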
21. TxDHC blog post on this webinar http://www.txdhc.org/txdhc-training-webcast-materials/
The OpenRefine Wiki https://github.com/OpenRefine/OpenRefine/wiki
OpenRefine User Documentation
https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users
The ‘Free your metadata’ site http://freeyourmetadata.org...
…and book http://book.freeyourmetadata.org
The OpenRefine mailing list and forum
http://groups.google.com/d/forum/openrefine
23. credits * acknowledgements * citations
These slides were developed by Jennifer Hecker (j.hecker@Austin.utexas.edu) and Liz Grumbach (egrumbac@tamu.edu) on behalf of the University of Texas Libraries, Texas A&M’s Initiative for Digital Humanities, Media and Culture, and the Texas Digital Humanities Consortium, using many resources, including the wonderful course material developed by Owen Stephens on behalf of the British Library (http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/).
Unless otherwise stated, all images, audio, or video content are separate works with their own licenses, and should not be assumed to be CC-BY in their own right. This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). When crediting this work, it is suggested that you include the phrase “Developed by Liz Grumbach and Jennifer Hecker on behalf of the University of Texas, Texas A&M, and the TxDHC.”
Thanks to the University of Texas Libraries, Texas A&M’s Initiative for Digital Humanities, and the Texas Digital Humanities Consortium for facilitating this presentation.
Editor’s Notes
Howdy there everybody! Thanks for joining this inaugural webinar from the Texas Digital Humanities Consortium. We are testing out this format for ongoing consortial training use.
This session is being recorded. You may follow along with these slides, or access the recording, slides, and supplementary materials on the Texas Digital Humanities Consortium website. Also on the website is a link to a three-question survey, and we would very much appreciate any feedback you are willing to provide. During the webinar, Liz and I are going to trade off presenting and chat-window-monitoring duties. Please be patient and cross your fingers for us!
I’m going to introduce today’s presenters very quickly. I’m Jennifer Hecker and I work at the University of Texas Libraries. I specialize in bringing my years of experience as an archivist to bear on our digital access challenges. I also work in the digital humanities space, coordinating collaborations and projects with students, faculty, and staff all over UT. I also direct the Austin Fanzine Project and do a lot of outreach and mentoring work.
Liz Grumbach works for the Initiative for Digital Humanities, Media, and Culture at Texas A&M as a Research Associate, where she supports faculty, staff, and student Digital Humanities projects and endeavors. She's also the Project Manager for the Advanced Research Consortium and 18thConnect.org, where she organizes peer review, supports the creation of digital editions, and maintains the digital records for all ARC research nodes. She's involved in the management of the Early Modern OCR Project (eMOP), which aims to teach machines how to read early modern fonts and to make open-source software packages available to other institutions seeking to automatically generate transcriptions of large page-image data sets.
An open-source tool for working with messy data.
Runs in a browser, but locally – your data don’t leave your machine.
Active development community – people creating extensions – and discussion list.
This is some of the basic stuff you might use Refine for. In a little bit, Liz is going to walk you through these functions. Refine does a lot more, too, but today we’re just going to get your feet wet. I’ll come back after the demo and talk a little bit about some of the more advanced possibilities that you can explore…
Refine lets you
Here’s a slide from a webinar I attended a couple of weeks ago. It’s an example of OpenRefine in action – here being used to normalize data as one step in the workflow of a larger metadata aggregation project. So what does it look like?
Refine lets you split out data that is in one cell into multiple cells – and vice versa.
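To make the split/join idea concrete, here is a minimal plain-Python sketch (not Refine itself; the sample cell value is invented) of breaking one cell into more granular parts and joining them back:

```python
# Sketch: splitting one cell into more granular parts, and back again.
cell = "Hecker, Jennifer"

# Split on the first comma and trim surrounding whitespace.
last, first = [part.strip() for part in cell.split(",", 1)]
print(last, first)  # Hecker Jennifer

# Joining works in reverse: recombine the parts into one cell.
rejoined = f"{last}, {first}"
print(rejoined)  # Hecker, Jennifer
```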
Here are some simple examples of what we mean when we talk about “normalizing metadata”. Refine lets you easily batch edit data so that it uniformly adheres to your standards.
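In spirit, batch normalization means mapping every variant spelling of a value onto one canonical form. A hedged Python sketch of that idea (the variant spellings and canonical form here are made up for illustration; Refine does this interactively via facets and batch edits):

```python
# Sketch: normalize variant values by mapping each one
# to a single canonical form.
variants = {
    "U.T.": "University of Texas",
    "UT Austin": "University of Texas",
    "Univ. of Texas": "University of Texas",
}

def normalize(value):
    """Return the canonical form if the value is a known variant."""
    return variants.get(value, value)

records = ["UT Austin", "Texas A&M", "U.T."]
cleaned = [normalize(v) for v in records]
print(cleaned)  # ['University of Texas', 'Texas A&M', 'University of Texas']
```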
Here’s what text faceting looks like. It’s useful for getting an overview of your data. Here you can quickly see some inconsistencies you might want to address.
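Under the hood, a text facet is essentially a frequency count of the distinct values in one column, which is why inconsistencies jump out. A small Python sketch of the same overview (the sample column values are invented):

```python
# Sketch: a text facet is a count of distinct values in a column.
from collections import Counter

column = ["Austin", "austin", "Austin, TX", "Houston", "Austin"]
facet = Counter(column)

# Listing values by frequency makes inconsistent forms easy to spot.
for value, count in facet.most_common():
    print(f"{value}  ({count})")
```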
Refine also lets you do something called clustering.
– change slide –
This is my personal favorite part!
Here’s a little bigger view…
Liz will go into more detail during the demo, but basically, Refine groups together data it thinks is similar, according to a number of factors you can adjust, so that you can review, modify, and batch edit. Faceting and clustering are by far the two functions I tend to use most in Refine.
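One way Refine decides that values are "similar" is key collision: each value is reduced to a normalized fingerprint, and values that collide on the same fingerprint fall into one cluster. A minimal Python sketch of that fingerprint idea (the sample names are invented; Refine's own implementation has more steps):

```python
# Sketch of key-collision clustering via a fingerprint:
# trim, lowercase, strip punctuation, then sort and dedupe tokens.
import string

def fingerprint(value):
    value = value.strip().lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

names = ["Hecker, Jennifer", "Jennifer Hecker", "jennifer hecker."]
clusters = {}
for name in names:
    clusters.setdefault(fingerprint(name), []).append(name)

print(clusters)  # all three variants collapse into one cluster
```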
A little background: in conversation, you’ll probably hear several names for this tool. Nobody calls it Freebase Gridworks any more, but the other three are all common. Google originally developed Refine, but then abandoned the project and it became open source, hence the name OpenRefine. Lots of folks – myself included – take the lazy approach and just call it Refine.
There are a number of tools out there that can help you manipulate data sets in a variety of ways. How do you know which is right for you? First, ask yourself some questions about your data.
Here’s a matrix that can help guide your tool selection. It’s not comprehensive; there are more tools out there for sure, and all these tools do more than the brief descriptions above imply (for example, Google Fusion Tables can be used to geocode location information and automatically generate maps, stuff like that), but these are the most common tools, and this gives you an idea of what to expect from each of them…
Ok, now I’m going to attempt to hand over the presentation to Liz, a couple hundred miles to my East.
Ok, so now that you’re all excited about what you can do with Refine, I’m going to quickly run through some of the more advanced functions. By using regular expressions, which I’ve seen described as “wildcards on steroids”, you can more finely filter and manipulate your data.
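For a taste of what regex filtering buys you, here is a small Python sketch (the sample date values are invented) of keeping only the rows of a messy date column that contain a four-digit year:

```python
# Sketch: filter rows whose value matches a pattern,
# here any standalone four-digit year in a free-text date field.
import re

rows = ["ca. 1912", "undated", "1875-1880", "19th century"]
year = re.compile(r"\b\d{4}\b")

matches = [r for r in rows if year.search(r)]
print(matches)  # ['ca. 1912', '1875-1880']
```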
Alongside those same regular expressions, Refine offers GREL, the General Refine Expression Language, for performing transformations on your data.
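A typical GREL transform is a short expression applied to every cell in a column, something like trimming whitespace and fixing case. Here is a Python sketch of what such a column-wide transform accomplishes (the sample values are invented; this is an analogy, not GREL syntax):

```python
# Sketch: apply one transform (trim whitespace, normalize case)
# to every cell in a column, the way a GREL expression would.
column = ["  austin ", "HOUSTON", "dallas"]
transformed = [v.strip().title() for v in column]
print(transformed)  # ['Austin', 'Houston', 'Dallas']
```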
Using various community-developed extensions, which you can easily select and install, you can retrieve data from online sources such as VIAF, and you can match data to external sources such as DBpedia.
Thanks for tuning in y’all! We hope this was helpful and we welcome any questions or feedback y’all might have!