Presented by Jennifer Hecker and Elizabeth Grumbach and hosted by the Texas Digital Humanities Consortium, these are the slides for the TxDHC training webcast on OpenRefine, February 12th, 2015.
5. Cleaning up data that:
is in a simple tabular format
is inconsistently formatted
has inconsistent terminology
6. get an overview of a data set
resolve inconsistencies
split data up into more granular parts
match local data up to other data sets
enhance a data set with data from other sources
14. …ask some questions about your data set:
What type of data is it & what format is it in?
What’s the size of your data set?
What question do you want to ask your data?
What do you need to do to find the answer?
15. Excel
familiar; better for data entry; cut-and-paste operations; no paging to navigate
Google Spreadsheets
similar to Excel; can get external data relatively easily; easy to collaborate and share
Google Fusion Tables
good if you just want to filter; easy to share
Text editor
a powerful text editor can do many things
Unix tools
more challenging to use, but quick, and some things (finding things, sorting) are easy
Writing code
the most sophisticated option, and the most to learn!
19. Retrieve data from online sources
example: use names to retrieve birth/death dates
from Virtual International Authority File (VIAF)
Match data to external data sources using
Extensions for RDF, DBpedia, Named-Entity
Recognition (NER), etc…
And ‘reconciliation’ services
20. Use ‘cross’ function to compare
contents of two Refine projects, or
share data between the two projects.
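The idea behind cross is a key-based lookup: match a value in one project against a column in another project and pull back data from the matching row. A minimal Python sketch of that idea (the names and the stand-in "second project" dictionary here are invented for illustration; this is not Refine itself):

```python
# Sketch of the idea behind Refine's cross(): look up a key value
# in a second data set and pull back another column's value.
authors = ["Austen, Jane", "Twain, Mark"]

# Stands in for the second Refine project, keyed by the shared column.
dates_project = {
    "Austen, Jane": "1775-1817",
    "Twain, Mark": "1835-1910",
}

# For each author, fetch the matching dates (empty string if no match).
enriched = [(name, dates_project.get(name, "")) for name in authors]
print(enriched)
```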
21. TxDHC blog post on this webinar http://www.txdhc.org/txdhc-training-webcast-materials/
The OpenRefine Wiki https://github.com/OpenRefine/OpenRefine/wiki
OpenRefine User Documentation
https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users
The ‘Free your metadata’ site http://freeyourmetadata.org...
…and book http://book.freeyourmetadata.org
The OpenRefine mailing list and forum
http://groups.google.com/d/forum/openrefine
23. credits * acknowledgements * citations
These slides were developed by Jennifer Hecker (j.hecker@Austin.utexas.edu) and Liz Grumbach (egrumbac@tamu.edu) on behalf of the University of Texas Libraries, Texas A&M’s Initiative for Digital Humanities, Media and Culture, and the Texas Digital Humanities Consortium, using many resources, including the wonderful course material developed by Owen Stephens on behalf of the British Library (http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/).
Unless otherwise stated, all images, audio, or video content are separate works with their own licenses, and should not be assumed to be CC-BY in their own right. This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). When crediting this work, it is suggested that you include the phrase “Developed by Liz Grumbach and Jennifer Hecker on behalf of the University of Texas, Texas A&M, and the TxDHC.”
Thanks to the University of Texas Libraries, Texas A&M’s Initiative for Digital Humanities, and the Texas Digital Humanities Consortium for facilitating this presentation.
Editor’s Notes
Howdy there everybody! Thanks for joining this inaugural webinar from the Texas Digital Humanities Consortium. We are testing out this format for ongoing consortial training use.
This session is being recorded. You may follow along with these slides, or access the recording, slides, and supplementary materials on the Texas Digital Humanities Consortium website. Also on the website is a link to a three-question survey, and we would very much appreciate any feedback you are willing to provide. During the webinar, Liz and I are going to trade off presenting and chat-window-monitoring duties. Please be patient and cross your fingers for us!
I’m going to introduce today’s presenters very quickly. I’m Jennifer Hecker and I work at the University of Texas Libraries. I specialize in bringing my years of experience as an archivist to bear on our digital access challenges. I also work in the digital humanities space, coordinating collaborations and projects with students, faculty, and staff all over UT. I also direct the Austin Fanzine Project and do a lot of outreach and mentoring work.
Liz Grumbach works for the Initiative for Digital Humanities, Media, and Culture at Texas A&M as a Research Associate, where she supports faculty, staff, and student Digital Humanities projects and endeavors. She's also the Project Manager for the Advanced Research Consortium and 18thConnect.org, where she organizes peer review, supports the creation of digital editions, and maintains the digital records for all ARC research nodes. She's involved in the management of the Early Modern OCR Project (eMOP), which aims to teach machines how to read early modern fonts and to make open-source software packages available to other institutions seeking to automatically generate transcriptions of large page-image data sets.
An open-source tool for working with messy data.
Runs in a browser, but locally – your data don’t leave your machine.
Active development community – people creating extensions – and discussion list.
This is some of the basic stuff you might use Refine for. In a little bit, Liz is going to walk you through these functions. Refine does a lot more, too, but today we’re just going to get your feet wet. I’ll come back after the demo and talk a little bit about some of the more advanced possibilities that you can explore…
Refine lets you
Here’s a slide from a webinar I attended a couple of weeks ago. It’s an example of OpenRefine in action – here being used to normalize data as one step in the workflow of a larger metadata aggregation project. So what does it look like?
Refine lets you split out data that is in one cell into multiple cells – and vice versa.
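To make the split/join idea concrete, here is a minimal plain-Python sketch (not Refine itself; the sample cell value is invented) of breaking one cell into more granular parts and joining them back:

```python
# Sketch: splitting one cell into more granular parts, and back again.
cell = "Hecker, Jennifer"

# Split on the first comma and trim surrounding whitespace.
last, first = [part.strip() for part in cell.split(",", 1)]
print(last, first)  # Hecker Jennifer

# Joining works in reverse: recombine the parts into one cell.
rejoined = f"{last}, {first}"
print(rejoined)  # Hecker, Jennifer
```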
Here are some simple examples of what we mean when we talk about “normalizing metadata”. Refine lets you easily batch edit data so that it uniformly adheres to your standards.
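In spirit, batch normalization means mapping every variant spelling of a value onto one canonical form. A hedged Python sketch of that idea (the variant spellings and canonical form here are made up for illustration; Refine does this interactively via facets and batch edits):

```python
# Sketch: normalize variant values by mapping each one
# to a single canonical form.
variants = {
    "U.T.": "University of Texas",
    "UT Austin": "University of Texas",
    "Univ. of Texas": "University of Texas",
}

def normalize(value):
    """Return the canonical form if the value is a known variant."""
    return variants.get(value, value)

records = ["UT Austin", "Texas A&M", "U.T."]
cleaned = [normalize(v) for v in records]
print(cleaned)  # ['University of Texas', 'Texas A&M', 'University of Texas']
```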
Here’s what text faceting looks like. It’s useful for getting an overview of your data. Here you can quickly see some inconsistencies you might want to address.
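Under the hood, a text facet is essentially a frequency count of the distinct values in one column, which is why inconsistencies jump out. A small Python sketch of the same overview (the sample column values are invented):

```python
# Sketch: a text facet is a count of distinct values in a column.
from collections import Counter

column = ["Austin", "austin", "Austin, TX", "Houston", "Austin"]
facet = Counter(column)

# Listing values by frequency makes inconsistent forms easy to spot.
for value, count in facet.most_common():
    print(f"{value}  ({count})")
```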
Refine also lets you do something called clustering.
– change slide –
This is my personal favorite part!
Here’s a little bigger view…
Liz will go into more detail during the demo, but basically, Refine groups together data it thinks is similar, according to a number of factors you can adjust, so that you can review, modify, and batch edit. Faceting and clustering are by far the two functions I tend to use most in Refine.
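One way Refine decides that values are "similar" is key collision: each value is reduced to a normalized fingerprint, and values that collide on the same fingerprint fall into one cluster. A minimal Python sketch of that fingerprint idea (the sample names are invented; Refine's own implementation has more steps):

```python
# Sketch of key-collision clustering via a fingerprint:
# trim, lowercase, strip punctuation, then sort and dedupe tokens.
import string

def fingerprint(value):
    value = value.strip().lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

names = ["Hecker, Jennifer", "Jennifer Hecker", "jennifer hecker."]
clusters = {}
for name in names:
    clusters.setdefault(fingerprint(name), []).append(name)

print(clusters)  # all three variants collapse into one cluster
```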
A little background: in conversation, you’ll probably hear several names for this tool. Nobody calls it Freebase Gridworks any more, but the other three are all common. Google originally developed Refine, but then abandoned the project and it became open source, hence the name OpenRefine. Lots of folks – myself included – take the lazy approach and just call it Refine.
There are a number of tools out there that can help you manipulate data sets in a variety of ways. How do you know which is right for you? First, ask yourself some questions about your data.
Here’s a matrix that can help guide your tool selection. It’s not comprehensive; there are more tools out there for sure, and all these tools do more than the brief descriptions above imply (for example, Google Fusion Tables can be used to geocode location information and automatically generate maps, stuff like that), but these are the most common tools, and this gives you an idea of what to expect from each of them…
Ok, now I’m going to attempt to hand over the presentation to Liz, a couple hundred miles to my East.
Ok, so now that you’re all excited about what you can do with Refine, I’m going to quickly run through some of the more advanced functions. By using regular expressions, which I’ve seen described as “wildcards on steroids”, you can more finely filter and manipulate your data.
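For a taste of what regex filtering buys you, here is a small Python sketch (the sample date values are invented) of keeping only the rows of a messy date column that contain a four-digit year:

```python
# Sketch: filter rows whose value matches a pattern,
# here any standalone four-digit year in a free-text date field.
import re

rows = ["ca. 1912", "undated", "1875-1880", "19th century"]
year = re.compile(r"\b\d{4}\b")

matches = [r for r in rows if year.search(r)]
print(matches)  # ['ca. 1912', '1875-1880']
```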
Alongside those same regular expressions, Refine offers GREL, the General Refine Expression Language, for performing transformations on your data.
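A typical GREL transform is a short expression applied to every cell in a column, something like trimming whitespace and fixing case. Here is a Python sketch of what such a column-wide transform accomplishes (the sample values are invented; this is an analogy, not GREL syntax):

```python
# Sketch: apply one transform (trim whitespace, normalize case)
# to every cell in a column, the way a GREL expression would.
column = ["  austin ", "HOUSTON", "dallas"]
transformed = [v.strip().title() for v in column]
print(transformed)  # ['Austin', 'Houston', 'Dallas']
```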
Using various community-developed extensions, which you can easily select and install, you can retrieve data from online sources such as VIAF, and you can match data to external sources such as DBpedia.
Thanks for tuning in y’all! We hope this was helpful and we welcome any questions or feedback y’all might have!