Chuck Henry and Christa Williford, DLF Forum, November 2011
Lessons from the Digging Into Data Challenge
What Information Professionals Should Know about Computationally Intensive Research in the Humanities and
Social Sciences
For the past two years, the Council on Library and Information Resources (CLIR) has partnered with the National
Endowment for the Humanities Office of Digital Humanities (NEH‐ODH) in an intensive assessment of the inaugural
year of the Digging Into Data grant program. Launched in 2009, this unprecedented international initiative involved
four funding agencies in three countries and supported eight international collaborative research projects in the
social sciences and humanities, all of which bring innovative applications of computer technology to bear on the
collection, mining, and interpretation of large data corpora. Here is a sampling of what CLIR has learned:
Lesson 1: Computationally intensive research requires open sharing of resources among participants. Essential
resources include hardware, software, data corpora, and communication tools. Information professionals can
facilitate open sharing by helping researchers forge partnership agreements based upon trust and transparency.
Example: To support the project “Digging Into Data to Answer Authorship Related Questions,” participants drafted
a Memorandum of Understanding that clarified how shared resources would be funded and established a plan for
project communication and credit sharing. See: Michael Simeone, Jennifer Guiliano, Rob Kooper, and Peter
Bajcsy, "Digging into Data Using New Collaborative Infrastructures Supporting Humanities‐based Computer Science
Research." First Monday 16.5 (2 May 2011):
http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/viewArticle/3372/2950
Lesson 2: Computationally intensive research projects rely upon diverse kinds of expertise: domain (or subject)
expertise, analytical expertise, data management expertise, and project management expertise. Information
professionals can offer and/or develop skills and knowledge in each of these areas, enabling them to participate
actively as research partners.
Example: For their project, “Digging Into the Enlightenment: Mapping the Republic of Letters,” Stanford University
provided resources and project management support to their international partners through “embedded”
information professional Nicole Coleman, who is based at the Stanford Humanities Center. As Academic
Technology Specialist, Coleman focuses on finding new research opportunities and supporting the production of
new knowledge, and she has developed expertise in the kinds of infrastructure and management practices that
contribute to successful research collaborations. For more information about this project, see:
http://enlightenment.humanitiesnetwork.org/
Lesson 3: When it comes to analytical tools, one size does not fit all. As their questions evolve throughout their
projects, researchers want the flexibility to alternate between looking closely at select data and performing
“distant” readings of entire corpora. Information professionals can educate researchers to help them refine their
questions, select appropriate tools, and use their tools effectively.
Example: While both close and distant readings of evidence characterized most of the Digging Into Data project
methodologies, Richard Healey, co‐principal investigator of “Railroads and the Making of Modern America,” has an
interesting take on why humanities and social science data requires the continual adaptation and evolution of
analytical tools. He hypothesizes many “different levels of data‐related operations,” each of which determines
the research outcomes that are possible. He writes:
The levels relate to the degree of scholarly input involved and I see them…as a data ‘hierarchy’:
• Level 0 ‐ Data so riddled with error it should come with a serious intellectual health
warning! (We have much more of this than most people seem willing to admit and much
of the Google data from scanned railroad reports admirably fits into this category).
• Level 1 ‐ Raw datasets…corrected for obvious errors.
• Level 2 ‐ Value‐added datasets: those that have been standardised/coded etc. in a
consistent fashion according to some recognised scheme or procedure, which may
require significant domain expertise [to produce]…
• Level 3 ‐ Integrated data resources: These will contain value‐added datasets
but…explicit linkages have been made between multiple related datasets (or they have
been coded/tagged in such a way that the linkages can be made by software). Hence,
these are not just 'data' because so much additional research time has been invested in
them, which is why I prefer the word ‘resource’…. Many GIS resources are of this kind,
because they require linkage of spatial and non‐spatial data.
• Level 4 ‐ 'Digging Enabler' or 'Digging Key' data/classificatory resources: These require
extensive domain expertise, and use of/analysis of multiple sources/relevant literature
to create. They facilitate extensive additional types of digging activity to be undertaken
on substantive projects beyond those of the investigators who created them, i.e., they
become 'authority files' for the wider research community. Gazetteers, structured
occupational coding systems, data cross‐classifiers etc. fit into this category.
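Because Healey’s levels form an ordered hierarchy, they lend themselves to a simple ordered encoding. The sketch below is purely illustrative: the enum member names and the helper function are our own hypothetical labels, not anything defined by the Digging Into Data projects. It shows how a curation tool might represent the five levels so they can be compared:

```python
from enum import IntEnum

class DataLevel(IntEnum):
    """Hypothetical encoding of Healey's data hierarchy (labels are ours)."""
    RAW_WITH_ERRORS = 0   # Level 0: data riddled with error
    RAW_CORRECTED = 1     # Level 1: raw data corrected for obvious errors
    VALUE_ADDED = 2       # Level 2: standardised/coded to a recognised scheme
    INTEGRATED = 3        # Level 3: explicit linkages across related datasets
    DIGGING_KEY = 4       # Level 4: authority files for the wider community

def supports_cross_dataset_digging(level: DataLevel) -> bool:
    # In Healey's scheme, only integrated resources and above embed the
    # linkages needed for digging across multiple datasets.
    return level >= DataLevel.INTEGRATED
```

Using an `IntEnum` makes the scholarly-input ordering explicit: `DataLevel.DIGGING_KEY > DataLevel.VALUE_ADDED` holds, so tools can gate operations on a minimum level.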
Lesson 4: Big data isn’t just for scientists anymore. Not only do humanists and social scientists work with big data,
but their research can also produce large data corpora. Some scholars engaged in computationally intensive research
see the new data they create as their most significant research outcomes. Researchers risk losing their valuable
data unless they take steps to protect and sustain them. As practices for publishing research data evolve,
information professionals can curate these data, working with scholars to appraise, normalize, validate, provide
access to, and, ultimately, preserve research data for the long term.
Example: In the final white paper for “Mining a Year of Speech,” John Coleman draws a compelling comparison
between the sizes of data sets with which current major science and humanities projects are engaged (see below).
This paper is available at:
http://www.phon.ox.ac.uk/files/pdfs/MiningaYearofSpeechWhitePaper.pdf