The document discusses trailblazing in research data management. It defines key terms like data, data management, and big data. It outlines why various stakeholders like funding agencies, universities, researchers, and libraries are venturing into research data management. It reviews assessments of data management needs conducted at various universities, examples of existing research data management programs, and available tools and resources. Finally, it discusses how institutions can blaze their own trail in research data management by identifying needs, partners, priorities, and potential services and policies to develop.
1. Trailblazing in the Wilderness of Data Management
Where are we going and how do we get there from here?
Stephanie Wright
Data Services Coordinator
University of Washington Libraries
2. Agenda
• Definitions
• Why venture out
• Paths already taken
–Assessments of needs
–Existing programs
–Tools & resources
• Blazing your own trail
Montana State University – 21 June 2013
3. Definitions
• Data
• Data Management
• Big Data
• Long Tail of Data
• Acronyms
www.lib.washington.edu
4. Definitions
DATA
By data, we do not mean a synonym for information. We
mean research data, that which is collected, observed,
or created, for purposes of analyzing to produce
original research results.
Research data may be created in tabular, textual,
statistical, numeric, geospatial, image, multimedia
or other formats.
(Adapted from DISC-UK DataShare Project, p. 16)
5. Definitions
DATA
Data can be produced from a variety of processes
(e.g., observation, experimentation, simulation,
derivation, compilation), represented in numerous
forms and stored in many digital formats (e.g.,
ASCII, PDF, SPSS, Excel, TIFF, Java, FITS, CIF, ZVI).
The scope of this definition includes data from
disciplines in the sciences, social sciences, and
humanities.
(Adapted from MIT Libraries, “What is Data?”, 2009)
6. Definitions
DATA MANAGEMENT
Pertains to the collection, cleaning, storage, sharing,
access, disposal, preservation and/or archiving of
research data.
(Adapted from University of North Carolina, Research Data Stewardship
Report, 2012)
13. Researchers
• Verifiability & reproducibility
• Increased citation rates for publications
– (Piwowar et al., 2007)
• Preservation of individual scholarly record
• Save time by planning early
14. Libraries
• Digital Preservation Network (DPN)
“The Digital Preservation Network is being
created by research-intensive universities to
ensure long-term preservation of the complete
digital scholarly record.”
http://d-p-n.org/
15. Libraries
NSF Proposal & Award Policies &
Procedures Guide (Oct 2012)
“Instructions for preparation of the
Biographical Sketch have been revised to
rename the "Publications" section to
"Products" ....
(P)roducts may include, but are not limited
to, publications, data sets, software,
patents, and copyrights.”
16. Paths Already Taken
• Assessments
• Existing programs
• Tools & Resources
Image credit: John W. Ridge
(http://commons.wikimedia.org/wiki/File:Yellowstone_Trail_Map.jpg)
17. Assessments
• UNC (2012) “Research Data Stewardship
Report”
• University of Colorado Boulder (2012)
“Research Data Management @ UCB”
• Purdue “Data Curation Profiles Directory”
(http://docs.lib.purdue.edu/dcp/)
• More: Georgia Tech, Cornell, Houston,
Oregon….
18. Findings
• Researchers use a wide variety of data
types – across disciplines
• Most researchers rely on themselves for
data management
• Researchers want to maintain control of
their data
• Many are unaware of existing services
• They want tools that work in existing
workflows
20. Existing Programs
• Cornell
– Research Data Management Service Group
• Sr VP for Research and University Librarian
• Faculty Advisory Board
– 9 faculty across disciplines
– OSP & Office of Research Integrity & Assurance
• Management Council
– 2 librarians, 2 faculty, 2 IT, 1 research institute
21. Existing Programs
• Purdue
– D2C2: Distributed Data Curation Center
• Executive Committee
– Dean of Libraries, VP of Research & VP of IT
• Library: consulting & metadata support
• IT: storage & research computing support
22. Existing Programs
• University of Washington
– Data Services Program (1.5 FTE)
• Data Services Coordinator
• Data Services Communications & Curriculum Librarian
– Data Services Team (10 members)
– Partnerships
• Research Centers (eSci, CSDE, IHME)
• Office of Research (OSP)
• Campus IT
• iSchool
24. Blazing Your Own Trail
Image credit: Michigan State University Department of History,
HST 321: History of the American West
(http://history.msu.edu/hst321/files/2010/07/colter.jpg)
25. Where do you want to go?
• Identify needs
• Consider potential partners
• Scope
– Disciplines
– Specific areas of the data lifecycle
• Determine priorities
– New services? Enhance existing? Market
existing?
26. MSU Strategic Plan
• Objective L1
– Assess and improve where needed, student
learning of critical knowledge & skills
• Objective D1
– Elevate the research excellence and
recognition of MSU faculty
• D1.2
• Objective D2
– Enhance infrastructure in support of research,
discovery and creative activities
27. Campus IT
• Support for active data storage
• Data security guidance
• Backup services
• Development of tools that can be
inserted into existing workflows
28. Office of Research
• Guidance on legal / ethical
considerations
• Incorporate DM planning into
grant submission process
• New faculty data management
orientations
29. Libraries
• Market and provide access to
existing RDM resources
• Provide learning opportunities on
RDM best practices
• DMP consultation
• Storage (final)
• Metadata consultation
32. Stephanie Wright
Data Services Coordinator
swright@uw.edu
@shefw
http://guides.lib.washington.edu/swright
Data Management Guide
http://guides.lib.washington.edu/dmg
ResearchWorks Data Services
http://researchworks.lib.washington.edu/rw-data.html
Editor's notes
Here is where I admit that perhaps my use of the terms trailblazing and wilderness of data mgmt might have been colored by the fact that y’all are so close to Yellowstone which has been one of my favorite places to visit since I was a child. But I defend my use of those words and hope to convince you over the next hour or so that I wasn’t really venturing too far into the realm of hyperbole when I came up with that title.
Here is my map for this little journey. And here I want to take a moment to let you know that we have arranged for Q&A time at the end of my presentation portion, but I also want you all to feel comfortable stopping me at any time and asking questions as I go along. Data management is a multi-faceted topic and I don’t want you to feel like you have to hold your questions until I’m done yakking and then say “Remember that slide you had up 20 minutes ago?” I also recognize that people are at varying levels of understanding of the issues surrounding data management. In reality, everyone is new to this. I understand not everyone reads data management needs assessments for fun. Please don’t be afraid to ask me to clarify anything.
I don’t want to get bogged down in terminology & definitions, but I do want to make sure that I’m not speaking a different dialect or even a different language up here so I’ve outlined a few terms where I thought it might be useful to have some clarification.
First, there’s “data”. You would not believe how many definitions there are for such a tiny word. This one used to be my favorite definition and was the one we used in the research data management needs survey we conducted last fall. It’s adapted from the DISC-UK DataShare Report and I like it because 1) it’s short and 2) it doesn’t overtly align itself to a particular discipline or data format. It can be textual, images, videos, computer models… it’s all data. And when we’re talking data services, at least at UW, we’re mostly looking at supporting digital data services.
Even with this definition, some folks (usually in the Humanities) don’t see what they do as “data”. So I’ve added another piece to my favorite definition.
This is adapted from an MIT Libraries definition and I like it because it adds the variety of processes that can be used in the collection of data, as well as specifically stating that it is discipline agnostic. I don’t know if I would have gotten more responses from our Humanities colleagues on our survey if I had added this to the definition but when we get around to doing our focus groups with those researchers, I will ask them.
Now there are many processes that data goes through. I already mentioned collection, but just as there is a lifecycle associated with research, there is also a lifecycle associated with data. There are a multitude of data lifecycle models out there. In essence, data management pertains to the various processes involved in managing data through the entire data lifecycle – from planning and collection, all the way through to preservation and archiving.
This is not my favorite term but one hears it so much these days, I feel I need to talk about it.
Many people refer to big data as data that are high volume, high velocity, and/or high variety information assets that require new forms of processing for decision making and insight.
Large amounts of data (gigabytes, petabytes, yottabytes)
Highly complex sets of data / flat schemas, few complex interrelationships
Loosely structured data… or highly structured
Technology that handles large and complex data sets
Process for analyzing large and complex data sets
Data sets that can generate insights previously impossible
Availability of massive amounts of data
In short, “big data” can mean any number of things, which is why I don’t use the term. So moving on.
This is the term that probably requires the most explanation and you will probably most frequently hear it used in conjunction with the previous term because this is usually what “big data” is not and this graph actually explains it pretty well.
The vertical axis (the up and down line) is Frequency of Use. The horizontal axis (side to side) is the total inventory of data – everything, all collected data. The green part represents datasets that are popular, widely used and well managed (think of all the climate data collected and maintained over a hundred years by the National Climatic Data Center). The yellow part represents datasets that are less frequently used and are managed in some informal manner (maybe on departmental shared network folders). The red part – that is the long tail. It’s data that is rarely used and not managed in any kind of organized fashion. And it’s estimated that it’s 80-85% of all data collected.
It’s that red part where many organizations tend to focus with data services because that’s where the needs are greatest. It’s the data that a researcher collected 10 years, 5 years, 1 year ago that’s sitting on a floppy, a CD or a thumb drive in, or worse, under a researcher’s desk. You may notice that the size of the dataset is not represented on this graph. Size is not a factor in determining if a data set falls into the long tail.
I try to avoid acronyms but after the first ten times, even I get tired of saying research data management over and over.
Don’t think I need to define RDM any further since I already specified research data in my definition of data and defined data management.
IR (institutional repository) – Central location for storing an institution’s digital assets and intellectual outputs (e.g., MSU’s ScholarWorks).
DR (data repository) – Repository specifically designed for storage of and access to data sets. Can be part of an IR.
DMP (data management plan) – A document outlining how a researcher plans to manage data during and after a research project, including how it will be organized, maintained and shared.
Alright, definitions done. On with the meat of it. So why is data management such a hot topic? Why do we even need to do anything differently than we’ve been doing in the past? I’m going to break things down by the different players involved.
As long as empirical research has existed, researchers have been doing “data management” in one form or another. However, funding agency mandates for doing formal data management are relatively recent.
1998 – NSF instituted DMP requirement
2003 – NIH implemented data sharing policy
2011 – NSF more strongly enforced DMP requirement
2013 – NSF changed merit review criteria for grant proposals to allow inclusion of datasets (Jan?); OSTP mandate for public access to federally funded research (Feb); OMB mandate for government Open Data (May); NIH enforcing public-access policy
http://grants.nih.gov/grants/policy/data_sharing/
http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/gpg_sigchanges.jsp
http://scholarlykitchen.sspnet.org/2013/02/25/expanding-public-access-to-the-results-of-federally-funded-research-first-impressions-on-the-us-governments-policy/
http://www.federalnewsradio.com/513/3316130/White-House-mandates-open-data-releases-new-tools
http://www.nature.com/nm/journal/v19/n1/full/nm0113-3.html
By providing support for data management, universities increase the competitiveness of their researchers for obtaining grants
Maximize potential of researchers as they can reuse data already collected by others.
Don’t think I need to say anything extra about that third point.
Encourages innovation & discovery by allowing researchers to think of research questions in new ways using existing data
As it gets harder and harder to obtain dollars for research, researchers are under increasing scrutiny to be able to verify their research. By following data mgmt best practices, you can produce your data and the associated documentation if needed to verify and reproduce your research.
Heather Piwowar and friends published research in 2007 showing that publications with publicly available associated data had a 69% increase in citations.
Let me tell you a story about a Nursing faculty member I interviewed as part of our RDM survey and follow-up interviews project this year. The week I went to interview her, she had just been told that the IT folks could not recover the over 30 years of research she had been saving on the departmental server when the hard drive failed. And when I say 30 years of research, I mean all her papers, her data, her codebooks, her scripts. Everything. Gone. She did not have this research anywhere else because she was under the assumption that it was being backed up.
On that last point, planning for data management tasks at the beginning of the research process is a lot less time-consuming than doing the data forensics after the research project is over, 5, 10, 15 years down the road.
Alright, so why am I including libraries in this?
Earlier this year the Digital Preservation Network (or DPN) was formed and the UW became a member. While it is not focused solely on data, its mission is to “ensure long-term preservation of the complete digital scholarly record”. Not just ejournal articles and ebooks: the COMPLETE digital scholarly record, and data is certainly part of that. And if you’re wondering what this has to do with libraries, look at that last part of the mission statement again. Take out “digital” and isn’t that why academic libraries were born?
Data is recognized as a valuable scholarly output. The NSF made that even more explicit in October of last year when it made this change to its Proposal & Award Policies & Procedures Guide.
If libraries don’t step up to the plate and provide data management support, everybody is going to try to figure out a way to do it themselves, in a way that meets their own individual needs. To put it in perspective, imagine if every department on campus were maintaining the books and journals in their own subject areas. This is what libraries are supposed to do… it’s why we’re here.
And now it’s time to take those skills librarians have always used to do those same things we’ve always done for traditional scholarly outputs and adapt them to meet scholarly data needs. Skills like how to organize information, metadata creation, providing access to information. Reference skills in particular are key: ability to liaise, to communicate across disciplines, to refer, consult, to teach.
Off my librarian soapbox… for now.
There has been a lot of work done in this area over the last several years. In order to get to the next section, about blazing your own trail, I thought it might be helpful to look at what’s already been done. There have been several data management needs assessments, there are some existing programs to look at, and a lot of useful tools have been developed to help support data mgmt needs.
In my former life, I was an assessment librarian so it does my heart good to see so many folks out there that have been doing needs assessment for data management. I’ve listed a few of my favorites here. I’m extremely impressed with what Purdue has done with their Data Curation Profiles and they have now created a directory of profiles from not only Purdue, but profiles submitted by other institutions, as well. I’ve mentioned that we did a survey & interviews recently, though we haven’t yet published our results. I did get to present our preliminary findings at a conference recently with Georgia Tech, Cornell & Purdue and though there were differences in our methodologies, populations and findings, there are some needs that keep coming up across multiple assessments.
Wide variety of data types, wide variety of file sizes
D2C2 is centered in the Research Department of the Purdue University Libraries. It comprises four core researchers who work closely with subject specialist liaisons in discipline areas throughout the Libraries.
3 FTE who work with subject librarians
An open source tool helping researchers document, manage, and archive their tabular data, DataUp operates within the scientist's workflow and integrates with Microsoft® Excel.
A tool for helping people identify and locate online repositories of research data.
Rates the current state of the researcher’s data management practices. The system compares the information collected during the data interview process with these data management best practice statements. A framework for comparing and improving departmental data management practices.
Alright, so we’ve talked about why data management is important and what’s been done in the area so far. Now let’s walk forward to how to provide support here at MSU.
Let’s start with your strategic plan, because you already have objectives listed there where parts of a data services program would fit in nicely.
Develop a separate RDM strategic plan
I won’t go into the whole strategic planning process… there are several ways to go about it. UW Libraries uses the Balanced Scorecard system for its strategic plan. The Data Services Team and I have been working on a logic model to help us develop our programmatic strategic plan.
Here are a few things to consider.
You already have some starting points. Look at the MSU strategic plan.
Data management isn’t just important for current researchers, but also for future researchers, as well. At UW, we are developing data management learning opportunities for librarians, faculty & students. Consider the integration of data literacy into grad research methods courses.
D1.2 specifically mentions measuring achievement in this objective through peer-reviewed publications and journal citations. I would suggest including other, alternative metrics here, such as data set downloads and citations, and reuse of existing datasets for new research.
D2: It sounds like you are already on your way with the recent release of your IR, ScholarWorks. If so desired, you can also use your IR to support data management by allowing for the deposit of data sets.
Now I’m spending the rest of my day after this presentation talking with different groups on campus so I can’t even begin to make any specific recommendations, but here are a few ideas. Some possible roles for campus IT.
When I say active data storage, I’m talking about storage during the phase of research where data is being actively collected, accessed, manipulated and shared among collaborators, as opposed to the final version of a data set that is preserved for future reuse.
Here are a few ideas for Office of Research.
At UW we’re working with our Office of Sponsored Programs on that last bullet point. In a recent meeting, we talked about looking into the feasibility of coordination between my shop and OSP when a researcher is submitting a grant proposal to a funding agency that requires a DMP.
I’ve already mentioned how librarians have certain skillsets that are conducive to data management support. Here is just a smattering of possible services they can provide. At the UW we provide all of these, though not at as high a level as I would like, but we’re working on that.
And that’s something to keep in mind, as well. You don’t have to come out of the gate with everything polished. We sure didn’t. When the NSF announced it was enforcing the DMP mandate, I threw up a quick and dirty LibGuide on DMPs. The next year I received a Friends of the Libraries grant to develop a more robust data management guide. It’s better, but it’s still not the site I want it to be.
Consider what can be done at a broad university level, as well, not just by individual groups on campus. Here are a few suggestions on that front.
In short, research data management services fit well with the saying “It takes a village.” There are lots of parts to be played, and some units are better suited to certain roles than others. There are many things that can be done to support data management. Some are low-hanging fruit; for some you might need a stepladder.
The key is to do something. Because doing nothing really isn’t an option.