Information technology and resources are integral to the contemporary academic enterprise. In particular, technological advances have nurtured a new paradigm of data-intensive research. However, far too much of this activity still takes place in silos, to the detriment of open scholarly inquiry, integrity, and advancement. To counteract this tendency, the University of California Curation Center (UC3) has been developing and deploying a comprehensive suite of curation services that facilitate widespread data management, preservation, publication, sharing, and reuse. Through these services UC3 is engaging with new communities of use: in addition to its traditional stakeholders in cultural heritage memory organizations (e.g., libraries, museums, and archives), the UC3 service suite is now attracting significant adoption by research projects, laboratories, and individual faculty researchers. This webinar introduces five services applicable to data curation throughout the scholarly lifecycle (DMPTool, DataUp, EZID, Merritt, and the Web Archiving Service (WAS)); two recent initiatives in collaboration with UC campuses (UC Berkeley Research Hub and UC San Francisco DataShare); and the ways in which they encourage and promote new communities of practice and greater transparency in scholarly research.
1. Building Communities and Services in
Support of Data-Intensive Research
Stephen Abrams
University of California Curation Center
California Digital Library
August 20, 2013
2. Topics
Data curation
UC3 services
DMPTool
DataUp
EZID
Merritt
WAS
Collaborative initiatives
DataShare
Research Hub
Conclusions
3. Why is data curation important?
Integrity
Enabling appropriate scrutiny, debate, reproduction, and
verification of results
Efficiency
Avoiding needless duplication of effort
Policy
Complying with institutional policies, publication requirements,
and funder mandates
“[Data] is a valuable national asset whose value is multiplied when it is made
easily accessible to the public”
– Office of Science and Technology Policy
4. Why is data curation important?
Catalyzing
Promoting progress through new collaborations and creative
(re)use of data
“If I have seen further it is by standing on the shoulders of giants”
– Isaac Newton, 1676
5. What is the library’s role?
A continuation of its long-standing mission and practice to
connect patrons with content of interest in meaningful ways
across barriers of space and time
Cf. Tenopir et al. (2012), “Academic librarians and research data services: Preparation and attitudes,” 78th
IFLA General Conference and Assembly, Helsinki, http://conference.ifla.org/past/ifla78/116-tenopir-en.pdf
Offering solutions that enhance the natural points of
alignment between the scholarly research and information
lifecycles
[Figure: the scholarly lifecycle and the information lifecycle, linking research and curation. Labels: Publish, Reuse, Share, Create, Discover, Collect, Preserve, Access]
6. Why is data curation hard?
Ever increasing number, size, and diversity of content
Inevitability of disruptive change
Resources not keeping pace with growth
Stakeholders outside of traditional cultural heritage domains,
with lots of questions
Who can give me advice on what I should do?
How should I describe and package my data?
How can I cite my data in order to receive
credit for it?
How can I share my data?
What can I do with web published data?
…
7. DMPTool – guidance and resources
Finalist, 2012 DPC Award for Research and Innovation
http://dmptool.org/
Create, edit, and share data
management plans
Meet funder requirements
Provide institutional guidance
Links to local resources
9. DMPTool – guidance and resources
Two recently funded projects
Functional enhancements and open source community development (Sloan Foundation)
Training and outreach (IMLS)
http://dmptool.org/
New options for DMP
collaboration and formal
and ad hoc review
Stronger administrative
control and customization
10. DataUp – description and packaging
http://dataup.cdlib.org/ http://www.dataup.org/
“It’s easier to augment systems than
it is to change behavior”
Curation for tabular datasets
Excel add-in
Azure cloud service
11. DataUp – description and packaging
http://dataup.cdlib.org/ http://www.dataup.org/
Best practices check
Data description
Identifier and citation generation
Repository submission to
ONEShare
12. DataUp – description and packaging
http://dataup.cdlib.org/ http://www.dataup.org/
What researchers don’t need to know
Schema definition and XML syntax
Identifier registration procedures
Citation format
Repository packaging and submission
Harvesting for aggregation
2013 Innovation Award winner
Recently funded project
Functional enhancements and open
source community development
NSF
13. EZID – identification and citation
http://n2t.net/ezid/
UC3 is a founding member of the DataCite consortium
Mint DOI and ARK
Add descriptive metadata
Receive QR code
Global resolution
Aggregated discovery
Updatable resolution URLs
Establish and maintain persistent two-way
linkages between the literature and the data
that underlies its results
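The minting steps on this slide (request an identifier, attach descriptive metadata, set an updatable resolution URL) might be scripted roughly as follows. This is a minimal sketch, assuming EZID accepts metadata as line-oriented ANVL text ("name: value" pairs with percent-escaping), which is our understanding of its API rather than something stated on the slides. The helper names, the example.org target URL, and the metadata values are hypothetical, and no request is actually sent.

```python
import re


def _percent_encode(match: "re.Match[str]") -> str:
    # Replace one reserved character with its %XX form.
    return "%%%02X" % ord(match.group(0))


def escape_name(name: str) -> str:
    """Escape '%', ':' and line breaks in an element name."""
    return re.sub(r"[%:\r\n]", _percent_encode, name)


def escape_value(value: str) -> str:
    """Escape '%' and line breaks in an element value."""
    return re.sub(r"[%\r\n]", _percent_encode, value)


def build_anvl(metadata: dict) -> str:
    """Serialize a metadata dict as one 'name: value' pair per line."""
    return "\n".join(
        f"{escape_name(name)}: {escape_value(value)}"
        for name, value in metadata.items()
    )


# Hypothetical record: the resolution target plus one descriptive element.
record = {
    "_target": "http://example.org/dataset/42",
    "datacite.title": "Soil cores\n2012 season",
}
print(build_anvl(record))
```

Because the resolution target is just another metadata element, repointing an identifier when a dataset moves is a metadata update rather than a new identifier, which is what keeps citations stable.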
14. EZID – identification and citation
Link to dataset in repository
http://n2t.net/ezid/
15. EZID – identification and citation
Link from dataset landing page to article
citing the data
16. EZID – identification and citation
Link from article back to dataset
17. EZID – identification and citation
Aggregated discovery via DataShare and Ex Libris Primo
Later this year, aggregation via the Thomson Reuters Data Citation Index
18. EZID – identification and citation
SEO for public visibility in leading search engines
19. Merritt – preservation and access
Content agnostic, model free
Micro-service architecture
UI and RESTful API
26 curatorial units
271 collections
325,000 objects
450,000 versions
4,500,000 files
13 TB
http://merritt.cdlib.org/
Enforceable Data Use Agreements (DUAs) in response to concerns over potential loss of control over dissemination and reuse
Open to the UC community and external partners
Dark archive for long-term assurance
Bright archive for sharing
Integration with preservation grids
Integration with public access portals
Integration with CMS
20. Merritt – preservation and access
For curatorially-designated collections and
objects, a download request triggers …
21. Merritt – preservation and access
Click-through DUA; acceptance of terms of
use triggers …
22. Merritt – preservation and access
From: no-reply-merritt@ucop.edu
Subject: Merritt DUA acceptance
Name: Stephen Abrams
Affiliation: California Digital Library
Collection: UCSF DataShare
Object: Frontotemporal Lobar Degeneration (FTLD)
Date: 2013-05-31 09:50:34 PDT
Terms of use: As part of this agreement, Consumer submits to the following
statements:
(1) I will receive access to de-identified data and will not attempt to establish the
identity of any of the study subjects.
(2) I will share these data only with my immediate co-workers, and I will not transfer
these data to other research groups. I understand that these data are available to
other research groups through the process by which I obtain them.
(3) I will require anyone in my group who utilizes these data, or anyone with whom I
share these data to comply with this data use agreement
...
Email notification to consumer and curator
Delivery of requested content
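The click-through flow described across these Merritt slides (a download request on a curatorially designated object, acceptance of the terms of use, email notification, then delivery) can be sketched as a small decision function. This is an illustrative sketch only, not Merritt's implementation: the class, field, and action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DownloadRequest:
    consumer: str        # person requesting the content
    collection: str      # collection the object belongs to
    obj: str             # requested object
    dua_required: bool   # hypothetical flag set by the curator
    dua_accepted: bool = False


def next_actions(request: DownloadRequest) -> List[str]:
    """Return the ordered actions the service would take for this request."""
    if not request.dua_required:
        # No agreement attached: content is delivered directly.
        return ["deliver content"]
    if not request.dua_accepted:
        # Designated object: show the terms of use before anything else.
        return ["present click-through DUA"]
    # Acceptance triggers notification of both parties, then delivery.
    return ["email consumer", "email curator", "deliver content"]


request = DownloadRequest(
    consumer="Stephen Abrams",
    collection="UCSF DataShare",
    obj="Frontotemporal Lobar Degeneration (FTLD)",
    dua_required=True,
)
print(next_actions(request))   # terms of use must be accepted first
request.dua_accepted = True
print(next_actions(request))   # notifications, then delivery
```

Keeping the agreement check ahead of delivery is what makes the DUA enforceable rather than advisory: content is never released on a designated object until acceptance has been recorded and both parties notified.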
23. Web Archiving Service
http://was.cdlib.org/
Collect, describe, manage, preserve, and provide access to web sites
Analysis tools
Full-text search
27 curatorial units
185 collections
10,772 web sites
97,121 captures
64 TB
“You can’t study life in our time without the Internet, so we must preserve it”
– René Voorburg, KB
Initially developed as part of the NDIIPP-funded Web at Risk project
The web has become the publication platform of choice
Source of important primary and secondary research data
24. Web Archiving Service
For example, California water district web sites
supplement UC Davis source water assessment
and protection (SWAP) Merritt collections
25. Connecting to communities of practice
Engage with new user communities where and how they
already work
Shifting user roles, shifting expectations
Institutional → individual researcher
Behavioral expectations set by the commercial/mobile web
26. DataShare – catalyzing science
UCSF Clinical and Translational Science Institute
http://ctsi.ucsf.edu/
UCSF Library
http://www.library.ucsf.edu/
UCSF Center for Imaging of Neurodegenerative Disease
http://www.radiology.ucsf.edu/cind/
http://datashare.ucsf.edu/
“Making data transparent and available is going to accelerate all of science; it's a relatively inexpensive way to get more value out of all of the work that we do”
– Michael Weiner, UCSF
Pilot project in biomedical imaging
“The goal is to catalyze widespread sharing of scientific research data”
Prepare
Describe
Upload
Curate
Discover
Share
27. DataShare – catalyzing science
UCSF-developed submission client, supporting intuitive
drag & drop operation and metadata entry
EZID for DOIs; Merritt for preservation
XTF-based faceted search/browse portal
http://xtf.cdlib.org/
28. Research Hub – content mgmt and collaboration
3,900 users
770 projects
Alfresco CMS
Desktop sync
Mobile apps
Adobe Creative Suite
Personal file management
Project collaboration
Departmental resource pooling
Research data sharing
“Powerful tools for management and collaboration”
Create
Organize and enrich
Keep safe
Share
http://hub.berkeley.edu/
UC Berkeley Information Services & Technologies
http://ist.berkeley.edu/
29. Research Hub – content mgmt and collaboration
Primary discovery and access via Research Hub
EZID for DOIs; Merritt for preservation
Merritt access called for in succession plans
30. Data curation
“Access to and sharing of data are essential for the conduct and
advancement of science”
– Arzberger et al. (2004), “Promoting access to public research data for scientific, economic,
and social development,” Data Science Journal 3: 135-52, doi:10.2481/dsj.3.135
Pro-active curation of research outputs is necessary to ensure
their ongoing viability and use
Good for research; good for researchers
Quicker, more innovative science; higher impact factor
Increasingly necessary for conformance to institutional
policies, publication requirements, and funder mandates
31. Data curation
Widespread adoption is dependent on outreach, education,
and minimal intrusion into existing disciplinary workflows and
common community practices
The most effective – and sustainable – curation services are
composed from best-of-breed components
Libraries are a natural curation partner for the research
community
32. For more information
UC Curation Center
http://www.cdlib.org/uc3/
uc3@ucop.edu
Stephen Abrams David Loy
Patricia Cruse Mark Reyes
Shirin Faenza Joan Starr
Scott Fisher Carly Strasser
Erik Hetzner Marisa Strong
Joshua Hubbard Bhavitavya Vedula
Greg Janée Kenneth Weiss
John Kunze Perry Willett
Rosalie Lack
DataShare
http://datashare.ucsf.edu/
Geoffrey Boushey Megan Laurance
Anirvan Chatterjee Angela Rizk-Jackson
Maninder Kahlon Michael Weiner
Julia Kochi
Research Hub
http://hub.berkeley.edu/
Ian Crew Patrick McGrath
Michael McCarthy Noah Wittman