This document summarizes a collaboration between UCSF CTSI, the UCSF Library, and the California Digital Library (CDL) to create DataShare, an open data repository. It describes the growing need to share research data under funder and publisher requirements. Each partner contributes different expertise: UCSF CTSI provides knowledge of researchers and access to data, the Library provides metadata expertise and programming resources, and CDL provides preservation tools. DataShare uses CDL technologies: Merritt for storage, EZID for identifiers, and XTF for search and browse. An ingest tool simplifies submission. Outreach and incentives, such as visibility, credit, and fulfilling requirements, are needed to encourage adoption. Technical and policy challenges remain around standards, ownership, and interoperability.
1. DataShare:
Collaboration Yields Promising Tool
Julia Kochi, UCSF Library
Angela Rizk-Jackson, UCSF CTSI
Perry Willett, California Digital Library
CNI 2013 Meeting
San Antonio, TX
3. What is DataShare?
An open data repository for the UCSF
researcher
A concept initially envisioned by Michael
Weiner, M.D.
A collaboration between UCSF CTSI, UCSF
Library, and the California Digital Library
4. The Problem
Increasing requirements to share data
• NIH grants >$500k
• Publisher requirements
Unequal availability of national repositories
Campus priorities
FASTR, White House Directive
5. The Partners
UCSF CTSI
• Knowledge of the researcher, access to the data
UCSF Library
• Metadata expertise, programming resources
UC3
• Preservation tools, services and expertise
8. Merritt Repository Service
Built on “micro-services” principles
Content and format agnostic
Has a UI and RESTful APIs to submit and
retrieve content, and check statuses
Can serve as either “dark” or “bright” archive
Added public access, data use agreements, and
asynchronous downloads as part of the
DataShare project
9. EZID
Service for creation and management of
long-term identifiers
Currently supports ARKs and DOIs; other types
in planning stages
Registers DOIs with DataCite
Has a UI and APIs with good documentation
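As a rough illustration of what preparing a request for EZID's API can look like, the sketch below builds a metadata body in ANVL, the line-oriented "name: value" format that the EZID API documentation describes. The helper names and the sample values are hypothetical, the element names should be checked against the current EZID documentation, and no request is actually sent.

```python
import re


def anvl_escape(text: str) -> str:
    """Percent-encode characters that would break an ANVL line."""
    return re.sub(r"[%:\r\n]", lambda m: "%%%02X" % ord(m.group(0)), text)


def build_anvl(metadata: dict) -> str:
    """Serialize metadata as one 'name: value' line per element."""
    return "\n".join(
        f"{anvl_escape(name)}: {anvl_escape(value)}"
        for name, value in metadata.items()
    )


# Hypothetical example values, not a real dataset.
example = {
    "_target": "http://example.org/dataset/1",
    "datacite.title": "Example imaging dataset",
    "datacite.creator": "Doe, Jane",
    "datacite.publisher": "Example University",
    "datacite.publicationyear": "2013",
}

body = build_anvl(example)
print(body)
```

Escaping both names and values keeps the sketch simple; the service is expected to decode the percent-encoding on its side.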
10. XTF
eXtensible Text Framework
Developed and maintained by CDL
Runs several CDL services:
• eScholarship
• Online Archive of California
• Calisphere
Faceted browsing, full-text search, other
desirable features
13. Ingest tool
Submitting content to a digital repository is
hard and costly
An attempt to simplify several aspects:
• Digital object creation
• Metadata creation
• Object submission
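To make these three aspects concrete, here is a minimal sketch of what an ingest helper might do before submission: gather files into one object, record a checksum for each, and attach descriptive metadata. All function and field names are hypothetical. This is not the DataShare ingest tool's code, and the actual submission call to a repository is left out.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def describe_file(path: Path) -> dict:
    """Record name, size, and SHA-256 digest for one file."""
    data = path.read_bytes()
    return {
        "name": path.name,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def build_package(files, title, creator, year) -> dict:
    """Combine descriptive metadata and a file manifest into one object."""
    return {
        "metadata": {"title": title, "creator": creator, "publication_year": year},
        "files": [describe_file(Path(f)) for f in sorted(files)],
    }


# Demonstrate with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "readings.csv"
    sample.write_bytes(b"subject,value\n1,0.5\n")
    package = build_package([sample], "Example dataset", "Doe, Jane", 2013)

print(json.dumps(package, indent=2))
```

Recording checksums at packaging time gives the receiving repository something to verify against after upload.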
15. Interactions for submission
Ingest Tool
• Creates Metadata
• Assembles Dataset
• Submits to Merritt
Merritt
• Requests DOI
• Receives DOI
• Packages object
• Submits Metadata to EZID
EZID
• Registers DOI and Metadata with DataCite
XTF
• Requests ATOM feed for collection
• Gets ATOM feed
• Retrieves Metadata
• Indexes metadata
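The interactions on this slide can be sketched as a small in-memory simulation. Everything here is a stand-in: the stub classes, method names, and identifier pattern are hypothetical and do not reflect the real Merritt, EZID, DataCite, or XTF interfaces, and the exact ordering of steps is an assumption read off the diagram labels.

```python
from dataclasses import dataclass, field


@dataclass
class DataCiteStub:
    registry: dict = field(default_factory=dict)

    def register(self, doi, metadata):
        self.registry[doi] = metadata


@dataclass
class EzidStub:
    datacite: DataCiteStub
    counter: int = 0

    def mint_doi(self):
        self.counter += 1
        return f"10.5072/FK2-{self.counter:04d}"  # made-up identifier pattern

    def submit_metadata(self, doi, metadata):
        # EZID registers the DOI and its metadata with DataCite.
        self.datacite.register(doi, metadata)


@dataclass
class MerrittStub:
    ezid: EzidStub
    collection: list = field(default_factory=list)

    def submit(self, metadata, files):
        doi = self.ezid.mint_doi()  # requests and receives a DOI
        self.collection.append({"doi": doi, "metadata": metadata, "files": files})
        self.ezid.submit_metadata(doi, metadata)
        return doi

    def atom_feed(self):
        # Stand-in for the collection feed that the search layer polls.
        return [{"doi": o["doi"], "metadata": o["metadata"]} for o in self.collection]


@dataclass
class XtfStub:
    index: dict = field(default_factory=dict)

    def harvest(self, merritt):
        for entry in merritt.atom_feed():
            self.index[entry["doi"]] = entry["metadata"]


datacite = DataCiteStub()
merritt = MerrittStub(EzidStub(datacite))
xtf = XtfStub()

# The ingest tool's role: create metadata, assemble the dataset, submit.
doi = merritt.submit({"title": "Example dataset"}, ["readings.csv"])
xtf.harvest(merritt)
print(doi, xtf.index[doi]["title"])
```

Keeping each service behind its own small interface mirrors the division of labor in the diagram: the ingest tool only talks to the repository, and the search layer only reads the feed.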
16. Process for End Users
Search, browse
Request dataset download
Fill out Data Use Agreement
Receive dataset
20. Lessons learned
Partnerships
• Many hands make light work
• Real users uncover hidden assumptions
Scale
• Object size
• Number of files
• Upload and download
21. If you build it, will they come?
Angela Rizk-Jackson
UCSF CTSI
22. What will it take?
Sketch by Juliana Olivera Silva via Flickr
23. Providing Incentives: Requirements
Organization | Data Access Requirement | # UCSF Studies
Funding
NIH | Grants >$500K (2003 on), Specific programs | 318 (active projects), 693 (inactive)
NSF | All funded projects (2005 on) | 19
Foundations (e.g. Moore, Gates, Hewlett) | All funded projects | 3, 31, 19
Publishing
Nature Publishing Group (Nature, Science, etc.) | All published studies (2009-2011) | 58
Cell Press (Cell, Neuron, etc.) | All published studies (2009-2011) | 48
PNAS | All published studies (2005-2011) | 26
27. Providing Incentives: Institutional
UCLA Royce Hall photo courtesy of Adam Fagen via Flickr
• Support researcher needs
• Improved archiving efficiency
• Cost savings
28. Eliminating Barriers
1. Time / Effort
- Minimal requirements
- Specific tools (e.g. ingest)
- Integrate into existing workflow
2. Control
- Data Use Agreement
- Centralized service
3. Cultural Paradigm
- Outreach
- Demonstrate value
30. Lessons Learned
Don’t underestimate technical matters
• Separating data & metadata
Standards are not standard
• Metadata schema (Dublin Core, DataCite)
• Interpretation
Policy issues are ever-present
• Data Ownership & Data Use Agreements
• Privacy & Consent (Human subjects)
Keep in mind the entire lifecycle: ALL users
• Discoverability & interoperability
• README File
32. Discussion Topics
What incentives have you found useful to
encourage adoption of this type of resource?
Are you using data use agreements? Uniform
or individualized?
Where do you see institutional data
repositories fitting in the larger ecosystem?
33. More info
Datashare: http://datashare.ucsf.edu
CDL: http://www.cdlib.org
• Merritt: https://merritt.cdlib.org
• EZID: http://n2t.net/ezid
• XTF: http://xtf.cdlib.org
UCSF Library: http://www.library.ucsf.edu/
UCSF CTSI: http://ctsi.ucsf.edu/
NCATS – NIH Grant # UL1 TR000004
Editor's notes
Mission: enable individual researchers to share their research data sets with the global community.
A researcher at UCSF. In his work as the Principal Investigator of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) he concluded that widespread data sharing can be achieved now, with great scientific and economic benefits. All ADNI raw data is immediately shared, without embargo, with all scientists in the world. The project is very successful: more than 300 publications have resulted from use of the ADNI data resource. This success demonstrates the feasibility and benefits of sharing data.
Clinical and Translational Sciences Institute
Working together to develop a resource that meets the needs of the researcher while leveraging the
Cell Press, Nature Publishing Group, PNAS
Over 100 papers published between 2009 and 2011 in journals from 3 publishers that have data sharing requirements.
Some researchers have national repositories for their data (e.g. GenBank) while others don’t.
Campus focused on developing infrastructure for storing and analyzing data but not sharing it generally. Additionally, the current focus is on clinical data, especially anonymized data from the electronic health record, and not basic or social sciences data.
CTSI: Its mission is to accelerate the research enterprise, and it saw the sharing of data as one way to accomplish this mission.
Library: Interest in as well as an extension of the support of the open access ‘
UC3: Provide the tools to the UC community to promote digital scholarship.
Screenshot of eScholarship, running XTF
Screenshot of Datashare, running XTF
DataShare website; end user selects title
Full information on dataset; end user selects download
Data Use Agreement (DUA) for end user.
Fulfills requirements, existing and emerging
Increases visibility of work
The new TR Data Citation Index provides a mechanism to discover data for re-use in the same familiar fashion as discovering publications
Long-term preservation, easy access to your own data
Merritt repository is an active archival environment with format migration and integrity checks: a smart filing cabinet for digital assets
Centralizing resources improves efficiency by streamlining/standardizing the process and saves money in the aggregate
Currently gathering data to support this