Varsha Khodiyar, PhD
Data Curation Editor, Scientific Data
Nature Publishing Group
@varsha_khodiyar
@scientificdata
Tweet with #SDJPN16
Gaining credit for sharing research data
Data publishing with Scientific Data
RIKEN Center for Life Science Technologies 4th March 2016
My background
• Joined Scientific Data in October 2014
• Professional data curator since 2003
• PhD in Molecular Biology from the University of
Leicester
• Contributed to the Human Genome Project as
member of the Human Gene Nomenclature
Committee (HGNC)
• Gene Ontology curator for 8 years, at University
College London, UK
• 3 years of open data publishing experience
Generating research data is expensive
Just 18.1% of NIH grant applications were funded in 2014*
• Hours spent writing grants?
• Hours spent reviewing grants?
Resources are finite/expensive
• Modified animals
• Specialized reagents
Time and effort taken in the laboratory to generate
good, valid data
* report.nih.gov/success_rates/Success_ByIC.cfm
Irreproducibility of published science
Figure 1 - Ioannidis JPA. et al. Repeatability of published microarray gene
expression analyses. Nature Genetics 41, 149–55 (2009) doi:10.1038/ng.295
Withholding data impacts on human health
Clinical study reports, detailed data and software code available at Dryad
Digital Repository doi:10.5061/dryad.bv8j6 and www.Study329.org
Sharing data promotes
• Diversity of analyses and opinion
• New research
  • testing of new hypotheses
  • new analysis methods
  • meta-analyses to create new datasets
  • studies on data collection methods
• Education of new researchers
• Increased return on investment in research
Vickers AJ: Whose data set is it anyway? Sharing raw data from randomized trials. Trials 2006, 7:15
Hrynaszkiewicz I, Altman DG: Towards agreement on best practice for publishing raw clinical trial data. Trials 2009, 10:17
Researchers already share data
• Most researchers are sharing
data, and using the data of
others
• Direct contact between
researchers (on request) is a
common way of sharing data
• Repositories are second most
common method of sharing
Kratz and Strasser (2015) doi: 10.1371/journal.pone.0117619
Some problems…
• Sharing upon request relies heavily on trust
• Informally stored data associated with published works disappears at a
rate of ~17% per year (Vines et al. 2014; doi: 10.1016/j.cub.2013.11.014)
• Datasets not referenced in a manuscript are essentially invisible (a.k.a. “dark data”)
• If data are available, they are often not interpretable or reusable
because sufficient detail is not included
• Data producers do not get appropriate credit for their work
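To give a feel for what a ~17% yearly loss means over time, here is a minimal sketch. It treats the figure as a constant loss that compounds each year, which is a simplifying assumption for illustration and may not match how the cited study models the decline. The function name and the chosen time points are hypothetical.

```python
def fraction_remaining(years: int, annual_loss: float = 0.17) -> float:
    """Fraction of datasets still obtainable after `years`.

    Assumes a constant annual loss that compounds, i.e. (1 - loss) ** years.
    This is an illustrative simplification, not the model used in the cited study.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_loss) ** years


if __name__ == "__main__":
    # Print the remaining fraction at a few arbitrary time points.
    for years in (1, 5, 10, 20):
        print(f"{years:>2} years: {fraction_remaining(years):.1%} remaining")
```

Under this assumption, well under half of informally stored datasets would still be obtainable five years after publication.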
Credit – Scholarly credit for publishing data; all publications are indexed
and citeable.
Reuse – Standardized and detailed descriptions enable easier reuse of
published research data.
Quality – Rigorous peer-review on technical quality and reusability.
Editorial Board of experts in their field maintain community standards.
Discovery – Curated, machine-readable metadata for dataset discovery.
Validated links to published data in each article.
Open – Use of CC-BY licence for articles and CC0 for metadata. Promote
use of open licences for published data.
Service – Commitment to excellent service for authors and readers.
Data Descriptors have human and machine readable
components
Human readable representation of study, i.e. article (HTML & PDF)
Machine readable representation of study, i.e. metadata
Synthesis
Analysis
Conclusions
What did I do to generate the data?
How was the data processed?
Where is the data?
Who did what and when?
Methods and technical analyses supporting the quality of the measurements.
Do not contain tests of new scientific hypotheses
Comparison of Data Descriptor to traditional article
What types of data can be published?
• Decades old dataset
• Standalone dataset
• Data that has been used in an analysis article
• Large consortium dataset
• Data from a single experiment
• Data that the researcher finds valuable and that others might find useful too
• Data associated with a high impact analysis article
When can a Data Descriptor be published?
Data Descriptors can be submitted and published at any point in the research workflow, i.e. whenever it makes most sense for your data:
• Before the analysis has been published
• Publication alongside analysis article
• After data analysis has been published
• Authors not intending to analyse data
Scientific Data’s Repository List
Browse our recommended data repositories online.
• We currently list almost 80 repositories, across biological, medical,
physical and social sciences
• When required, we provide guidance to authors on the best place to
store their data
www.nature.com/sdata/data-policies/repositories
Metadata at Scientific Data
• We want to capture metadata about the dataset being described in each Data Descriptor
• The manuscript captures human readable metadata needed for data reuse
• The curated metadata records capture machine readable metadata needed for machine based data discovery
ISA-Tab format for machine readable metadata
• Study workflow
• Key sample characteristics
needed for data discovery
• Relates samples to data files
• Shows location of dataset
• Uses controlled vocabularies
and ontologies (where
possible)
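The points above can be made concrete with a small sketch. The table below is a heavily simplified, hypothetical example in the spirit of ISA-Tab: real ISA-Tab metadata is split across separate investigation, study and assay files, whereas here everything is merged into one tab-delimited table. The column headers are modelled on ISA-Tab style; the sample names, values and file names are invented for illustration, as are the helper functions.

```python
import csv
import io

# Hypothetical, simplified tab-delimited table (not a complete or valid ISA-Tab file).
SAMPLE_TABLE = (
    "Sample Name\tCharacteristics[organism]\tCharacteristics[organism part]\tRaw Data File\n"
    "sample_1\tHomo sapiens\tliver\tsample_1_run1.dat\n"
    "sample_1\tHomo sapiens\tliver\tsample_1_run2.dat\n"
    "sample_2\tMus musculus\tbrain\tsample_2_run1.dat\n"
)


def files_by_sample(table_text):
    """Relate each sample to the data files derived from it."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    mapping = {}
    for row in reader:
        mapping.setdefault(row["Sample Name"], []).append(row["Raw Data File"])
    return mapping


def samples_matching(table_text, characteristic, value):
    """Find samples by a key characteristic, as a discovery tool might."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    column = f"Characteristics[{characteristic}]"
    return sorted({row["Sample Name"] for row in reader if row[column] == value})


if __name__ == "__main__":
    print(files_by_sample(SAMPLE_TABLE))
    print(samples_matching(SAMPLE_TABLE, "organism", "Mus musculus"))
```

Because each row ties a sample and its characteristics to a data file, a program can answer questions such as "which files come from mouse samples?" without reading the article text.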
Use of community endorsed ontologies and controlled
vocabularies
Controlled vocabulary = list of standardized phrases of scientific concepts
Ontology = controlled vocabulary with defined relationships between terms
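The difference between the two definitions can be sketched as data structures. This is an illustrative toy, not an excerpt from any real ontology: the terms, the single "is_a" relationship and the function name are all assumptions chosen for the example.

```python
# Controlled vocabulary: a flat set of standardized terms, with no relationships.
controlled_vocabulary = {"cell", "epithelial cell", "hepatocyte"}

# Ontology: the same terms, plus defined relationships between them.
# Here each term maps to its direct "is_a" parents (hypothetical hierarchy).
ontology_is_a = {
    "hepatocyte": ["epithelial cell"],
    "epithelial cell": ["cell"],
    "cell": [],
}


def ancestors(term):
    """Return every broader term reachable from `term` via 'is_a' links."""
    found = []
    stack = list(ontology_is_a.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(ontology_is_a.get(current, []))
    return found


if __name__ == "__main__":
    # The vocabulary can only confirm that a term is valid.
    print("hepatocyte" in controlled_vocabulary)
    # The ontology can also say what the term is a kind of.
    print(ancestors("hepatocyte"))
```

The relationships are what allow a search for a broad term such as "cell" to also retrieve datasets annotated with a more specific term such as "hepatocyte".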
Structured Summary table from curated metadata
Investigation file
Study file
Sample characteristics reported in Structured Summary table:
Organism
Organism part
Cell line
Geographical location
Environment type
Citing my own data
1. In the article text
2. In the Data Citation section
Citing data I’ve reused
1. In the article text
2. In the References section
Clinical researchers support sharing, but…
Rathi V, Dzara K, Gross CP, Hrynaszkiewicz I, Joffe S, Krumholz HM, Strait KM, Ross JS:
Sharing of clinical trial data among trialists: a cross sectional survey. BMJ 2012;345:e7570
• Sharing de-identified data via repositories should be
required (236 respondents, 74%)
• Investigators should share de-identified data on request
(229 respondents, 72%)
…clinical data producers have specific concerns
Rathi V, Dzara K, Gross CP, Hrynaszkiewicz I, Joffe S, Krumholz HM, Strait KM, Ross JS: Sharing of
clinical trial data among trialists: a cross sectional survey. BMJ 2012;345:e7570
Example initiatives for sharing clinical data
Yale Open Data Access (YODA) & Clinical Study Data
Request (CSDR) projects:
• Data Use Agreements (DUAs)
• Controlled access environment
• Scientific validity of reanalysis checked
• Independent governance
• Data anonymisation checks
http://yoda.yale.edu/
https://www.clinicalstudydatarequest.com/
Clinical data publication at Scientific Data
• Identify repositories able to archive clinical data
• Work with identified repositories to establish workflows for
peer review and publication, whilst maintaining patient
privacy
• Facilitate specialist peer review process for clinical data, for
example ensure peer reviewers have agreed to terms of data
use agreement
Hrynaszkiewicz, I., Khodiyar, V., Hufton, A. & Sansone, S. A. Publishing descriptions of non-
public clinical datasets: guidance for researchers, repositories, editors and funding
organisations. BioRxiv http://dx.doi.org/10.1101/021667 (2015).
Data reuse by other researchers in the same field
“The Data Descriptor made it easier
to use the data, for me it was critical
that everything was there…all the
technical details like voxel size.”
Professor Daniele Marinazzo
Data reuse by the non-research community
http://www.nytimes.com/interactive/2014/12/30/science/history-of-ebola-in-24-outbreaks.html
Data Descriptors…
• …enable you to gain scholarly credit for your data gathering
efforts.
• …are human AND machine readable.
• …can be published with, or independently of, an analysis article.
• …can be published at any point in the research workflow.
• …allow the publication and discovery of clinical data, whilst
maintaining your patients' privacy.
• …result in greater reuse and citation by fellow members of your
research community.
• …extend the impact of your research data by enabling access to
and reuse by the non-research community.