The document summarizes a presentation about making scientific data FAIR (Findable, Accessible, Interoperable, Reusable). It discusses the concept of FAIR data and several of the presenter's related projects. Examples are provided of using standards like ISA-Tab to structure metadata and make datasets interoperable. The presentation outlines the presenter's roles in data capture, publication, and standards development efforts to promote FAIR data principles. Scientific Data, a new journal for peer-reviewed data descriptions, is introduced as a way to make datasets more discoverable and reusable.
1. Data Consultant,
Honorary Academic Editor
Associate Director,
Principal Investigator
The rise of the data-centric research and publication enterprises
Susanna-Assunta Sansone, PhD
RIKEN Yokohama, 25 June, 2014
http://www.slideshare.net/SusannaSansone
2. Outline
• About myself
o activities and interests
• FAIR data
o concept
o my related projects
• Scientific Data
o rationale
o Data Descriptors
o examples
3. My areas of activity:
• Data capture and curation
• Data (nano)publication
• Data provenance
• Open, community ontologies and standards
• Semantic web
• Software development
• Training
Communities I work with/for:
As part of:
• UK, European and international consortia
• Pre-competitive informatics public-private partnerships
• Standardization initiatives
with e.g.:
4. Notes in Lab Books
(information for humans)
Spreadsheets and Tables
(the compromise)
Facts as RDF statements
(information for machines)
Notes and narrative; spreadsheets and tables; linked data and nanopublications
Enabling reproducible research and open science, driving science and discoveries
Increase the level of annotation at the source, tracking provenance and using community standards
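As a minimal sketch of what "facts as RDF statements" can look like, the snippet below turns a lab-book style note into subject-predicate-object triples and serializes them as N-Triples lines. The `http://example.org/` namespace, the `hasOrganism` and `hasTissue` predicates and the sample itself are assumptions made up for illustration; they are not taken from the presentation or from any real vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the thing being described
    predicate: str  # URI of the property
    obj: str        # either a URI or a plain literal value


# Hypothetical namespace, used only for this illustration.
EX = "http://example.org/"


def is_uri(value: str) -> bool:
    """Treat anything starting with http(s):// as a URI, everything else as a literal."""
    return value.startswith("http://") or value.startswith("https://")


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as a single N-Triples line."""
    if is_uri(triple.obj):
        obj = f"<{triple.obj}>"
    else:
        # Escape backslashes and double quotes inside literals.
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


# The note "sample 12 is liver tissue from a mouse" as two machine-readable facts.
facts = [
    Triple(EX + "sample/12", EX + "hasOrganism", EX + "organism/mouse"),
    Triple(EX + "sample/12", EX + "hasTissue", "liver"),
]

ntriples = "\n".join(to_ntriples(t) for t in facts)
```

Each line of `ntriples` is one self-contained statement, which is what lets a machine merge facts from different sources without parsing free text.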
7. Worldwide movement for FAIR data
• Researchers and bioinformaticians in both academic and commercial science, along with funding agencies and publishers, embrace the concept that both
o DATA: entities of interest, e.g., genes, metabolites, phenotypes, and
o METADATA: experimental steps, e.g., provenance of study materials, technology and measurement types
should be Findable, Accessible, Interoperable and Reusable
8. The International Conference on Systems Biology (ICSB), 22-28 August, 2008, Susanna-Assunta Sansone, www.ebi.ac.uk/net-project
sample characteristic(s)
experimental design
experimental variable(s)
technology(s)
measurement(s)
protocols(s)
data file(s)
......
9.
• To make this dataset ‘FAIR’, one must have tools, standards and best practices to:
o report sufficient details
o capture all salient features of the experimental workflow
o make annotation explicit and discoverable
o structure the descriptions for consistency
o ensure/regulate access
o deposit and publish
o etc.
10.
11. General-purpose, configurable format, designed to:
• describe the experimental metadata, making the annotation explicit and discoverable
• track provenance
• use community standards, such as minimal reporting guidelines and terminologies
• be converted to a growing number of other metadata formats, e.g. those used by EBI repositories
[Diagram labels: analysis method, script, data file or record in a database]
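To illustrate how a tabular metadata format of this kind can make annotation explicit and support provenance tracking, here is a minimal sketch that writes a small tab-delimited table and traces a data file back to its source material. The column names are loosely modelled on ISA-Tab naming and have not been checked against the specification; the sample names, protocol names and file names are invented.

```python
import csv
import io

# Column names loosely follow ISA-Tab conventions (assumption, not validated
# against the specification). All values below are invented.
COLUMNS = [
    "Source Name",
    "Characteristics[organism]",
    "Protocol REF",
    "Sample Name",
    "Factor Value[dose]",
    "Raw Data File",
]

rows = [
    {
        "Source Name": "mouse_01",
        "Characteristics[organism]": "Mus musculus",
        "Protocol REF": "liver extraction",
        "Sample Name": "liver_01",
        "Factor Value[dose]": "0 mg",
        "Raw Data File": "run_01.raw",
    },
    {
        "Source Name": "mouse_02",
        "Characteristics[organism]": "Mus musculus",
        "Protocol REF": "liver extraction",
        "Sample Name": "liver_02",
        "Factor Value[dose]": "10 mg",
        "Raw Data File": "run_02.raw",
    },
]


def to_tab_delimited(columns, records) -> str:
    """Write the records as a tab-delimited table with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def provenance(table_text: str, data_file: str):
    """Trace a data file back to its source, protocol and sample, or return None."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for record in reader:
        if record["Raw Data File"] == data_file:
            return (
                record["Source Name"],
                record["Protocol REF"],
                record["Sample Name"],
            )
    return None


table = to_tab_delimited(COLUMNS, rows)
```

Because every row links source, protocol, sample and data file, a call such as `provenance(table, "run_02.raw")` can answer "where did this file come from?" without consulting a lab book.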
12.
13. ISA powers data collection, curation resources and repositories, e.g.:
14. Reporting standards and data interoperability
• Including minimum information reporting requirements, or checklists, to report the same core, essential information
• Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’
• Including conceptual model, conceptual schema from which an exchange format is derived, to allow data to flow from one system to another
Community-developed standards are pivotal to structure and enrich the description of datasets and to share them, facilitating understanding and reuse
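The first two kinds of standard can be sketched as a simple check: a checklist says which fields must be reported, and a controlled vocabulary says which words may be used. The field names and term lists below are hypothetical stand-ins chosen for illustration, not an actual minimum information guideline or ontology.

```python
# Hypothetical checklist: the core fields every record must report.
REQUIRED_FIELDS = {"organism", "tissue", "technology", "measurement"}

# Hypothetical controlled vocabularies for two of those fields.
CONTROLLED_TERMS = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "technology": {"mass spectrometry", "DNA microarray"},
}


def check_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Checklist: report the same core, essential information.
    for field in sorted(REQUIRED_FIELDS - set(record)):
        problems.append(f"missing required field: {field}")
    # Vocabulary: use the same word to refer to the same thing.
    for field, allowed in CONTROLLED_TERMS.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: '{value}' is not a controlled term")
    return problems


good = {
    "organism": "Mus musculus",
    "tissue": "liver",
    "technology": "mass spectrometry",
    "measurement": "metabolite profiling",
}
bad = {"organism": "mouse", "tissue": "liver", "technology": "mass spectrometry"}

good_problems = check_record(good)
bad_problems = check_record(bad)
```

Here `bad` fails twice: it omits `measurement`, and it says "mouse" where the vocabulary expects "Mus musculus", which is the kind of inconsistency that blocks integration across datasets.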
16. Which standards and databases can we use/recommend?
I work in the field of cell migration research, which ones are applicable to me?
I use cell migration in translational research, are there specific clinical standards?
17.
18. Registering and cataloging is just step one; the next ones are:
• Develop assessment criteria for usability and popularity of standards
• Associate standards with data policies and databases
• Assemble journal and funder policies regarding data storage
• Make fully cross-searchable
• Intended goal: help stakeholders make informed decisions
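A minimal sketch of what "associate standards with data policies and databases" and "make fully cross-searchable" could mean in practice is shown below. Every entry in the registry (standard names, databases, policies) is a hypothetical placeholder, and the `Standard` class and `search` function are illustrative names, not part of any existing registry.

```python
from dataclasses import dataclass, field


@dataclass
class Standard:
    name: str
    domain: str
    databases: list = field(default_factory=list)  # databases implementing it
    policies: list = field(default_factory=list)   # policies recommending it


# Hypothetical registry entries, for illustration only.
REGISTRY = [
    Standard(
        name="Standard A",
        domain="cell migration",
        databases=["Database X"],
        policies=["Journal Q data policy"],
    ),
    Standard(
        name="Standard B",
        domain="metabolomics",
        databases=["Database Y", "Database X"],
        policies=["Funder R data policy"],
    ),
]


def search(registry, term: str) -> list:
    """Return names of standards whose name, domain, databases or policies match."""
    needle = term.lower()
    hits = []
    for standard in registry:
        haystack = [
            standard.name,
            standard.domain,
            *standard.databases,
            *standard.policies,
        ]
        if any(needle in text.lower() for text in haystack):
            hits.append(standard.name)
    return hits


by_domain = search(REGISTRY, "cell migration")
by_database = search(REGISTRY, "database x")
```

Searching by domain answers the researcher's question ("which ones are applicable to me?"), while searching by database or policy serves curators, journals and funders from the same records.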
19. Outline
• About myself
o activities and interests
• FAIR data
o concept
o my related projects
• Scientific Data
o rationale
o Data Descriptors
o examples
20. FAIR data - roles and responsibilities
• Data has to become an integral part of scholarly communication
• Responsibilities lie across several stakeholder groups: researchers, data centers, librarians, funding agencies and publishers
• But publishers occupy a “leverage point” in this process
22. Helping you publish, discover and reuse research data
Visit
nature.com/scientificdata
Email
scientificdata@nature.com
Tweet
@ScientificData
Supported by:
Honorary Academic Editor
Susanna-Assunta Sansone, PhD
Managing Editor
Andrew L Hufton, PhD
Editorial Curator
Victoria Newman
Advisory Panel and Editorial Board
including senior researchers, funders,
librarians and curators
23. Launched on May 27th, 2014
A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these
• Credit for sharing your data
• Focused on reuse and reproducibility
• Peer reviewed, curated
• Promoting Community Data Repositories
• Open Access
24. Data Descriptor: narrative and structure
• Article or narrative component (PDF and HTML)
• Experimental metadata or structured component (in-house curated, machine-readable formats)
25. Data Descriptor: narrative
Sections:
• Title
• Abstract
• Background & Summary
• Methods
• Technical Validation
• Data Records
• Usage Notes
• Figures & Tables
• References
• Data Citations
Focus on data reuse
Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.
Does not contain tests of new scientific hypotheses
In traditional publications this information is not provided in a sufficiently detailed manner. However, this information is essential for understanding, reusing, and reproducing datasets
26. Data Descriptor: narrative
Sections:
• Title
• Abstract
• Background & Summary
• Methods
• Technical Validation
• Data Records
• Usage Notes
• Figures & Tables
• References
• Data Citations
Focus on data reuse
Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.
Does not contain tests of new scientific hypotheses
Joint Declaration of Data Citation Principles by the Data Citation Synthesis Group, incl.:
- CODATA
- Research Data Alliance (RDA)
- Force11
27. Data Descriptor: structure (CC0)
In-house curation team:
• assists users in submitting the structured content via simple templates and an internal authoring tool
• performs value-added semantic annotation of the experimental metadata
For advanced users/service providers willing to export ISA-Tab for direct submission, we will release a technical specification
[Diagram labels: analysis method, script, data file or record in a database]
28. Export to various formats (ISA-Tab, RDF, etc.)
Linking between research papers, Data Descriptors, and data records
Making data discoverable
29. Helping authors find the right place for the data
[Chart: recognized repositories by category; segment values as extracted: 2, 4, 3, 10, 4, 1, 4, 3, 4]
DNA and protein sequence
Functional genomics
Genetic association and genome variation
Metagenomics
Molecular interactions
Organism- or disease-specific
Proteomics
Taxonomy and species diversity
Traces and sequencing reads
“Omics” is emphasized among basic life-sciences repositories
• We currently recognize over 50 public data repositories
• We have integrated systems with both:
30. Data repositories - criteria
1. Broad support and recognition within their scientific community
2. Ensure long-term persistence and preservation of datasets in their published form
3. Provide expert curation
4. Implement relevant, community-endorsed reporting requirements
5. Provide for confidential review of submitted datasets
6. Provide stable identifiers for submitted datasets
7. Allow public access to data without unnecessary restrictions
31. Open Access - APC supported
Data: the primary datasets will reside in public repositories. Partnering with FigShare and Dryad, which are both CC0
Data Descriptor - structured component (ISA-Tab): as NPG has already done with its existing Linked Data Portal, the metadata about Data Descriptors in Scientific Data will be CC0
Data Descriptor - narrative component: describes the methodology of data generation/collection and processing; will be licensed under either of the following, by author choice:
32. Relation with traditional articles - content and time
Scientific hypotheses:
• Synthesis
• Analysis
• Conclusions
Methods and technical analyses supporting the quality of the measurements:
• What did I do to generate the data?
• How was the data processed?
• Where is the data?
• Who did what, when?
BEFORE: get your data to the community as soon as possible (see NPG pre-publication policy)
AT THE SAME TIME: publish your Data Descriptor(s) alongside research article(s)
AFTER: expand on your research articles, adding further information for reuse of the data
33. Current content is diverse – bimonthly releases
• Neuroscience, ecology, epidemiology, environmental science, functional genomics, metabolomics, toxicology
• New datasets and previously published datasets
• Datasets in figshare, OpenfMRI, GEO, GenomeRNAi, ArrayExpress and MetaboLights
• Code deposited in figshare and GitHub
• Individual datasets, compendium and citizen science
• First dataset part of a collection
• Academic and industry authors
40. Peer review process focused on quality and reuse
Evaluation is not based on the perceived impact or novelty of the findings
• Experimental Rigour and Technical Data Quality
o Were the data produced in a rigorous and methodologically sound manner?
o Was the technical quality of the data supported convincingly with technical validation experiments and statistical analyses of data quality or error, as needed?
o Are the depth, coverage, size, and/or completeness of these data sufficient for the types of applications or research questions outlined by the authors?
• Completeness of the Description
o Are the methods and any data-processing steps described in sufficient detail to allow others to reproduce these steps?
o Did the authors provide all the information needed for others to reuse this dataset or integrate it with other data?
o Is this Data Descriptor, in combination with any repository metadata, consistent with relevant minimum information or reporting standards?
• Integrity of the Data Files and Repository Record
o Have you confirmed that the data files deposited by the authors are complete and match the descriptions in the Data Descriptor?
o Have these data files been deposited in the most appropriate available data repository?
41. Interested in collaborating and/or submitting?
• Do you run a data resource we should recognize?
o See the list of criteria databases should meet on our website
• Are you interested in facilitating submission to us?
o See our ISA-Tab specification on the website
- you can implement and export in this format from your authoring/curation tool, or from your database
• Do you want to submit Data Descriptor(s)?
o Check suitability by sending a pre-submission enquiry; we accept:
- Submissions in the life, environmental and biomedical sciences, but not limited to these
- Experimental, observational and computational datasets
- Individual datasets, curated aggregations, and collections
- Unpublished data and follow-ups, with additional information for wider reuse, e.g.:
✓ a fuller, more in-depth look at the data processing steps, supported by additional data files and code from each step
✓ additional tutorial-like information for scientists interested in reusing or integrating the data with their own