1. The buzz around reproducible bioscience data:
the policies, the communities and the standards
Susanna-Assunta Sansone, PhD
Principal Investigator and Team Leader,
University of Oxford e-Research Centre, Oxford, UK
Slides at:
http://www.slideshare.net/SusannaSansone
SPSAS e-SciBioEnergy: São Paulo School of Advanced Science on
e-Science for Bioenergy Research, 22-26 Oct, 2012, Campinas, Brazil
5. Oxford e-Research Centre
Providing research computing, high-performance computing
Integrating with national and international infrastructure
Supporting leading-edge facilities through education and training
6. Oxford e-Research Centre
Collaborating with European and wider
international groups in, e.g.:
• energy,
• radio astronomy,
• biological data federation,
• life sciences simulation,
• biodiversity,
• computational chemistry,
• neuroscience,
• digital humanities tools,
• digital music analysis
Research in
• computation,
• data infrastructure and analysis,
• visualisation
7. My team’s activities and groups we work with
data management and biocuration, collaborative development
of software and database, standards and ontology
• environmental genomics • stem cell discovery
• metabolomics • systems biology
• metagenomics • transcriptomics
• nanotechnology • toxicogenomics
• proteomics • environmental health
env
agro
tox/pharma
health
9. Outline
“The buzz around reproducible bioscience data:
the policies, the communities and the standards”
“The reality from the buzz:
how to deliver reproducible bioscience data”
10. Preserve
institutional /
corporate
memory
Harmonize collection across sites
Find matching studies
Data dissemination
Long-term data stewardship
14. Address
reproducibility /
reuse
of public data
Ioannidis et al., Repeatability of published microarray
gene expression analyses. Nature Genetics 41(2),
149-55 (2009) doi:10.1038/ng.295
19. COMPREHENSIBLE
INTEROPERABLE
REPRODUCIBLE
REUSABLE
http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY
20. Growing, worldwide movement for reproducible research
Shared, annotated research data and methods offer new discovery
opportunities and prevent unnecessary repetition of work.
Improved data sharing underpins science of the future
“Publicly-funded research data are a public good,
produced in the public interest”
“Publicly-funded research data should be openly available
to the maximum extent possible”
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone
21. Growing, worldwide movement for reproducible research
Obstacles:
• esoteric formats
• lack of sufficient contextual information
• ad hoc or proprietary terminology
Is the data comprehensible? interoperable? reusable? reproducible?
§ Researchers and bioinformaticians in both academic and commercial
science, along with funding agencies and publishers, embrace the
concept that community-developed standards are pivotal to structure
and enrich the annotation of
• entities of interest (e.g., genes, metabolites, phenotypes) and
• experimental steps (e.g., provenance of study materials,
technology and measurement types)
22. Structure and enrich description of the experiments
§ Describe and communicate the information in an unambiguous,
human and machine readable manner
Seven week old C57BL/6N mice were treated
with low-fat diet.
Liver was dissected out, RNA prepared…etc.
Age value
Unit
Strain name
Subject of the experiment
Type of diet and experimental condition
Type of protocol - sample treatment
Type of protocol - nucleic acid extraction
Anatomy part
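As a rough illustration of what a machine-readable version of the sentence above could look like, here is a minimal sketch. The field names simply mirror the labels on the slide; they are assumptions for illustration and do not follow any particular standard's schema.

```python
import json

# Illustrative only: keys mirror the slide's labels, not a real schema.
sample = {
    "subject": "mouse",
    "strain_name": "C57BL/6N",
    "age": {"value": 7, "unit": "week"},
    "experimental_condition": {"diet": "low-fat diet"},
    "anatomy_part": "liver",
    "protocols": [
        {"type": "sample treatment", "description": "treated with low-fat diet"},
        {"type": "nucleic acid extraction", "description": "RNA prepared from dissected liver"},
    ],
}

# The same facts can now be serialized, exchanged and queried by software.
serialized = json.dumps(sample, indent=2)
print(serialized)
```

Once the age, unit, strain and anatomy part are separate fields, a tool can filter or compare studies on them without parsing free text.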
23. Structure and enrich description of the experiments
§ Describe and communicate the information in an unambiguous,
human and machine readable manner
Figure: credit to
OBI consortium
28. Today’s bioscience research
Publications; experimental and computational data
§ Is interdisciplinary and integrative in character
• need to deal with new and existing datasets
• deal with a variety of data types
§ ‘How the organism works’ is the focus
• Twenty years ago data was the center
Source of the figure: EBI website
29.
Source: http://ebbailey.wordpress.com
www.ebi.ac.uk/net-project
30. Example from the toxicogenomics domain
Study looking at the effect of a compound inducing liver damage
by characterizing/measuring
- the metabolic profile by MS and NMR
- protein expression in liver by MS
- gene expression by DNA microarray
- conducting genetic and phenotypical analysis
Information contributing to the construction and validation of
systems biology models
31. Example of experiments by
InnoMed PredTox, an FP6 public-private consortium
32. Structured description of datasets
§ Capture all salient features of the experimental workflow
§ Make annotation explicit and discoverable
§ Structure the descriptions for consistency, tracking
• independent variables
• dependent variables
using cross-references and resolvable identifiers
33. Not too much, not too little, just ‘right’
§ We must strike a balance
between
• depth and breadth of
information; and
• sufficient information
required to reuse the data
35. Information intensive experiments
To make the experiments
comprehensible and reusable,
underpinning future
investigations, we need
common ways to report and
share the experimental details
and the associated data.
Consistent reporting will have a
positive and long-lasting impact
on the value of collective
scientific outputs.
36. Common ways to report and share
§ The challenges we face
• Large in volume: lots of data types and metadata!
• Lots of free text descriptions: hard to mine, subject to mistakes!
• Babel of terminologies: lack of definitions, hard to map!
• Heterogeneous file formats: software lock-in!
§ Need for reporting standards
• Minimal reporting descriptors
- Report the same ‘core essentials’
• Controlled vocabularies or ontologies
- Use the same word and mean the same thing
• Common exchange formats
- Make tools interoperable, allow data exchange and integration
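A minimal sketch of how the first two kinds of reporting standard could be checked by software: a checklist of 'core essentials' and a controlled vocabulary. The required fields and the allowed terms below are hypothetical and stand in for whatever a community checklist and ontology would actually define.

```python
# Hypothetical 'core essentials' and controlled terms, for illustration only.
REQUIRED_FIELDS = {"organism", "organism_part", "technology"}
CONTROLLED_TERMS = {
    "technology": {"DNA microarray", "mass spectrometry", "NMR spectroscopy"},
}

def check_report(report):
    """Return a list of problems found in a study description (a dict)."""
    problems = []
    # Minimal reporting descriptors: every required field must be present.
    for field in sorted(REQUIRED_FIELDS - set(report)):
        problems.append(f"missing required field: {field}")
    # Controlled vocabulary: values must come from the agreed list of terms.
    for field, allowed in sorted(CONTROLLED_TERMS.items()):
        value = report.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: '{value}' is not a controlled term")
    return problems

print(check_report({"organism": "Mus musculus", "technology": "microarray"}))
```

A free-text value such as "microarray" is flagged because it is not one of the agreed terms, which is the 'use the same word and mean the same thing' point made above.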
37. Reporting standards – the benefits
§ Describe and communicate the information to others, in an
unambiguous manner
§ To unlock the value in the data
• Compare, query and evaluate data
- Facilitate scientific validation of the findings
• Understand variability within/between different technologies and
protocols
- Facilitate technical validation
- Enable optimization of the experimental designs
- Identify critical checkpoints and develop quality metrics
§ To define submission and/or publication requirements
• Journals
• Databases
§ To ensure data integrity, reproducibility and (re)use
38. Escalating number of standardization efforts in bioscience,
e.g.:
• Genome annotation: www.geneontology.org
• Genomics Standards Consortium (GSC): gensc.org
• Functional Genomics Data Society (FGED): www.fged.org
• Enzymology data standards: www.strenda.org
• HUPO-Proteomics Standards Initiative (PSI): http://www.psidev.info
• Systems modelling standards: www.co.mbine.org
• Cheminformatics: www.ebi.ac.uk/chebi
• Pathways: www.biopax.org
• Metabolomics Standards Initiative (MSI): http://www.metabolomicssociety.org
39. Different community, different norms and standards, e.g.:
• allow data to flow from one system to another
• report the same core, essential information
• use the same word and refer to the same ‘thing’
Challenges:
lack of coordination, fragmentation and uneven coverage
40. Is this ‘general mobilization’ good or bad?
• allow data to flow from one system to another
• report the same core, essential information
• use the same word and refer to the same ‘thing’
§ Difference in structures and processes:
• organization types (open, closed to members, society, WG…)
• standards development (how to design, develop, evaluate, maintain…)
• adoption, uptake, outreach (link to journals, funders, commercial sector…)
• funds (sponsors, memberships, grants, volunteering…)
41. Is this ‘general mobilization’ good or bad?
• allow data to flow from one system to another
• report the same core, essential information
• use the same word and refer to the same ‘thing’
§ Fragmentation of the standards is a major issue
• Being focused on particular communities’ interests, be they individual
technologies or biological/biomedical disciplines, leads to duplication of effort
and, more seriously, the development of (largely arbitrarily) different standards
• This severely hinders the interoperability of databases and tools and ultimately
the integration of datasets
42. Fragmentation of the databases and data, e.g.
Access
Storage
Submission
Three EBI
omics systems
45. Fragmentation of the databases and data, e.g.
Three EBI omics systems:
• Access: DIFFERENT download formats
• Storage: DIFFERENT
- Core requirements represented
- Representation of the studies and related samples
- Curation practices
• Submission: DIFFERENT formats, terminologies and tools
46. To integrate data we need interoperable standards
Biologically-delineated views of the world: epidemiology, plant biology, microbiology
Generic features (common core)
- description of source biomaterial
- experimental design components
Technologically-delineated views of the world: e.g. transcriptomics, metabolomics
(arrays, scanning, gels, columns, MS, NMR, FTIR)
47. Need to address the fragmentation
§ Promote synergies
• Among basic academic (omics) research but also regulatory- or
healthcare-driven initiatives
§ Much could be learned from exchange of ideas and practices
• Although, regulatory- or healthcare-driven initiatives have far stricter
guidelines
• Although, often SDOs have ‘closed’ discussions, require membership
§ Create interoperable standards
• Fit neatly into a jigsaw, resolving inconsistency and filling gaps
§ Overcome several barriers
• Technical
• Funding issue
• Sociological......
48. Eloquent quotes
“Biologists would rather share their toothbrush
than their gene name”
Michael Ashburner, Professor of Genetics,
University of Cambridge, UK
“Any customer can have a car painted any
colour that he wants so long as it is black”
Henry Ford, you know who he is…
50. Standards – an old issue, e.g. engineering in 1850
§ Buying nuts and bolts is easy today
• But in the 19th century it was very complicated!
§ Nuts and bolts were custom made
• Products from different shops were incompatible
• Craftsmen liked the monopoly
- Customers were ‘locked in’ !!
§ In 1864 William Sellers initiated the standardization
• Mass production
• Get interchangeable parts
• Standardized way to make nuts and bolts
§ Generally adopted only after WWII, though …. !!
51. Social engineering
52. Ownership of open standards
can be problematic in broad,
grass-roots collaborations; it
requires improved models to
encourage maintenance of and
contributions to these efforts,
supporting their evolution
53. The extensive community
liaison needs to be managed
and funded; rewards and
incentives need to be identified
for all contributors
54. The cost of implementing a
standards-supported data
sharing vision is as large as the
number of stakeholders that
must operate synchronously
55. 1. Funders actively developing data policies
§ Several data preservation, management and sharing policies have
emerged in response to increased funding for omics domains
§ Even if only in general terms, standards are recognized as necessary ‘tools’ to
unambiguously represent, describe and communicate research data
56.
57. 2. Similar trend in the regulatory arena
§ “… lack of standardized data affects CDER’s review processes by curtailing a
reviewer’s ability to perform integral tasks such as rapid acquisition, storage,
analysis......efficient management of a portfolio of standards projects will
require coordinated efforts and clear roles for multiple participants within/outside
FDA”
58.
59. 3. Publishers have become strong advocates
§ Continue to support the development of open standards and tools
• to support sharing of sufficiently well annotated datasets
• to enable comprehensible, reusable, reproducible research
60. ….the rise of data-driven journals, e.g.:
partnering with:
62. 4. Similar trend in the commercial sector
§ R&D has invested heavily in procedures and tools that integrate external
information with their own data to enhance the decision-making process
• Now joining forces to streamline non-competitive elements of the life
science workflow by the specification of common standards, business
terms, relationships and processes
63. ....their information landscape is evolving
Figure: the information landscape yesterday, today and tomorrow, linking a big life
science company with proprietary content providers, public content providers, CROs,
academic groups, regulatory authorities, service providers and software vendors
Yesterday / Today / Tomorrow
• Innovation model: innovation inside / searching for innovation / heterogeneity of collaborations; part of the wider ecosystem
• IT: internal apps & data / struggling with change, security and trust / cloud, services
• Data: mostly inside / in and out / distributed
• Portfolio: internally driven and owned / partially shared / shared portfolio
Credit to: Pistoia Alliance
67. Take home messages
“The buzz around reproducible bioscience data:
the policies, the communities and the standards”
• Contribute to the reproducible research movement
• Learn about open community-standards in your area
• Consider data science as a career path
68. Outline
“The buzz around reproducible bioscience data:
the policies, the communities and the standards”
“The reality from the buzz:
how to deliver reproducible bioscience data”
69. How do we achieve this? Is it possible to achieve a common,
structured representation of diverse bioscience experiments
that:
• follows the appropriate community standards and
• delivers COMPREHENSIBLE, INTEROPERABLE, REPRODUCIBLE, REUSABLE research?
72. But how much do we know about these standards?
MAGE-Tab, AAO, MIAME, GCDML, MIAPA, CHEBI, SRAxml, OBI, MIRIAM, VO,
SOFT, MIQAS, FASTA, PATO, MIX, CML, ENVO, REMARK, DICOM, MIGEN,
GELML, MOD, SBRML, MIAPE, MIQE, TEDDY, MITAB, MzML, XAO, CIMR,
CONSORT, BTO, ISA-Tab, SEDML, DO, PRO, IDO, MIASE, MISFISHIE…
73. But how much do we know about these standards?
• Which tools and databases implement which standards?
• I use high throughput sequencing technologies, which ones are applicable to me?
• How can I get involved to propose extensions or modifications?
• What are the criteria to evaluate their status and value?
• Which ones are mature enough for me to use or recommend?
• I work on plants, are these just for biomedical applications?
74. But how much do we know about these standards?
§ A bewildering array of standards is available, but
• these are hard to find, at different levels of maturity; in
some areas duplications or gaps in coverage also exist
§ Standards are just a ‘means to an end’, therefore
• we want to make them discoverable and accessible,
maximizing their use to assist the virtuous data cycle,
from generation to standardization through publication to
subsequent sharing and reuse
76. Towards Lego-like ontologies
§ Compound terms should be formed out of simpler constituents:
• Body weight
weight (quality ontology, PATO)
that
inheres_in (relation ontology, RO)
whole_organism (anatomy ontology, CARO)
• Xylene contaminated soil
soil (environmental ontology, EnvO)
that
has_contaminated (relation ontology, RO)
xylene (chemical ontology, ChEBI)
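The composition pattern above can be sketched in a few lines. This is only an illustration of the 'Lego-like' idea, using the labels shown on the slide; the class names are hypothetical and real ontologies express such definitions in languages like OWL or OBO rather than in Python.

```python
from typing import NamedTuple

class Term(NamedTuple):
    label: str
    ontology: str  # source ontology of the simple term

class CompoundTerm(NamedTuple):
    head: Term
    relation: Term
    filler: Term

    def describe(self) -> str:
        # Reads like the slide: "<head> that <relation> <filler>"
        return f"{self.head.label} that {self.relation.label} {self.filler.label}"

    def ontologies(self) -> list:
        return [self.head.ontology, self.relation.ontology, self.filler.ontology]

body_weight = CompoundTerm(
    head=Term("weight", "PATO"),
    relation=Term("inheres_in", "RO"),
    filler=Term("whole_organism", "CARO"),
)
print(body_weight.describe(), body_weight.ontologies())
```

Because each constituent keeps a pointer to its source ontology, the compound term stays traceable to the simpler terms it was built from.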
78. § Serves researchers, biocurators, journal editors
and reviewers, and funders to
§ discover checklists for a particular domain
§ monitor progress of extant efforts
§ facilitate collaborations
80. A catalogue to map the
landscape of standards and the
systems implementing them:
Over 400 bio-standards
(public and in curation)
Field*, Sansone* et al., Omics data sharing. Science
326, 234-36 (2009) doi:10.1126/science.1180598
81. • A coherent, curated and searchable catalogue of data sharing resources
• Bioscience standards and associated data-sharing policies, publications, tools and databases
• Assessment criteria for usability and popularity of standards
• Relationships among standards
• Encouragement for communication & interaction among groups
• Promoting interoperability & informed decisions about standards
82. Smith et al, 2007
83. Smith et al, 2007
Taylor, Field, Sansone et al, 2008
84. List of databases, linked to standards: a collaboration with Database Issue
87. Major challenge: define ‘relations’ among standards
The relationship among popular standard formats for pathway information:
BioPAX and PSI-MI are designed for data exchange to and from databases and
pathway and network data integration. SBML and CellML are designed to
support mathematical simulations of biological systems and SBGN represents
pathway diagrams.
CREDIT: Demir, et al., The BioPAX community standard for pathway data sharing, 2010.
93. An exemplar approach to the status quo
§ A grass-root collaborative that works to facilitate collection, curation
and sharing of experiments using a common, structured representation
of the experiments that
• transcends individual biological and technological domains and
• can be ‘configured’ to implement (several of) the community
standards
Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann
S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B,
Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S,
Evelo C, Forster M, Gaudet P, Gilbert J, Goble C, Griffin J, Jacob D, Kleinjans J, Harland
L, Haug K, Hermjakob H, Sui S, Laederach A, Liang S, Marshall S, Merrill E, McGrath A,
Reilly D, Roux M, Shamu C, Shang C, Steinbeck C, Trefethen A, Williams-Jones B,
Wolstencroft K, Xenarios J, Hide W.
Towards interoperable bioscience data. Feb 2012. doi:10.1038/ng.1054
www.biosharing.org www.isacommons.org
95. metadata tracking framework
user community
96. General-purpose, configurable format,
designed to support the use of several
standards checklists, terminologies and
conversions to (a growing number of) other
metadata formats, used by public
repositories, e.g.
MAGE-Tab Pride-xml
SRA-xml SOFT
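To give a feel for what a tab-delimited, spreadsheet-like description looks like, here is a heavily simplified sketch that writes one sample row. The column headers follow the general ISA-Tab style (for example "Source Name" and "Characteristics[...]"), but the selection of columns, the values and the helper function are assumptions for illustration; the ISA-Tab specification remains the authority on the actual layout.

```python
import csv
import io

# Simplified, illustrative column set; not a complete or validated ISA-Tab file.
HEADERS = [
    "Source Name",
    "Characteristics[organism]",
    "Characteristics[strain]",
    "Protocol REF",
    "Sample Name",
    "Factor Value[diet]",
]
ROWS = [
    ["mouse_01", "Mus musculus", "C57BL/6N", "sample collection", "mouse_01_liver", "low-fat diet"],
]

def write_table(headers, rows):
    """Serialize a header row and data rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()

table = write_table(HEADERS, ROWS)
print(table)
```

Keeping the description in plain tabular text is what makes it practical to convert to other metadata formats, as the slide notes.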
97. ISA software suite: supporting standards-compliant experimental
annotation and enabling curation at the community level
(Rocca-Serra et al, 2010)
a collaborative effort of international research/service groups:
University of Oxford, EBI, Harvard School of Public Health, NERC Environmental
Bioinformatics Centre, Genomic Standards Consortium, US FDA Center for
Bioinformatics, Leibniz Institute of Plant Biochemistry and more….
98.
99. Create template(s) to fit the type of
experiments to be described
Create templates detailing the steps to be
reported for different investigations, complying
with community standards, e.g. configuring the
value(s) allowed for each field to be
• text (with/without regular expression testing),
• ontology terms,
• numbers etc.
1
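The idea of configuring the values allowed for each field can be sketched as follows. The template structure, field names and rules are hypothetical; this is not the ISA configurator's actual file format, only an illustration of the three value types listed above.

```python
import re

# Hypothetical template: field name -> rule for the values it accepts.
TEMPLATE = {
    "Sample Name": {"kind": "text", "pattern": r"[A-Za-z0-9_-]+"},
    "Organism part": {"kind": "ontology", "terms": {"liver", "kidney"}},
    "Age": {"kind": "number"},
}

def is_valid(field, value):
    """Check a string value against the rule configured for its field."""
    rule = TEMPLATE[field]
    if rule["kind"] == "text":
        pattern = rule.get("pattern")  # text may be free or regex-constrained
        return pattern is None or re.fullmatch(pattern, value) is not None
    if rule["kind"] == "ontology":
        return value in rule["terms"]
    if rule["kind"] == "number":
        try:
            float(value)
        except ValueError:
            return False
        return True
    raise ValueError(f"unknown rule kind: {rule['kind']}")

print(is_valid("Sample Name", "liver_01"), is_valid("Age", "seven"))
```

In practice the list of ontology terms would come from an ontology lookup rather than a hard-coded set.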
100. Describe, curate your experiment
with geographically distributed
collaborators
Report and edit the description of the
investigation using customized Google Spreadsheets
(importing the ‘template’ created by the ISA
configurator) enabled with ontology search and
term-tagging features.
2a
101. Or describe, curate your experiment
using a desktop-based tool
Report and edit the description using this tool,
(also customized using the templates) with a
spreadsheet like look and feel, packed with
functionalities such as
• ontology search (access via )
• term-tagging features
• import from spreadsheets etc…
2b
102. ISMB tag:
#PP44
To mint DOIs
empowering researchers to use standards
103. Perform data analysis
We are building relevant ISA modules for GenomeSpace,
R-based BioConductor and Galaxy tools
3
105. Share your experiments with the
world as Linked Open Data
Through conversion to RDF; work in
collaboration with the W3C HCLSIG
4
Tim Berners-Lee’s 5-star deployment scheme for Linked Open Data
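As a rough sketch of what 'conversion to RDF' means, the snippet below turns a few facts about a sample into N-Triples statements. The namespace and property names are placeholders invented for this example; they are not the vocabulary used by the ISA tools, and a real conversion would reuse terms from published ontologies.

```python
BASE = "http://example.org/isa/"  # placeholder namespace (an assumption)

def iri(local_name):
    # Local names are assumed to be IRI-safe (no spaces or angle brackets).
    return f"<{BASE}{local_name}>"

def literal(text):
    # Escape backslashes and double quotes inside a plain string literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def to_ntriples(subject, facts):
    """Emit one subject-predicate-object statement per fact."""
    return [
        f"{iri(subject)} {iri(predicate)} {literal(value)} ."
        for predicate, value in facts.items()
    ]

statements = to_ntriples("sample1", {"strain": "C57BL/6N", "organism_part": "liver"})
print("\n".join(statements))
```

Each line is an independent statement, which is what lets descriptions from different sources be merged and linked.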
106. Submit your experiments to public repositories
Directly in ISA-Tab or reformatting using the ISAconverter
5
107. Create your own repository
Store the investigations in the database, assign access rights and
conduct maintenance tasks.
Share, browse, query and view investigations, their
descriptions and access associated data files.
6
108. Maguire E, Rocca-Serra P, Sansone SA, Davies J and Chen M.
Taxonomy-based Glyph Design -- with a Case Study on Visualizing
Workflows of Biological Experiments,
IEEE Transactions on Visualization and Computer Graphics, volume 18, 2012
(in press)
109. A growing ecosystem of over 30 public and internal resources
using the ISA metadata tracking framework (ISA-Tab and/or
format) to facilitate standards-compliant collection, curation,
management and reuse of investigations in an increasingly diverse
set of life science domains, including:
• environmental health • stem cell discovery
• environmental genomics • systems biology
• metabolomics • transcriptomics
• metagenomics • toxicogenomics
• nanotechnology • also by communities working to build
• proteomics a library of cellular signatures
113. Implementation at the EBI
115. Extensions of the
Nanotechnology
Informatics Working Group
116. Open source code
Development timeline, 2007 to 2012
§ Community involvement and uptake
• 1st, 2nd and 3rd ISA-Tab workshops
• Other tools implement ISA-Tab
• User workshops/visits start
• 1st public instance: Harvard Stem Cell Discovery Engine
• Growing number of systems start to adopt the ISA framework
§ Core developments
• Strawman ISA-Tab spec
• Final ISA-Tab spec
• ISA software v1
• Database instance at EBI
• Conversions to Pride-XML/SRA-XML/MAGE-Tab and more
• RDF format starts
• Links to analysis tools start
§ Publications
• Workshop reports
• Omics data sharing (Science)
• ISA-Tab and ISA software suite (Bioinformatics)
• Stem Cell Discovery Engine (NAR)
• ISA Commons (Nature Genetics)
117. Final remarks
“The buzz around reproducible bioscience data:
the policies, the communities and the standards”
“The reality from the buzz:
how to deliver reproducible bioscience data”
118. Your research and all (publicly
funded) research should
make an … impact
http://www.flickr.com/photos/equinoxefr/2620239993/ CC BY
119. …..the biggest possible impact!
http://www.flickr.com/photos/webhamster/2582189977/ CC BY
121. We must increase the level of annotation
Notes in lab books (information for humans);
spreadsheets and tables (the compromise);
facts as RDF statements (information for machines)
• Invest in curating and managing data at the source using:
• a common metadata tracking framework, such as ISA
• publicly available and community-developed terminologies
• recording sufficient contextual information of the experimental steps
§ Progressively datasets will become more comprehensible, interoperable,
reproducible and (re)usable, underpinning future investigations