The document discusses the ISA infrastructure, which provides a framework for tracking metadata in bioscience experiments from data collection to sharing in linked data clouds. The infrastructure includes a metadata syntax, open source software tools, and a user community. It allows annotation of experimental metadata, materials, and processes using ontologies to make semantics explicit and enable integration and knowledge discovery. The infrastructure is growing with over 30 public and private resources adopting it to facilitate standards-compliant sharing of investigations across life science domains.
1. The ISA Infrastructure for the biosciences: from data curation at source to the linked data cloud
Alejandra Gonzalez-Beltran
University of Oxford e-Research Centre, UK
Alejandra.GonzalezBeltran@oerc.ox.ac.uk
Conference on Semantics in Healthcare and Life Sciences (CSHALS)
Boston, USA Feb 27- Mar 1 2013
2. Outline
• The infrastructure: a metadata tracking framework in the biosciences: the format, a set of open source software tools and the user community
• The syntax and its implicit semantics
• The component of the infrastructure for mapping the syntax to ontologies
• A couple of mappings, architecture, conversion
5. Need for a generic representation, applied to:
• microarray based experiments (MAGE)
• sequencing based experiments (SRA)
• flow cytometry based experiments (FuGE-Flow Cyt)
• mass spectrometry and NMR spectroscopy experiments (MetaboLights and PRIDE)
6. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level
Rocca-Serra et al, 2010 Bioinformatics
• Assist in the annotation and management of experimental metadata at source, supporting data provenance tracking
• Deal with high-throughput studies using one or a combination of omics and other technologies
• Empower users to take up community-defined checklists and ontologies
• Facilitate data sharing, re-use, comparison and reproducibility of experiments, submission to international public repositories
7. Towards interoperable bioscience data
Sansone et al, 2012 Nature Genetics
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains.
10.
Sample Name | Material Type | Hybridization Assay Name | Array Design REF | Array Data File | Protocol REF | Derived Array Data File
sample1 | genomic DNA | assay1 | A-AFFY-107 | assay1.cel | data normalization | assay1.txt
sample2 | genomic DNA | assay2 | A-AFFY-107 | assay2.cel | data normalization | assay2.txt
sample3 | genomic DNA | assay3 | A-AFFY-107 | assay3.cel | data normalization | assay3.txt
Material transformations...
[Diagram labels: Material Node; Data File Node (Raw Data File, Derived Data File); Protocol; Process; Characteristics[…]; Material Type; Factor Value[…] (independent variables); Comment[…]; Parameter Value[…]; Performer (operator effect); Date (day effect)]
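The table on this slide can be read row by row as a path from a material to its data files. The following is a minimal Python sketch of that reading, assuming a tab-delimited assay table with the column headers shown on the slide. The function name `trace_rows` and the inline text are illustrative only and are not part of the ISA software suite.

```python
import csv
import io

# Inline stand-in for an ISA-TAB assay file; values copied from the slide.
ASSAY_TAB = (
    "Sample Name\tMaterial Type\tHybridization Assay Name\tArray Design REF\t"
    "Array Data File\tProtocol REF\tDerived Array Data File\n"
    "sample1\tgenomic DNA\tassay1\tA-AFFY-107\tassay1.cel\tdata normalization\tassay1.txt\n"
    "sample2\tgenomic DNA\tassay2\tA-AFFY-107\tassay2.cel\tdata normalization\tassay2.txt\n"
    "sample3\tgenomic DNA\tassay3\tA-AFFY-107\tassay3.cel\tdata normalization\tassay3.txt\n"
)

# Columns that name a node (material, assay or data file), left to right.
NODE_COLUMNS = [
    "Sample Name",
    "Hybridization Assay Name",
    "Array Data File",
    "Derived Array Data File",
]


def trace_rows(text):
    """Return, for each row, the chain of node names from sample to derived file."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [[row[col] for col in NODE_COLUMNS] for row in reader]


paths = trace_rows(ASSAY_TAB)
for path in paths:
    print(" -> ".join(path))
```

The remaining columns (Material Type, Array Design REF, Protocol REF) qualify the nodes or the step between them, which is why they are left out of the chain here.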
11. Tagging: from free text to ontology-based
• single intervention representation, free text annotation
Source Name | Characteristics[organism] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration]
individual1 | human | aspirin | high dose | 12 weeks
• single intervention, ontology-based annotation
Source Name | Characteristics[organism (obi:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354
Factor Value[dose (OBI_0000984)] | Term Source REF | Term Accession Number | Factor Value[time (PATO_0000165)] | Unit | Term Source REF | Term Accession Number
low dose | LNC | LP30872-3 | 12 | week | UO | 0000034
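In the ontology-based layout, the Term Source REF and Term Accession Number headers repeat, so a value and its annotation can only be paired by column position. A minimal Python sketch of that pairing follows, assuming each qualifier column applies to the nearest preceding non-qualifier column. The function `group_annotations` and the dictionary keys are hypothetical names chosen for this illustration.

```python
# Qualifier headers and the key under which their value is stored.
QUALIFIERS = {"Term Source REF": "source", "Term Accession Number": "accession"}


def group_annotations(headers, values):
    """Attach each qualifier column to the closest preceding non-qualifier column."""
    entries = []
    for header, value in zip(headers, values):
        if header in QUALIFIERS:
            if not entries:
                raise ValueError(f"{header} has no column to qualify")
            entries[-1][QUALIFIERS[header]] = value
        else:
            # A Unit column is kept as its own entry, so the qualifiers that
            # follow it describe the unit rather than the numeric value.
            entries.append({"header": header, "value": value})
    return entries


headers = [
    "Source Name",
    "Characteristics[organism]", "Term Source REF", "Term Accession Number",
    "Factor Value[time]", "Unit", "Term Source REF", "Term Accession Number",
]
values = ["individual1", "Homo sapiens", "NCBITax", "9606", "12", "week", "UO", "0000034"]

entries = group_annotations(headers, values)
for entry in entries:
    print(entry)
```

Pairing by position rather than by header name is the design choice here: a dictionary keyed on header text would silently overwrite the first annotation with the second.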
12. ToxBank effort developed by Nina Jeliazkova
Health Care & Life Sciences Interest Group
Kohonen et al. The ToxBank Data Warehouse: a research cluster of 7 EU FP7 Health systems toxicology and toxicogenomics projects.
13. • Make the semantics of ISA-TAB explicit, including materials & data entities & processes & their relationships
• Provide incentives for provision of ontology-based annotations in ISA-TAB datasets; exploit those annotations
• Augment ISA syntax with new elements (e.g. groups), facilitating the understanding & querying of experimental design
• Facilitate data integration & knowledge discovery/reasoning
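One way to picture the "groups" element mentioned above is as sets of samples that share the same combination of factor values. The Python sketch below works under that assumption, which is an interpretation and not a definition taken from the slides. The sample names, factor names and the function `study_groups` are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: (sample name, {factor name: factor value}).
SAMPLES = [
    ("sample1", {"agent": "aspirin", "dose": "low dose"}),
    ("sample2", {"agent": "aspirin", "dose": "high dose"}),
    ("sample3", {"agent": "aspirin", "dose": "low dose"}),
]


def study_groups(samples):
    """Group sample names by their full combination of factor values."""
    groups = defaultdict(list)
    for name, factors in samples:
        # Sort the items so that the key does not depend on dictionary order.
        key = tuple(sorted(factors.items()))
        groups[key].append(name)
    return dict(groups)


groups = study_groups(SAMPLES)
for key, names in groups.items():
    print(dict(key), names)
```

Making the groups explicit in this way is what allows a query such as "all samples in the low dose arm" to be answered without re-deriving the design from individual rows.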
15. OntoMaton: a BioPortal powered ontology widget for Google Spreadsheets
Maguire et al, 2013 Bioinformatics
• Ontology search and automated tagging (relying on NCBO BioPortal services) on Google Spreadsheets
• Collaborative annotation; support for distributed users
• Version control & history
16.
17. vocabularies
Chemical domain, Biomolecular domain, Information domain, Experimental domain
Source Name | Characteristics[organism (obi:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354
18. Open Biological and Biomedical Ontologies (OBO) Foundry
BFO, ChEBI, GO, IAO, OBI
Source Name | Characteristics[organism (obi:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354
21. faahKO dataset
Available in Bioconductor (with ISA-TAB metadata)
Global metabolite profiling
Data subset: LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice
22.
23. • support different conversion modes (different levels of granularity)
• querying for ISA-TAB datasets, across multiple experiment types
• reasoning exploiting ontology annotations
• semantic validation of ISA-TAB datasets
• augmented annotation over native ISA syntax
• identification of gaps in ontological representations
• feedback of findings to community ontologies
24. Increasing level of structure for experimental metadata
• Notes in Lab books
• Spreadsheets & Tables
• Facts as RDF statements (ISA-TAB metadata)
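To make the last step of this progression concrete, the sketch below writes two facts about `individual1` as N-Triples lines using only the Python standard library. It is an illustration under stated assumptions, not the conversion used by the ISA tools: the `http://example.org/isa/` namespace and the `hasOrganism` property are invented for the example, and the object URI assumes the OBO PURL pattern for NCBI Taxonomy identifier 9606.

```python
EX = "http://example.org/isa/"  # hypothetical namespace, for illustration only
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OBO = "http://purl.obolibrary.org/obo/"  # assumed OBO PURL prefix


def to_ntriples(subject_uri, statements):
    """Serialise (predicate, object, kind) tuples as N-Triples lines.

    kind is "uri" for a resource object and "literal" for a plain string.
    """
    lines = []
    for predicate_uri, obj, kind in statements:
        if kind == "uri":
            obj_term = f"<{obj}>"
        else:
            # Escape backslashes and double quotes inside the literal.
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            obj_term = f'"{escaped}"'
        lines.append(f"<{subject_uri}> <{predicate_uri}> {obj_term} .")
    return lines


triples = to_ntriples(
    EX + "individual1",
    [
        (RDFS_LABEL, "individual1", "literal"),
        (EX + "hasOrganism", OBO + "NCBITaxon_9606", "uri"),
    ],
)
print("\n".join(triples))
```

Once the organism is a URI and not the string "human", two datasets annotated by different people can be joined on it without text matching.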