Overview of three areas where the ENCODE DCC is facilitating the integration of diverse datasets: (1) defining a metadata standard, (2) using ontologies for annotation, and (3) creating a RESTful interface for data access.
1. The ENCODE metadata standard to integrate diverse experimental data sets
Eurie L. Hong, Ph.D. (@elhong)
Project Manager, ENCODE DCC
Department of Genetics • Stanford University School of Medicine
Intro to the DCC
Metadata definition
Using ontologies
Accessing metadata
2. ENCODE DCC
Galt Barber, Morgan Maddren, Nikhil Podduturi, Greg Roe, Kate Rosenbloom, Laurence Rowe
Esther Chan, Venkat Malladi, Cricket Sloan, Seth Strattan
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang
Not pictured: Tim Dreszer, Jorge Garcia, Donna Karolchik, Katrina Learned, Forrest Tanaka, Marcus Ho
Data Wranglers
Software engineers
QA, sysadmins, admin
@encodedcc
encode-help@lists.stanford.edu
https://github.com/ENCODE-DCC/encoded
3. Role of the Data Coordination Center
Production labs
Analysis groups
Role: Data generation; Data organization; Data access
Tasks: Perform assays; Data processing & validation; Web-based searches; Perform analyses; Data file storage; Data downloads; Validate data; Metadata curation; Submit data files; Submit metadata
Genome Browser
ENCODE portal (DCC)
Data files
Metadata
DCC
Integrative websites
Scientific community
4. Challenge: How do you define a metadata standard for diverse assays in multiple species?
Modified from PLoS Biol 9:e1001046, 2011 (M. Pazin)
5. Principles driving metadata definition
• Provide transparency about how experiments were performed
• Capture data provenance during analyses
• Communicate key experimental variables of an experiment
• Communicate quality metrics about the data
• Help analyze and interpret the data
• Help organize and find the data
6. Capture the experimental design
[Diagram labels: Experiment; Biological replicate 1 (Technical replicate 1, Technical replicate 2); Biological replicate 2 (Technical replicate 1, Technical replicate 2); Control 1; Control 2; Technical replicate 1; Data file; Results file]
7. Identify reusable experimental variables
Experiment with replicates (selected subset of all metadata)
Biosamples
• Type (e.g. tissue, cell line)
• Ontology term name
• Source, product id, lot id
• Treatments
• Knockdown
• Fusion construct information
• Donor or strain information
• Dates (e.g. growth, harvest, procurement)
• Passage number
• Starting amount
• Lab assigned IDs
Antibodies
• Source, product id, lot id
• Isotype
• Antigen
• Host
• Purification method
• Validation status
• NHGRI approval status
• Target
• Species
• Dbxrefs
Libraries
• Library preparation protocol
• Strand specificity
• Size selection method
• Validation document
• Lysis method
• Sonication method
• Extraction method
• Nucleic acid type
• Nucleic acid size range
+
Files (peak calls, alignment)
• Reference genome version
• Alignment software
• Software parameters
• Software version
• Quality metrics (e.g. NRF, FRiP)
8. Accession them
Experiment with replicates (ENCSR000DRY)
Biosamples: ENCBS095DKV (biosample), ENCDO826IFN (donors)
Antibodies: ENCAB964IAU
Libraries: ENCLB239KAN
+
Files: ENCFF254TDA
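The accessions on this slide appear to share one shape: the prefix "ENC", a two-letter code for the object type, three digits, and three letters. The sketch below classifies an accession by that code. Both the pattern and the prefix-to-type mapping are assumptions inferred from the six examples above, not an official specification, and `classify_accession` is a hypothetical helper name.

```python
import re

# Mapping inferred from the example accessions on the slide (an assumption,
# not an official registry of ENCODE accession prefixes).
ACCESSION_TYPES = {
    "SR": "experiment",
    "BS": "biosample",
    "DO": "donor",
    "AB": "antibody",
    "LB": "library",
    "FF": "file",
}

# Assumed shape: "ENC" + two-letter type code + three digits + three letters.
ACCESSION_RE = re.compile(r"^ENC([A-Z]{2})(\d{3})([A-Z]{3})$")


def classify_accession(accession):
    """Return the object type for an accession, or None if it is not recognized."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return None
    return ACCESSION_TYPES.get(match.group(1))


if __name__ == "__main__":
    for acc in ["ENCSR000DRY", "ENCBS095DKV", "ENCDO826IFN",
                "ENCAB964IAU", "ENCLB239KAN", "ENCFF254TDA"]:
        print(acc, classify_accession(acc))
```

Encoding the object type in the accession lets a script decide how to handle an identifier before fetching any metadata for it.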
9. Define their relationship to each other
[Diagram: Experiment, Replicate, Biosample, Donor, Antibodies, Libraries + Files, connected by "has" relationships]
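One way to picture the "has" relationships is as nested records. The sketch below is a minimal illustration only: the exact edges of the slide's diagram are not recoverable from this transcript, so the nesting chosen here (experiment has replicates, replicate has a library and an antibody, library has a biosample, biosample has a donor) is an assumption, as are all class and field names. The accessions from the previous slide are reused purely as labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Donor:
    accession: str


@dataclass
class Biosample:
    accession: str
    donor: Optional[Donor] = None  # biosample "has" donor (assumed edge)


@dataclass
class Library:
    accession: str
    biosample: Optional[Biosample] = None  # library "has" biosample (assumed edge)


@dataclass
class Replicate:
    biological_replicate: int
    technical_replicate: int
    library: Optional[Library] = None
    antibody: Optional[str] = None          # antibody accession, if any
    files: List[str] = field(default_factory=list)  # file accessions


@dataclass
class Experiment:
    accession: str
    replicates: List[Replicate] = field(default_factory=list)

    def donors(self):
        """Follow the 'has' links from the experiment down to donor accessions."""
        found = set()
        for rep in self.replicates:
            lib = rep.library
            if lib and lib.biosample and lib.biosample.donor:
                found.add(lib.biosample.donor.accession)
        return sorted(found)


# Example object graph built from the slide's example accessions.
donor = Donor("ENCDO826IFN")
biosample = Biosample("ENCBS095DKV", donor=donor)
library = Library("ENCLB239KAN", biosample=biosample)
replicate = Replicate(1, 1, library=library, antibody="ENCAB964IAU",
                      files=["ENCFF254TDA"])
experiment = Experiment("ENCSR000DRY", replicates=[replicate])
```

Because each object is accessioned once and referenced by others, the same biosample or antibody can be shared across many experiments without repeating its metadata.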
10. Challenge: Find common biosamples from data generated by two consortia
Projects are internally consistent…
356 terms: http://encodeproject.org/ENCODE/cellTypes.html
314 terms: GEO characteristics: common_name, tissue_type, cell_type, lines
11. … but only 3 biosample names match exactly between projects
360 terms: Cell type
314 terms: GEO
Exact matches: IMR90, PBMC, Th17
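The comparison described here amounts to an exact string intersection of two term lists. The sketch below uses two tiny, made-up samples (not the real 360-term and 314-term lists); only the three matching names come from the slide, and `exact_matches` is a hypothetical helper.

```python
def exact_matches(names_a, names_b):
    """Return the names that appear, character for character, in both lists."""
    return sorted(set(names_a) & set(names_b))


# Illustrative samples only; the non-matching entries are free-text names
# taken from the following slide and assigned to lists arbitrarily.
project_a = ["IMR90", "PBMC", "Th17", "HCM", "Heart_OC"]
project_b = ["IMR90", "PBMC", "Th17", "Fetal Heart", "Right Atrium"]

print(exact_matches(project_a, project_b))  # ['IMR90', 'PBMC', 'Th17']
```

Exact matching misses any pair of names that describe the same sample with different spelling, case, or granularity, which is the gap that ontology annotation is meant to close.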
12. Challenge: Find all heart-related tissues?
Heart_OC, HCF, HCFaa, HCM, Others?
Fetal Heart, Heart, Right Atrium, Right Ventricle, Others?
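If each free-text name is annotated with an ontology term, "find all heart-related tissues" becomes a walk up the term hierarchy. The sketch below uses a toy, hand-written hierarchy: the term names, the parent links, and the label-to-term annotations are all assumptions made for illustration and are not taken from a real ontology release.

```python
# Toy child -> parents hierarchy (hypothetical; stands in for is_a/part_of links).
TOY_PARENTS = {
    "right atrium": ["heart"],
    "right ventricle": ["heart"],
    "fetal heart": ["heart"],
    "cardiac fibroblast": ["heart"],
    "atrial cardiac fibroblast": ["cardiac fibroblast"],
    "cardiac myocyte": ["heart"],
    "lung fibroblast": ["lung"],
}

# Assumed annotations of the free-text biosample names to toy terms.
LABEL_TO_TERM = {
    "Heart_OC": "heart",
    "HCF": "cardiac fibroblast",
    "HCFaa": "atrial cardiac fibroblast",
    "HCM": "cardiac myocyte",
    "Fetal Heart": "fetal heart",
    "Heart": "heart",
    "Right Atrium": "right atrium",
    "Right Ventricle": "right ventricle",
    "IMR90": "lung fibroblast",
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def labels_under(root, label_to_term, parents):
    """Return labels whose term is the root or has the root as an ancestor."""
    hits = []
    for label, term in label_to_term.items():
        if term == root or root in ancestors(term, parents):
            hits.append(label)
    return sorted(hits)


heart_labels = labels_under("heart", LABEL_TO_TERM, TOY_PARENTS)
```

With this approach a single query on the root term retrieves names from both projects, regardless of how each lab spelled them.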
15. Challenge: Provide user-friendly *AND* programmatic access to the data
Metadata database (DCC)
Metadata in JSON-LD
Metadata viewed as web page
Scripts: query using REST API commands (GET, PATCH, POST)
Genome Browser
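A script-side GET might look like the sketch below, which uses only the Python standard library. The base URL, the `experiments/<accession>` path layout, and the `?format=json` query string are assumptions about the portal rather than details given on the slide; `build_metadata_request` and `fetch_metadata` are hypothetical helper names. PATCH and POST would additionally require credentials, which are not covered here.

```python
import json
import urllib.request

BASE_URL = "https://www.encodeproject.org"  # assumed portal address


def build_metadata_request(path):
    """Build (but do not send) a GET request asking for JSON metadata."""
    url = f"{BASE_URL}/{path.strip('/')}/?format=json"
    return urllib.request.Request(
        url,
        headers={"Accept": "application/json"},
        method="GET",
    )


def fetch_metadata(path):
    """Send the request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(build_metadata_request(path)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Accession taken from the earlier slide; the path prefix is an assumption.
    record = fetch_metadata("experiments/ENCSR000DRY")
    print(sorted(record.keys()))
```

Serving the same record as JSON to scripts and as a rendered page to browsers means both audiences read from one metadata database.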
17. Future directions
• Metadata definition: Finalize software and file provenance
• Ontology-based searches: Implement searches for ChIP-seq targets using GO annotations
• Programmatic access: Implement additional validations upon data submission
18. Conclusions
Intro to the DCC
Metadata definition
Using ontologies
Accessing metadata
We developed a single data model that reflects the experimental process to store the 30+ assays done by the ENCODE production labs
Using ontologies to annotate metadata provides instant interoperability with other datasets & search functionality
Application built on a REST API & JSON-LD supports programmatic querying across other scientific resources
19. Acknowledgements
Brian Lee, Nikhil Podduturi, Greg Roe, Laurence Rowe
Esther Chan, Venkat Malladi, Cricket Sloan, Seth Strattan
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
@encodedcc
encode-help@lists.stanford.edu