Building (and traveling) the data-brick road: A report from the front lines of data integration

Melissa Haendel, PhD
NIH Data Commons Kickoff
2017-12-06

The data-brick road
is a means…
…not a destination

Reuse: the last (and hardest) mile
Compliance
F A I R
VS
FAIR FAIR
Reuse requires (semantic) integration

Underappreciated obstacles to integration?
FAIR-TLC
“Traceability, Licensing, and Connectedness -- OH MY!”
Traceable Licensed Connected

Traceable
• Evidence
• Provenance
• Attribution
Licensed
• Clearly stated
• Comprehensive and non-negotiated
• Accessible
• Avoid restrictions on kinds of (re)use
• Avoid restrictions on who may (re)use
Connected
• Data models
• Ontologies
• Concept alignment
• Common Data Elements
• Identifiers in service of above
Traceability and Licensure go hand-in-hand: licensing is often a (convoluted) credit hack
bit.ly/fair-tlc

A vision for the future of scholarship

Thomas Markello
Dong Chen
Justin Y. Kwan
Iren Horkayne-Szakaly
Alan Morrison
Olga Simakova
Irina Maric
Jay Lozier
Andrew R. Cullinane
Tatjana Kilo
Lynn Meister
Kourosh Pakzad
Sanjay Chainani
Roxanne Fischer
Camilo Toro
James G. White
David Adams
Cornelius Boerkoel
William A. Gahl
Cynthia J. Tifft
Meral Gunay-Aygun
Hans Goeble
Karen Balbach
Nadine Pfeifer
Sandra Werner
Christian Linden
Melissa Haendel
Peter Robinson
Chris Mungall
Sebastian Kohler
Cindy Smith
Nicole Vasilevsky
Sandra Dolken
Elizabeth Lee
Amanda Links
Will Bone
Murat Sincan
Damian Smedley
Jules Jacobson
Nicole Washington
Elise Flynn
Sebastian Kohler
Orion Buske
Marta Girdea
Michael Brudno
Jeremy Band
Melissa Haendel
David Adams
David Draper
Bailey Gallinger
Joie Davis
Nicole Vasilevsky
Heather Trang
Rena Godfrey
Gretchen Golas
Catherine Groden
Michele Nehrebecky
Ariane Soldatos
Elise Valkanas
Colleen Wahl
Lynne Wolfe
Johannes Grosse
Attila Braun
David Varga-Szabo
Niklas Beyersdorf
Boris Schneider
Lutz Zeitlmann
Petra Hanke
Patricia Schropp
Silke Mühlstedt
Carolin Zorn
Michael Huber
Carolin Schmittwolf
Wolfgang Jagla
Philipp Yu
Thomas Kerkau
Harald Schulze
Michael Nehls
Bernhard Nieswandt
Clinicians/care team, Pathologists, Ontologists, Informaticians, Curators, Basic Research
The translational workforce:
It takes a village to solve disease

GEO dataset
Gemma DRG
Genes differentially expressed: ~8,000 gene comparisons
Genes significantly expressed or unchanged: ~13,000 gene comparisons
Incongruous results
Evidence and provenance
Importance of raw data alignment
(never mind that the gene IDs had to be mapped from strings)
Increased: 1,640
Decreased: 1,110
Differential: 2,920
Increased: 4,264
Decreased: 3,833
Differential: 8,133
Both resources recorded 95% confidence intervals for significance
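
One way to quantify incongruous results between two resources is to tally, gene by gene, whether their calls agree once the gene IDs have been mapped to a shared identifier. The following is only a minimal sketch: the function name `compare_calls`, the `GENE:n` identifiers, and the call labels are hypothetical placeholders, not data from the resources named on this slide.

```python
from collections import Counter


def compare_calls(calls_a, calls_b):
    """Tally agreement between two resources' expression calls.

    Each argument maps a shared gene identifier to a call such as
    "increased", "decreased", or "unchanged" (labels are assumptions).
    """
    tally = Counter()
    # Only genes present in both resources can agree or disagree.
    for gene_id in calls_a.keys() & calls_b.keys():
        same = calls_a[gene_id] == calls_b[gene_id]
        tally["congruent" if same else "incongruent"] += 1
    # Genes seen by only one resource are reported separately.
    tally["only_in_a"] = len(calls_a.keys() - calls_b.keys())
    tally["only_in_b"] = len(calls_b.keys() - calls_a.keys())
    return dict(tally)


# Hypothetical toy input; real input would come after ID mapping.
resource_a = {"GENE:1": "increased", "GENE:2": "decreased", "GENE:3": "unchanged"}
resource_b = {"GENE:1": "increased", "GENE:2": "increased", "GENE:4": "decreased"}
result = compare_calls(resource_a, resource_b)
```

Keeping the "only in one resource" counts separate from the incongruent count makes it easier to tell a mapping gap from a genuine disagreement.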

…how many of these data are truly reusable?
Openness is assumed, but …

Most licenses are vague, non-standard, or missing
(n = 51 DBs)
Reusabledata.org

Identifiers are the invisible bedrock of all scientific inquiry;
the more complex the question,
the greater the reliance on ID hygiene
What? Why?
How?
Identifiers
Identifiers & Metadata
Identifiers & Metadata & Models
Required harmonization
Question complexity
FAIR
FAIR
How many?
FAIR

Identifier Reality: Not all IDs created equal
We need systems that accommodate the heterogeneity
Scholarly output maturity: Traditional literature, Genomic resources, Non-traditional
Identifier maturity: Persistent, Ephemeral, Non-existent
Wild west of identifier tumbleweed

(way) beyond “linkrot”:
Pain points in identifier tech/standards
• Versioning and content evolution, AKA “content drift”
• Identifier surrogacy / granularity
• Ambiguous equivalence
• Distribution (& replication) of content over multiple providers
bit.ly/evidence-of-identifier-pain

Ambiguous equivalence case 2: Eye of the beholder

Tangible, actionable community best practice
on identifiers for data integration
doi:10.1371/journal.pbio.2001414 (bit.ly/id21c-plosbio)
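
A common practice in this area is to write identifiers compactly as prefix:local-ID pairs (CURIEs) and expand them to full IRIs through an explicit prefix map. The sketch below illustrates that idea only; the `EX` prefix, the `https://example.org/records/` base, and both function names are hypothetical and are not taken from the cited paper.

```python
# Hypothetical prefix map; a real one would list each source's documented base IRI.
PREFIX_MAP = {"EX": "https://example.org/records/"}


def expand_curie(curie, prefix_map):
    """Expand 'PREFIX:local_id' to a full IRI using an explicit prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a prefix:local-ID pair: {curie!r}")
    if prefix not in prefix_map:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefix_map[prefix] + local_id


def contract_iri(iri, prefix_map):
    """Contract an IRI back to a CURIE, or return None if no base matches."""
    # Try the longest base first so nested bases resolve unambiguously.
    for prefix, base in sorted(prefix_map.items(), key=lambda kv: -len(kv[1])):
        if iri.startswith(base) and len(iri) > len(base):
            return f"{prefix}:{iri[len(base):]}"
    return None


iri = expand_curie("EX:0001", PREFIX_MAP)
curie = contract_iri(iri, PREFIX_MAP)
```

Rejecting unknown prefixes outright, rather than guessing, keeps silent mismatches from entering an integrated dataset.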

What integrators are aiming to do is non-trivial

Ambiguous equivalence case 4: Post-hoc harmonization

Ambiguous equivalence case 3: Fuzzy Match on xrefs/content
How are these 11 records for “Ehlers Danlos Syndrome” related to each other?
Narrow synonym? Broad? Exact? Child? Parent?
Bayesian models like k-BOOM can help
Mungall doi:10.1101/048843
bit.ly/xref-wildwest
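
To see why untyped xrefs are a problem, consider the naive alternative: treat every xref as exact equivalence and merge records transitively. The sketch below does just that with a small union-find. It is not k-BOOM and not a Bayesian model; the function name and the `SRC_*` record IDs are hypothetical.

```python
def cluster_by_xrefs(xrefs):
    """Group record IDs into clusters by transitive closure over xref pairs.

    This naive merge treats every xref as exact equivalence, so a single
    broad or narrow xref is enough to chain unrelated records together.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in xrefs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical xref pairs between records held by different sources.
xrefs = [("SRC_A:1", "SRC_B:7"), ("SRC_B:7", "SRC_C:3"), ("SRC_D:9", "SRC_E:2")]
clusters = cluster_by_xrefs(xrefs)
```

Here `SRC_A:1` and `SRC_C:3` land in one cluster even though no source asserted that directly, which is the kind of over-merging that typed relations (exact, broad, narrow) are meant to prevent.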

Challenges in propagating knowledge:
Different sources associate phenotypic information with different aspects of the genotype
• fgf8ati282a/ti282a;shhatbx392/+[TL]; cdkn1caMO3-cdkn1ca
• Mysm1<tm1a>/Mysm1<tm1a>[C57BL/6]
• daf-2(e1370)
• ATP1A3(NM_152296.3) [c.946G>A, p.Gly316Ser]
• tin(ABD/346)
Includes gene knockdowns
Includes genetic background
Includes multiple alleles
Single allele w/zygosity
allele
variant/mutation
gene
Transverse spirally arranged
myofibrils are almost
completely absent.
Dystonia 12

Decomposition of complex concepts allows interoperability
Human phenotype: “Palmoplantar hyperkeratosis” = increased (PATO) + keratinization (GO) + stratum corneum layer of skin (Uberon) + autopod (Uberon)
Mouse phenotype: “Ulcerated paws” =
Species neutral ontologies, homologous concepts
Need for a comprehensive and connected picture across resources

Goldilocks approach to harmonizing data dissonance

Genes + Environment = Phenotypes
We need interoperability not only for the types of things in our data…

G-P or D (disease)
• causes
• contributes to
• is risk factor for
• protects against
• correlates with
• is marker for
• modulates
• involved in
• increases susceptibility to
G-G (kind of)
• regulates
• negatively regulates (inhibits)
• positively regulates (activates)
• directly regulates
• interacts with
• co-localizes with
• co-expressed with
P/D - P/D
• part of
• results in
• co-occurs with
• correlates with
• hallmark of (P->D)
E-P
• contributes to (E->P)
• influences (E->P)
• exacerbates (E->P)
• manifest in (P->E)
G-E (kind of)
• expressed in
• expressed during
• contains
• inactivated by
…the relationships and their evidence must also be captured
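
Capturing relationships together with their evidence can be sketched as a typed association record that refuses a relation not listed for its subject/object pair and refuses a record with no evidence or source. The predicate sets below are copied from the G-P/D and E-P lists on this slide; the class name, field names, and the `GENE:1`, `DISEASE:1`, `EVIDENCE:1`, and `SOURCE:1` values are hypothetical.

```python
from dataclasses import dataclass

# Allowed relations per (subject type, object type), taken from the slide's lists.
ALLOWED_PREDICATES = {
    ("gene", "phenotype_or_disease"): {
        "causes", "contributes to", "is risk factor for", "protects against",
        "correlates with", "is marker for", "modulates", "involved in",
        "increases susceptibility to",
    },
    ("environment", "phenotype"): {"contributes to", "influences", "exacerbates"},
}


@dataclass(frozen=True)
class Association:
    subject: str
    subject_type: str
    predicate: str
    object: str
    object_type: str
    evidence: str  # what supports the claim (placeholder value below)
    source: str    # where the claim came from (provenance)

    def __post_init__(self):
        allowed = ALLOWED_PREDICATES.get((self.subject_type, self.object_type), set())
        if self.predicate not in allowed:
            raise ValueError(f"relation {self.predicate!r} not allowed for this pair")
        if not self.evidence or not self.source:
            raise ValueError("evidence and source are both required")


# Hypothetical record; all identifiers are placeholders.
assoc = Association(
    "GENE:1", "gene", "is risk factor for",
    "DISEASE:1", "phenotype_or_disease",
    evidence="EVIDENCE:1", source="SOURCE:1",
)
```

Validating at construction time means a relation without evidence never enters the integrated graph in the first place.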

Data integrators have deep knowledge to help data providers birth interoperable data

Thank you, Julie McMurry, for her illustrative gifts

Thank you for helping build the data-brick road
Melissa Haendel, PhD
@ontowonka
