The NIH Data Commons must treat the data it will contain not unlike the mortar and stones of a road. To help our fellow scientist travelers use the road, we must engineer for heavy traffic and diverse destinations. There are many steps to architecting a robust and persistent road. First, the data must be sourced and manipulated into common data models. This requires versioned access to the data, determining the equivalence of identifiers within the data or minting new ones for the data and/or within it, and manipulating the data according to common data models, because each source models the data differently (e.g. a genotype-to-phenotype association in one source may relate a variant to a disease, where in another it may be a set of alleles associated with a set of phenotypes). Inclusion of the data in the Commons must meet all licensing restrictions, which are varied and usually poorly declared, as well as security, HIPAA, and ethics requirements. Software tools are needed to perform the Extract-Transform-Load (ETL) process on a regular cycle to keep the data current, and to assess changes and assure quality over time. For records that disappear, there needs to be a way to keep an archive of them. Once in the Commons, the data requires a map to navigate the roads: where do you want to go? Indexing and search across the data require the data to be self-reporting: loading the ontologies used in the data for indexing and providing faceted query over these and other attributes, sophisticated text mining tools, relevance ranking, and equivalency and similarity determination across different providers. Once the data are found, the users need vehicles to drive upon the road. These are their workspaces, the place where they design and implement the operations they need in order to get where they want to go. 
Unimaginable scientific emeralds are to be found at the end of the road, as the sum of all the data, if well integrated and made computationally reusable, has proven to be well beyond the sum of its parts in getting us where we want to go.
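The transform step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the two source layouts (`source_a` with one variant and one disease per row, `source_b` with a set of alleles and a set of phenotypes per row), the `Association` class, the field names, and the identifiers used in any example are all hypothetical and are not taken from a Data Commons specification.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Association:
    """Hypothetical common model: a set of genetic entities linked to a set of conditions."""
    subjects: FrozenSet[str]    # variant or allele identifiers
    conditions: FrozenSet[str]  # disease or phenotype identifiers
    source: str                 # provider the record came from
    source_version: str         # release that was transformed (versioned access)


def from_source_a(rows: Iterable[dict], version: str) -> List[Association]:
    """Assumed layout: each row relates one variant to one disease."""
    return [
        Association(frozenset([r["variant"]]), frozenset([r["disease"]]), "source_a", version)
        for r in rows
    ]


def from_source_b(rows: Iterable[dict], version: str) -> List[Association]:
    """Assumed layout: each row relates a set of alleles to a set of phenotypes."""
    return [
        Association(frozenset(r["alleles"]), frozenset(r["phenotypes"]), "source_b", version)
        for r in rows
    ]


def disappeared(previous: List[Association], current: List[Association]) -> List[Association]:
    """Records present in the previous load but absent from the current one.

    The version is left out of the comparison key so that an unchanged record
    is not reported as missing just because a new release was loaded.
    """
    current_keys = {(a.source, a.subjects, a.conditions) for a in current}
    return [a for a in previous if (a.source, a.subjects, a.conditions) not in current_keys]
```

On each ETL cycle, the list returned by `disappeared` is what would be written to an archive rather than silently dropped.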
Building (and traveling) the data-brick road: A report from the front lines of data integration
1. Building (and traveling) the data-brick road: A report from the front lines of data integration
Melissa Haendel, PhD
NIH Data Commons Kickoff
2017-12-06
6. Traceable Licensed Connected
bit.ly/fair-tlc
Connected:
• Data models
• Ontologies
• Concept alignment
• Common Data Elements
• Identifiers in service of above
Traceable:
• Evidence
• Provenance
• Attribution
Licensed:
• Clearly stated
• Comprehensive and non-negotiated
• Accessible
• Avoid restrictions on kinds of (re)use
• Avoid restrictions on who may (re)use
Traceability and Licensure go hand-in-hand: Licensing often a (convoluted) credit hack
9. Thomas Markello
Dong Chen
Justin Y. Kwan
Iren Horkayne-Szakaly
Alan Morrison
Olga Simakova
Irina Maric
Jay Lozier
Andrew R. Cullinane
Tatjana Kilo
Lynn Meister
Kourosh Pakzad
Sanjay Chainani
Roxanne Fischer
Camilo Toro
James G. White
David Adams
Cornelius Boerkoel
William A. Gahl
Cynthia J. Tifft
Meral Gunay-Aygun
Hans Goeble
Karen Balbach
Nadine Pfeifer
Sandra Werner
Christian Linden
Melissa Haendel
Peter Robinson
Chris Mungall
Sebastian Kohler
Cindy Smith
Nicole Vasilevsky
Sandra Dolken
Elizabeth Lee
Amanda Links
Will Bone
Murat Sincan
Damian Smedley
Jules Jacobson
Nicole Washington
Elise Flynn
Sebastian Kohler
Orion Buske
Marta Girdea
Michael Brudno
Jeremy Band
Melissa Haendel
David Adams
David Draper
Bailey Gallinger
Joie Davis
Nicole Vasilevsky
Heather Trang
Rena Godfrey
Gretchen Golas
Catherine Groden
Michele Nehrebecky
Ariane Soldatos
Elise Valkanas
Colleen Wahl
Lynne Wolfe
Johannes Grosse
Attila Braun
David Varga-Szabo
Niklas Beyersdorf
Boris Schneider
Lutz Zeitlmann
Petra Hanke
Patricia Schropp
Silke Mühlstedt
Carolin Zorn
Michael Huber
Carolin Schmittwolf
Wolfgang Jagla
Philipp Yu
Thomas Kerkau
Harald Schulze
Michael Nehls
Bernhard Nieswandt
Clinicians/care team, Pathologists, Ontologists, Informaticians, Curators, Basic Research
The translational workforce: It takes a village to solve disease
10. GEO dataset
Gemma DRG
Genes differentially expressed: ~8,000 gene comparisons
Genes significantly expressed or unchanged: ~13,000 gene comparisons
Incongruous results
Evidence and provenance
Importance of raw data alignment
(never mind that the gene IDs had to be mapped from strings)
Increased: 1,640; Decreased: 1,110; Differential: 2,920
Increased: 4,264; Decreased: 3,833; Differential: 8,133
Both resources recorded 95% confidence intervals for significance
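One way to make "incongruous results" measurable is to tally, gene by gene, where two resources agree on the same dataset. The sketch below assumes each resource's calls have already been mapped into a shared gene identifier space and reduced to one of three labels; the function name, the labels, and any gene identifiers used with it are hypothetical.

```python
from collections import Counter
from typing import Dict


def compare_calls(calls_a: Dict[str, str], calls_b: Dict[str, str]) -> Counter:
    """Tally agreement between two resources' expression calls.

    Each mapping is gene identifier -> "increased", "decreased", or "unchanged".
    Genes reported by only one resource are counted separately, since they
    cannot be compared at all.
    """
    tally = Counter()
    for gene in calls_a.keys() & calls_b.keys():
        tally["agree" if calls_a[gene] == calls_b[gene] else "disagree"] += 1
    tally["only_a"] = len(calls_a.keys() - calls_b.keys())
    tally["only_b"] = len(calls_b.keys() - calls_a.keys())
    return tally
```

The `only_a` and `only_b` counts are where unmapped or string-only gene identifiers tend to surface.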
11. …how many of these data are truly reusable?
Openness is assumed, but …
12. Most licenses are vague, non-standard, or missing
(n = 51 DBs)
Reusabledata.org
13. Identifiers are the invisible bedrock of all scientific inquiry; the more complex the question, the greater the reliance on ID hygiene
[Figure: required harmonization versus question complexity. Question labels: How many? What? Why? How? Harmonization levels: Identifiers; Identifiers & Metadata; Identifiers & Metadata & Models. The label FAIR appears three times.]
14. Identifier Reality: Not all IDs created equal
We need systems that accommodate the heterogeneity
[Figure: identifier maturity (Persistent, Ephemeral, Non-existent) versus scholarly output maturity (Traditional Literature, Non-Traditional). Annotations: Genomic resources; Wild west of identifier tumbleweed.]
15. (way) beyond “linkrot”: Pain points in identifier tech/standards
• Versioning and content evolution, AKA “content drift”
• Identifier surrogacy / granularity
• Ambiguous equivalence
• Distribution (& replication) of content over multiple providers
bit.ly/evidence-of-identifier-pain
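Content drift, the first pain point above, can at least be detected by fingerprinting what each identifier resolves to in successive snapshots. The sketch below is a minimal illustration, assuming each snapshot is available as a mapping from identifier to record text; the function names and the whitespace normalization rule are assumptions, not an established standard.

```python
import hashlib
from typing import Dict, List


def fingerprint(content: str) -> str:
    """Hash whitespace-normalized content so trivial reformatting is not reported as drift."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def compare_snapshots(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots (identifier -> content) from one provider."""
    shared = old.keys() & new.keys()
    return {
        # same identifier, different content: "content drift"
        "drifted": sorted(i for i in shared if fingerprint(old[i]) != fingerprint(new[i])),
        # identifier no longer resolves: candidate for archiving
        "disappeared": sorted(old.keys() - new.keys()),
        # identifier minted since the previous snapshot
        "new": sorted(new.keys() - old.keys()),
    }
```

A check like this only says that something changed; deciding whether the change alters the meaning of the record still needs versioning metadata from the provider.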
21. Ambiguous equivalence case 3: Fuzzy Match on xrefs/content
How are these 11 records for “Ehlers Danlos Syndrome” related to each other?
Narrow synonym? Broad? Exact? Child? Parent?
Bayesian models like k-BOOM can help
Mungall doi:10.1101/048843
bit.ly/xref-wildwest
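To see why xrefs alone are ambiguous, it helps to look at what the naive approach does. The sketch below is not k-BOOM; it is a plain union-find that treats every xref as an exact equivalence and groups records into cliques. The function name and any identifiers passed to it are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def xref_cliques(xrefs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group identifiers connected by xrefs, assuming every xref means 'exactly the same'."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in xrefs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    return sorted(sorted(members) for members in groups.values())
```

Because a single broad or narrow xref is enough to merge two cliques, this approach tends to collapse parents, children, and synonyms into one group, which is the failure that probabilistic methods are meant to avoid.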
22. Challenges in propagating knowledge:
Different sources associate phenotypic information with different aspects of the genotype
fgf8ati282a/ti282a;shhatbx392/+[TL];
cdkn1caMO3-cdkn1ca
Mysm1<tm1a>/Mysm1<tm1a>[C57BL/6]
daf-2(e1370)
ATP1A3(NM_152296.3)[c.946G>A, p.Gly316Ser]
tin(ABD/346)
Includes gene knockdowns
Includes genetic background
Includes multiple alleles
Single allele w/zygosity
allele
variant/mutation
gene
23. Challenges in propagating knowledge:
Different sources associate phenotypic information with different aspects of the genotype
fgf8ati282a/ti282a;shhatbx392/+[TL];
cdkn1caMO3-cdkn1ca
Mysm1<tm1a>/Mysm1<tm1a>[C57BL/6]
daf-2(e1370)
ATP1A3(NM_152296.3)[c.946G>A, p.Gly316Ser]
tin(ABD/346)
Includes gene knockdowns
Includes genetic background
Includes multiple alleles
Single allele w/zygosity
allele
variant/mutation
gene
Transverse spirally arranged myofibrils are almost completely absent.
Dystonia 12
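One way to keep this heterogeneity from being lost during integration is to record, for every association, which aspect of the genotype the source actually annotated, and to propagate to the gene level only when that is unambiguous. The sketch below is an assumption-laden illustration: the `Level` names follow the labels on the slide, but the class, the single-gene propagation rule, and the idea that the involved genes are supplied by curation are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Set, Tuple


class Level(Enum):
    GENE = "gene"
    VARIANT = "variant/mutation"
    ALLELE = "allele"
    GENOTYPE = "genotype"


@dataclass(frozen=True)
class PhenotypeAssociation:
    subject: str             # label exactly as given by the source
    level: Level             # aspect of the genotype the source annotated
    genes: Tuple[str, ...]   # genes involved (assumed to come from curation)
    phenotype: str


def phenotypes_by_gene(associations: Iterable[PhenotypeAssociation]) -> Dict[str, Set[str]]:
    """Roll phenotypes up to genes, skipping subjects that involve several genes.

    With more than one gene involved, the phenotype cannot be attributed
    to any single gene without further evidence, so it is not propagated.
    """
    by_gene: Dict[str, Set[str]] = defaultdict(set)
    for assoc in associations:
        if len(assoc.genes) == 1:
            by_gene[assoc.genes[0]].add(assoc.phenotype)
    return dict(by_gene)
```

Keeping the original `subject` and `level` alongside the rolled-up view means the gene-level summary can always be traced back to what the source asserted.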
24. Decomposition of complex concepts allows interoperability
[Figure: the human phenotype “Palmoplantar hyperkeratosis” and the mouse phenotype “Ulcerated paws” are each decomposed (=) into species-neutral ontology terms for homologous concepts. Terms shown: increased (PATO), Stratum corneum layer of skin (Uberon), Autopod, keratinization (GO).]
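The benefit of decomposition can be illustrated by comparing phenotypes on their component terms instead of their labels. In the sketch below, the human decomposition uses the terms shown on the slide; the mouse decomposition is an assumption made up for illustration, and Jaccard overlap stands in for the ontology-aware similarity measures a real system would use.

```python
from typing import FrozenSet


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of component terms the two phenotypes have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Component terms as shown on the slide.
palmoplantar_hyperkeratosis = frozenset(
    {"increased", "keratinization", "stratum corneum layer of skin", "autopod"}
)

# ASSUMPTION: illustrative decomposition, not taken from the slide or from an ontology.
ulcerated_paws = frozenset({"ulcerated", "skin", "autopod"})

shared_terms = palmoplantar_hyperkeratosis & ulcerated_paws
similarity = jaccard(palmoplantar_hyperkeratosis, ulcerated_paws)
```

The two labels share no words, yet the decomposed forms meet at the species-neutral term `autopod`. With ontology reasoning, related anatomy terms such as the two skin terms here could also be matched, which plain set overlap cannot do.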
28. G-P or D (disease)
• causes
• contributes to
• is risk factor for
• protects against
• correlates with
• is marker for
• modulates
• involved in
• increases susceptibility to
G-G (kind of)
• regulates
• negatively regulates (inhibits)
• positively regulates (activates)
• directly regulates
• interacts with
• co-localizes with
• co-expressed with
P/D - P/D
• part of
• results in
• co-occurs with
• correlates with
• hallmark of (P->D)
E-P
• contributes to (E->P)
• influences (E->P)
• exacerbates (E->P)
• manifest in (P->E)
G-E (kind of)
• expressed in
• expressed during
• contains
• inactivated by
…the relationships and their evidence must also be captured
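A record structure that carries the relationship and its evidence together might look like the sketch below. The relation lists for the two category pairs are copied from the slide; everything else (the class, the field names, the category codes as dictionary keys, and the validation rules) is a hypothetical illustration, and only two of the slide's groups are included to keep it short.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# Relation vocabulary per (subject category, object category), from the slide.
ALLOWED_RELATIONS: Dict[Tuple[str, str], FrozenSet[str]] = {
    ("G", "P/D"): frozenset({
        "causes", "contributes to", "is risk factor for", "protects against",
        "correlates with", "is marker for", "modulates", "involved in",
        "increases susceptibility to",
    }),
    ("G", "G"): frozenset({
        "regulates", "negatively regulates", "positively regulates",
        "directly regulates", "interacts with", "co-localizes with",
        "co-expressed with",
    }),
}


@dataclass(frozen=True)
class Statement:
    subject: str
    subject_category: str
    relation: str
    obj: str
    object_category: str
    evidence: str   # e.g. an evidence code or description
    source: str     # provenance: where the statement came from


def problems(stmt: Statement) -> List[str]:
    """List reasons a statement is not ready to load; an empty list means it passes."""
    found = []
    allowed = ALLOWED_RELATIONS.get((stmt.subject_category, stmt.object_category))
    if allowed is None:
        found.append("no vocabulary for this category pair")
    elif stmt.relation not in allowed:
        found.append("relation not in vocabulary")
    if not stmt.evidence.strip():
        found.append("missing evidence")
    if not stmt.source.strip():
        found.append("missing source")
    return found
```

Rejecting statements that lack evidence or a source at load time is one way to keep traceability from becoming an afterthought.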
29. Data Integrators have deep knowledge to help data providers birth interoperable data