Building (and traveling) the data-brick road:
A report from the front lines of data integration
Melissa Haendel, PhD
NIH Data Commons Kickoff
2017-12-06
Chasm of semantic despair
The data-brick road is a means…
…not a destination
Reuse: the last (and hardest) mile
Compliance
F A I R
VS
FAIR FAIR
Reuse requires (semantic) integration
Underappreciated obstacles to integration?
FAIR-TLC
“Traceability, Licensing, and Connectedness -- OH MY!”
Traceable Licensed Connected
bit.ly/fair-tlc
Connected
• Data models
• Ontologies
• Concept alignment
• Common Data Elements
• Identifiers in service of the above
Traceable
• Evidence
• Provenance
• Attribution
Licensed
• Clearly stated
• Comprehensive and non-negotiated
• Accessible
• Avoid restrictions on kinds of (re)use
• Avoid restrictions on who may (re)use
Traceability and Licensure go hand-in-hand:
Licensing is often a (convoluted) credit hack
Why do we persist?
A vision for the future of scholarship
Thomas Markello
Dong Chen
Justin Y. Kwan
Iren Horkayne-Szakaly
Alan Morrison
Olga Simakova
Irina Maric
Jay Lozier
Andrew R. Cullinane
Tatjana Kilo
Lynn Meister
Kourosh Pakzad
Sanjay Chainani
Roxanne Fischer
Camilo Toro
James G. White
David Adams
Cornelius Boerkoel
William A. Gahl
Cynthia J. Tifft
Meral Gunay-Aygun
Hans Goeble
Karen Balbach
Nadine Pfeifer
Sandra Werner
Christian Linden
Melissa Haendel
Peter Robinson
Chris Mungall
Sebastian Kohler
Cindy Smith
Nicole Vasilevsky
Sandra Dolken
Elizabeth Lee
Amanda Links
Will Bone
Murat Sincan
Damian Smedley
Jules Jacobson
Nicole Washington
Elise Flynn
Sebastian Kohler
Orion Buske
Marta Girdea
Michael Brudno
Jeremy Band
Melissa Haendel
David Adams
David Draper
Bailey Gallinger
Joie Davis
Nicole Vasilevsky
Heather Trang
Rena Godfrey
Gretchen Golas
Catherine Groden
Michele Nehrebecky
Ariane Soldatos
Elise Valkanas
Colleen Wahl
Lynne Wolfe
Johannes Grosse
Attila Braun
David Varga-Szabo
Niklas Beyersdorf
Boris Schneider
Lutz Zeitlmann
Petra Hanke
Patricia Schropp
Silke Mühlstedt
Carolin Zorn
Michael Huber
Carolin Schmittwolf
Wolfgang Jagla
Philipp Yu
Thomas Kerkau
Harald Schulze
Michael Nehls
Bernhard Nieswandt
Clinicians/care team, Pathologists, Ontologists, Informaticians, Curators, Basic Research
The translational workforce:
It takes a village to solve disease
GEO dataset
Gemma DRG
Genes differentially expressed: ~8,000 gene comparisons
Genes significantly expressed or unchanged: ~13,000 gene comparisons
Incongruous results
Evidence and provenance
Importance of raw data alignment
(never mind that the gene IDs had to be mapped from strings)
Increased: 1,640
Decreased: 1,110
Differential: 2,920
Increased: 4,264
Decreased: 3,833
Differential: 8,133
Both resources recorded 95% confidence intervals for significance
…how many of these data are truly reusable?
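The comparison behind the "incongruous results" slide can be sketched roughly as follows. This is a minimal illustration, not the analysis that produced the counts above: the gene labels and direction calls are made-up toy data, and upper-casing a label is a crude stand-in for properly mapping strings to stable gene identifiers.

```python
from typing import Dict, List


def normalize_label(raw: str) -> str:
    """Crude string cleanup (assumption); real pipelines map labels to stable gene IDs."""
    return raw.strip().upper()


def compare_calls(calls_a: Dict[str, str], calls_b: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare per-gene direction calls ('up', 'down', 'unchanged') from two resources."""
    a = {normalize_label(gene): call for gene, call in calls_a.items()}
    b = {normalize_label(gene): call for gene, call in calls_b.items()}
    shared = sorted(a.keys() & b.keys())
    return {
        "agree": [g for g in shared if a[g] == b[g]],
        "conflict": [g for g in shared if a[g] != b[g]],
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
    }


# Hypothetical toy inputs; none of these labels come from the slide.
resource_a = {"GeneA ": "up", "geneB": "down", "GeneC": "up"}
resource_b = {"GENEA": "up", "GeneB": "up", "GeneD": "unchanged"}
report = compare_calls(resource_a, resource_b)
```

Genes that land in "conflict", "only_a", or "only_b" are the ones that need evidence and provenance to be traced back to the raw data.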
Openness is assumed, but …
Most licenses are vague, non-standard, or missing
(n = 51 DBs)
Reusabledata.org
Identifiers are the invisible bedrock of all scientific inquiry;
the more complex the question,
the greater the reliance on ID hygiene
Figure: Required harmonization vs. question complexity
Question types: How many? What? Why? How?
Harmonization levels: Identifiers; Identifiers & Metadata; Identifiers & Metadata & Models
Labels: FAIR
Identifier Reality: Not all IDs created equal
We need systems that accommodate the heterogeneity
Figure: Identifier maturity (Persistent, Ephemeral, Non-existent) vs. scholarly output maturity (Traditional, Non-Traditional)
Labels: Literature; Genomic resources; Wild west of identifier tumbleweed
(way) beyond “linkrot”:
Pain points in identifier tech/standards
• Versioning and content evolution, AKA “content drift”
• Identifier surrogacy / granularity
• Ambiguous equivalence
• Distribution (& replication) of content over multiple providers
bit.ly/evidence-of-identifier-pain
Distribution across providers
Ambiguous equivalence case 2: Eye of the beholder
Tangible, actionable community best practice on identifiers for data integration
doi:10.1371/journal.pbio.2001414 (bit.ly/id21c-plosbio)
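One identifier practice that integrators commonly rely on is writing identifiers in compact prefix:local-ID form and resolving them through an explicit prefix map. The sketch below assumes that approach. The function names are hypothetical, the "EX" prefix and its base URL are placeholders, and the local IDs are used only to show the mechanics.

```python
from typing import Dict

# Illustrative prefix map (assumption); a real deployment would load a curated, versioned one.
PREFIX_MAP: Dict[str, str] = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "EX": "https://example.org/record/",  # placeholder namespace
}


def expand_curie(curie: str, prefix_map: Dict[str, str] = PREFIX_MAP) -> str:
    """Turn 'PREFIX:local_id' into a full IRI, failing loudly on unknown prefixes."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a compact identifier: {curie!r}")
    if prefix not in prefix_map:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefix_map[prefix] + local_id


def contract_iri(iri: str, prefix_map: Dict[str, str] = PREFIX_MAP) -> str:
    """Turn a full IRI back into compact form; the longest matching base wins."""
    matches = [(base, prefix) for prefix, base in prefix_map.items() if iri.startswith(base)]
    if not matches:
        raise KeyError(f"no known prefix for: {iri!r}")
    base, prefix = max(matches, key=lambda pair: len(pair[0]))
    return f"{prefix}:{iri[len(base):]}"


full = expand_curie("HP:0000001")
round_trip = contract_iri(full)
```

Failing on an unknown prefix, rather than guessing, keeps ambiguous identifiers from silently entering an integrated dataset.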
What integrators are aiming to do is non-trivial
Ambiguous equivalence case 4: Post-hoc harmonization
Ambiguous equivalence case 3: Fuzzy Match on xrefs/content
How are these 11 records for “Ehlers Danlos Syndrome” related to each other?
Narrow synonym? Broad? Exact? Child? Parent?
Bayesian models like k-BOOM can help
Mungall
doi:10.1101/048843
bit.ly/xref-wildwest
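To see why xrefs alone are a shaky basis for equivalence, consider the naive alternative: treat every xref as "same thing" and merge transitively. The sketch below does exactly that with a union-find structure. It is explicitly not k-BOOM and involves no probabilities; the record IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_by_xref(xrefs: List[Tuple[str, str]]) -> List[List[str]]:
    """Naively merge records into clusters by following xrefs transitively."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps lookups short
            node = parent[node]
        return node

    for left, right in xrefs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups: Dict[str, List[str]] = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical records from five made-up sources.
toy_xrefs = [("A:1", "B:7"), ("B:7", "C:3"), ("D:9", "E:2")]
clusters = cluster_by_xref(toy_xrefs)
```

Here A:1 and C:3 end up in one cluster although no source asserted that directly. If B:7 is really a broader or narrower concept, the merge is wrong, which is the gap that probabilistic approaches try to address.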
Challenges in propagating knowledge:
Different sources associate phenotypic information with
different aspects of the genotype
fgf8a<ti282a/ti282a>; shha<tbx392/+>[TL];
cdkn1ca MO3-cdkn1ca
Mysm1<tm1a>/Mysm1<tm1a>[C57BL/6]
daf-2(e1370)
ATP1A3(NM_152296.3) [c.946G>A, p.Gly316Ser]
tin(ABD/346)
Includes gene knockdowns
Includes genetic background
Includes multiple alleles
Single allele w/zygosity
allele
variant/mutation
gene
Transverse spirally arranged myofibrils are almost completely absent.
Dystonia 12
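One way to keep these levels apart is to model the genotype explicitly, so that a phenotype can be attached at the level the source actually reported and rolled up to the gene only on request. The sketch below is one possible shape, not a published schema; the class and field names are hypothetical. The two instances use examples from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class AlleleComplement:
    """The allele(s) present at one gene; allele_2 is None when zygosity is not stated."""
    gene: str
    allele_1: str
    allele_2: Optional[str] = None

    @property
    def zygosity(self) -> str:
        if self.allele_2 is None:
            return "unspecified"
        return "homozygous" if self.allele_1 == self.allele_2 else "heterozygous"


@dataclass(frozen=True)
class Genotype:
    """A full genotype: allele complements plus optional background and knockdowns."""
    complements: Tuple[AlleleComplement, ...]
    background: Optional[str] = None
    knockdown_targets: Tuple[str, ...] = ()

    def genes(self) -> Set[str]:
        """Roll the genotype up to the gene level, losing allele detail on purpose."""
        return {c.gene for c in self.complements} | set(self.knockdown_targets)


# Mysm1<tm1a>/Mysm1<tm1a>[C57BL/6]: single allele with zygosity and background.
mouse = Genotype((AlleleComplement("Mysm1", "tm1a", "tm1a"),), background="C57BL/6")
# daf-2(e1370): allele only, no zygosity or background stated.
worm = Genotype((AlleleComplement("daf-2", "e1370"),))
```

Keeping the roll-up as a separate step means the more specific information is still there when a downstream question needs it.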
Decomposition of complex concepts allows interoperability
Human phenotype: “Palmoplantar hyperkeratosis”
Mouse phenotype: “Ulcerated paws”
Component concepts: increased; stratum corneum layer of skin; autopod; keratinization
Ontologies: PATO, Uberon, GO
Species neutral ontologies, homologous concepts
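A rough sketch of the idea: if each phenotype is stored as a set of species-neutral components, two phenotypes from different species can be compared by intersecting those sets. The split of labels between ontologies below is our reading of the slide (quality from PATO, anatomy from Uberon, process from GO). The mouse-side decomposition is an assumption made only for illustration, and the class names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Component:
    """One species-neutral building block, identified here by ontology name and label."""
    ontology: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    label: str
    source: str
    parts: FrozenSet[Component]


def shared_parts(a: Phenotype, b: Phenotype) -> Set[Component]:
    """Components two phenotypes have in common, regardless of species."""
    return set(a.parts & b.parts)


human = Phenotype(
    label="Palmoplantar hyperkeratosis",
    source="Human phenotype",
    parts=frozenset({
        Component("PATO", "increased"),
        Component("Uberon", "stratum corneum layer of skin"),
        Component("Uberon", "autopod"),
        Component("GO", "keratinization"),
    }),
)
mouse = Phenotype(
    label="Ulcerated paws",
    source="Mouse phenotype",
    # Assumed decomposition: the slide does not spell out the mouse side.
    parts=frozenset({Component("Uberon", "autopod")}),
)
overlap = shared_parts(human, mouse)
```

A plain set intersection only finds identical components. Real cross-species matching also uses the ontology hierarchy, which this sketch leaves out.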
Need for a comprehensive and connected picture across resources
Goldilocks approach to harmonizing data dissonance
Genes + Environment = Phenotypes
We need interoperability not only for the types of things in our data…
G-P or D (disease)
• causes
• contributes to
• is risk factor for
• protects against
• correlates with
• is marker for
• modulates
• involved in
• increases susceptibility to
G-G (kind of)
• regulates
• negatively regulates (inhibits)
• positively regulates (activates)
• directly regulates
• interacts with
• co-localizes with
• co-expressed with
P/D - P/D
• part of
• results in
• co-occurs with
• correlates with
• hallmark of (P->D)
E-P
• contributes to (E->P)
• influences (E->P)
• exacerbates (E->P)
• manifest in (P->E)
G-E (kind of)
• expressed in
• expressed during
• contains
• inactivated by
…the relationships and their evidence must also be captured
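Capturing the relationship and its evidence alongside the two things it connects could look roughly like the record below. This is a minimal sketch under stated assumptions: the class, field names, and type labels are hypothetical, the predicate sets are copied from the G-P/D and E-P lists above, and the example values are placeholders rather than real findings.

```python
from dataclasses import dataclass

GENE_TO_PHENOTYPE_OR_DISEASE = frozenset({
    "causes", "contributes to", "is risk factor for", "protects against",
    "correlates with", "is marker for", "modulates", "involved in",
    "increases susceptibility to",
})
ENVIRONMENT_TO_PHENOTYPE = frozenset({"contributes to", "influences", "exacerbates"})

# Which predicates are allowed between which kinds of things (assumed type labels).
ALLOWED = {
    ("gene", "phenotype"): GENE_TO_PHENOTYPE_OR_DISEASE,
    ("gene", "disease"): GENE_TO_PHENOTYPE_OR_DISEASE,
    ("environment", "phenotype"): ENVIRONMENT_TO_PHENOTYPE,
}


@dataclass(frozen=True)
class Association:
    """A typed statement plus the evidence and source needed to trace it."""
    subject: str
    subject_type: str
    predicate: str
    target: str
    target_type: str
    evidence: str
    source: str

    def __post_init__(self) -> None:
        allowed = ALLOWED.get((self.subject_type, self.target_type))
        if allowed is None:
            raise ValueError(f"no predicates defined for {self.subject_type} -> {self.target_type}")
        if self.predicate not in allowed:
            raise ValueError(f"predicate {self.predicate!r} not allowed here")
        if not self.evidence.strip() or not self.source.strip():
            raise ValueError("evidence and source are required")


# Placeholder values only; not a real association.
example = Association(
    "GENE_X", "gene", "is risk factor for", "DISEASE_Y", "disease",
    evidence="example evidence statement", source="example-source",
)
```

Rejecting records that lack evidence or a source at construction time is one way to make traceability a requirement rather than an afterthought.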
Data Integrators have deep knowledge to help data providers birth interoperable data
Thank you
Julie McMurry
For her illustrative gifts
Thank you for helping build the data-brick road
Melissa Haendel, PhD
@ontowonka

Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 

Building (and traveling) the data-brick road: A report from the front lines of data integration

  • 1. Building (and traveling) the data-brick road: A report from the front lines of data integration. Melissa Haendel, PhD. NIH Data Commons Kickoff, 2017-12-06
  • 3. The data-brick road is a means… not a destination
  • 4. Reuse: the last (and hardest) mile. Compliance: F A I R vs. FAIR. Reuse requires (semantic) integration
  • 5. Underappreciated obstacles to integration? FAIR-TLC: “Traceability, Licensing, and Connectedness -- OH MY!” Traceable, Licensed, Connected
  • 6. Traceable, Licensed, Connected (bit.ly/fair-tlc). Traceable: evidence; provenance; attribution. Licensed: clearly stated; comprehensive and non-negotiated; accessible; avoid restrictions on kinds of (re)use; avoid restrictions on who may (re)use. Connected: data models; ontologies; concept alignment; Common Data Elements; identifiers in service of the above. Traceability and licensure go hand in hand: licensing is often a (convoluted) credit hack
  • 8. A vision for the future of scholarship
  • 9. Thomas Markello, Dong Chen, Justin Y. Kwan, Iren Horkayne-Szakaly, Alan Morrison, Olga Simakova, Irina Maric, Jay Lozier, Andrew R. Cullinane, Tatjana Kilo, Lynn Meister, Kourosh Pakzad, Sanjay Chainani, Roxanne Fischer, Camilo Toro, James G. White, David Adams, Cornelius Boerkoel, William A. Gahl, Cynthia J. Tifft, Meral Gunay-Aygun, Hans Goeble, Karen Balbach, Nadine Pfeifer, Sandra Werner, Christian Linden, Melissa Haendel, Peter Robinson, Chris Mungall, Sebastian Kohler, Cindy Smith, Nicole Vasilevsky, Sandra Dolken, Elizabeth Lee, Amanda Links, Will Bone, Murat Sincan, Damian Smedley, Jules Jacobson, Nicole Washington, Elise Flynn, Sebastian Kohler, Orion Buske, Marta Girdea, Michael Brudno, Jeremy Band, Melissa Haendel, David Adams, David Draper, Bailey Gallinger, Joie Davis, Nicole Vasilevsky, Heather Trang, Rena Godfrey, Gretchen Golas, Catherine Groden, Michele Nehrebecky, Ariane Soldatos, Elise Valkanas, Colleen Wahl, Lynne Wolfe, Johannes Grosse, Attila Braun, David Varga-Szabo, Niklas Beyersdorf, Boris Schneider, Lutz Zeitlmann, Petra Hanke, Patricia Schropp, Silke Mühlstedt, Carolin Zorn, Michael Huber, Carolin Schmittwolf, Wolfgang Jagla, Philipp Yu, Thomas Kerkau, Harald Schulze, Michael Nehls, Bernhard Nieswandt. Clinicians/care team, pathologists, ontologists, informaticians, curators, basic research. The translational workforce: it takes a village to solve disease
  • 10. Evidence and provenance: incongruous results. GEO dataset; Gemma; DRG. Genes differentially expressed: ~8,000 gene comparisons. Genes significantly expressed or unchanged: ~13,000 gene comparisons. Importance of raw data alignment (never mind that the gene IDs had to be mapped from strings). Increased: 1,640; decreased: 1,110; differential: 2,920. Increased: 4,264; decreased: 3,833; differential: 8,133. Both resources recorded 95% confidence intervals for significance
  • 11. Openness is assumed, but… how many of these data are truly reusable?
  • 12. Most licenses are vague, non-standard, or missing (n = 51 DBs). Reusabledata.org
  • 13. Identifiers are the invisible bedrock of all scientific inquiry; the more complex the question, the greater the reliance on ID hygiene. What? Why? How? Identifiers Identifiers & Metadata Identifiers & MetaData & Models Required harmonization Question complexity FAIR FAIR How many? FAIR
  • 14. Identifier Reality: Not all IDs created equal. We need systems that accommodate the heterogeneity. Traditional Literature Non-Traditional Persistent Ephemeral Non-existent Identifier Maturity Scholarly Output Maturity Genomic resources Wild west of identifier tumbleweed
  • 15. (way) beyond “linkrot”: Pain points in identifier tech/standards • Versioning and content evolution, AKA “content drift” • Identifier surrogacy / granularity • Ambiguous equivalence • Distribution (& replication) of content over multiple providers. bit.ly/evidence-of-identifier-pain
  • 18. Tangible, actionable community best practice on identifiers for data integration doi:10.1371/journal.pbio.2001414 (bit.ly/id21c-plosbio)
  • 19. What integrators are aiming to do is non-trivial
  • 20. Ambiguous equivalence case 4: Post-hoc harmonization
  • 21. Ambiguous equivalence case 3: Fuzzy match on xrefs/content. How are these 11 records for “Ehlers Danlos Syndrome” related to each other? Narrow synonym? Broad? Exact? Child? Parent? Bayesian models like k-BOOM can help (Mungall, doi:10.1101/048843; bit.ly/xref-wildwest)
  • 23. Challenges in propagating knowledge: Different sources associate phenotypic information with different aspects of the genotype fgf8ati282a/ti282a;shhatbx392/+[TL]; cdkn1caMO3-cdkn1ca Mysm1<tm1a>/Mysm1<tm1a>[C57BL/6] daf-2(e1370) ATP1A3(NM_152296.3) [c.946G>A, p.Gly316Ser] tin(ABD/346) Includes gene knockdowns Includes genetic background Includes multiple alleles Single allele w/zygosity allele variant/mutation gene Transverse spirally arranged myofibrils are almost completely absent. Dystonia 12
  • 24. Decomposition of complex concepts allows interoperability “Palmoplantar hyperkeratosis” increased Stratum corneum layer of skin = Human phenotype PATO Uberon Species neutral ontologies, homologous concepts Autopod keratinization GO “Ulcerated paws” Mouse phenotype =
  • 25. Need for a comprehensive and connected picture across resources
  • 27. Genes + Environment = Phenotypes. We need interoperability for not only the types of things in our data…
  • 28. G-P or D (disease): causes; contributes to; is risk factor for; protects against; correlates with; is marker for; modulates; involved in; increases susceptibility to. G-G (kind of): regulates; negatively regulates (inhibits); positively regulates (activates); directly regulates; interacts with; co-localizes with; co-expressed with. P/D - P/D: part of; results in; co-occurs with; correlates with; hallmark of (P->D). E-P: contributes to (E->P); influences (E->P); exacerbates (E->P); manifest in (P->E). G-E (kind of): expressed in; expressed during; contains; inactivated by. …the relationships and their evidence must also be captured
  • 29. Data Integrators have deep knowledge to help data providers birth interoperable data
  • 30. Thank you to Julie McMurry for her illustrative gifts
  • 31. Thank you for helping build the data-brick road. Melissa Haendel, PhD, @ontowonka