Equivalence is in the (ID) of the beholder

  1. Equivalence is in the (ID) of the beholder: Prefix Commons / Biocontext versus one-“ID” Willie. Biomedical Data Translator, NIH Data Commons. Melissa Haendel, @ontowonka
  2. Finding treasure requires aligning different knowledge modalities and perspectives: real-world concepts, annotations, mappings, standards, and team science
  3. Data science is a young field; the data is crufty, but it is valuable
  4. What integrators are aiming to do is non-trivial. But… biological things can have multiple IDs, and each ID can be written in multiple ways. Some persist, some don’t. Some have machine-readable data, some don’t. This makes integration hard.
  5. Identifiers are the invisible bedrock of all scientific inquiry; the more complex the question, the greater the reliance on ID hygiene. [Figure: required harmonization plotted against question complexity. Question labels: How many? What? Why? How? Harmonization levels: Identifiers; Identifiers & Metadata; Identifiers & Metadata & Relationships. Each level is labeled FAIR.]
  6. Identifier reality: not all IDs are created equal, and not all communities are equally poised to use them. We need systems that accommodate the heterogeneity. [Figure: identifier maturity (non-existent, ephemeral, persistent) plotted against scholarly output maturity (non-traditional to traditional literature). Labels: Genomic resources; Wild west of identifier tumbleweed; Connected.]
  7. Goonies may not need IDs, but agents do
  8. Pharos DrugBank PubMed KEGG WikiPathways HGNC HGNC NCBI ClinVar OMIM
  9. The NCATS Data Translator Knowledge Map will define a menu of types, IDs, and paths available to reasoners. Entity types: Genes, Diseases, Chemicals, Proteins. Sources shown: NCBI Gene, HGNC, Ensembl, OMIM, Organism-DBs, Panther, Clinvar, MONDO, OMIM, Orphanet, ICD-9 / ICD-10 / ICD-11, SNOMED-CT, Chebi, Drugbank, Pharos, UniProt/SwissProt, WikiPathways, Reactome.
  10. Genes + Environment = Phenotypes. Biology is not just identifying the bits, but also the relationships between them.
      • G-P or D (disease): causes; contributes to; is risk factor for; protects against; correlates with; is marker for; modulates; involved in; increases susceptibility to
      • G-G (kind of): regulates; negatively regulates (inhibits); positively regulates (activates); directly regulates; interacts with; co-localizes with; co-expressed with
      • P/D - P/D: part of; results in; co-occurs with; correlates with; hallmark of (P->D)
      • E-P: contributes to (E->P); influences (E->P); exacerbates (E->P); manifest in (P->E)
      • G-E (kind of): expressed in; expressed during; contains; inactivated by
  11. Identifier cleaning is a shared pain point (and a huge time sink), but there is no way to share the outcomes. Let’s discuss!!
  12. The NCATS Biomedical Translator Knowledge Map is a set of actions the agent can perform. 3rd party provider value: APIs and connectivity. [Figure: example path across identifiers. CHEMBL1431; Pharos:O15244 (one of 355); IDG:D908; reactome:R-HSA-374914; reactome:HSA-1430728; clinvar.variant:443497. Variant to Phenotypes: medgen:87607 Short stature; medgen:892473 Abnormality of cardiovascular system morphology; medgen:427827 Abnormality of the ear; medgen:776570 Polydactyly. Monarch PhenotypeToDisease, Phenotypes to Disease: HP:0004322 Short stature; HP:0030680 Abnormality of cardiovascular system morphology; HP:0000598 Abnormality of the ear; HP:0010442 Polydactyly; MONDO:0019391 Fanconi Anemia.]
  13. Data integration is a socio-technical problem. [Figure: spectrum of integration approaches. Labels: download everything; crawl & index; high-level abstraction; trial & error. Axis: more details to fewer details. Ends: burden on integrator (less coordination) versus burden on provider (more coordination).]
  14. (Way) beyond linkrot: types of equivalence pain
  15. Challenge 1: ID syntax polymorphism. There is no ID that is immune to getting mangled. A fine strategy for 2 variations, but …
  16. Challenge 1: ID syntax polymorphism. There is no ID that is immune to getting mangled. ORCID variants: 0000-0002-1825-0097; ORCiD:0000-0002-1825-0097; orcid:0000-0002-1825-0097; ORCiD: 0000-0002-1825-0097; orcid: 0000-0002-1825-0097. Gene example: SNCA (synuclein alpha), implicated in Parkinson Disease, has 532 possible combinations of short form and http identifiers for the same gene in the same DB! [Figure labels: DOIs; orcids; 22; 2.]
  17. Mitigation 1: Syntax polymorphism: minimize it where you can, document it where you can’t.
      Repos:
      • Document how you want others to reference your IDs
      • Implement QC to check the IDs of records you xref
      • Implement Signposting (Herbert van de Sompel)
      Technical needs:
      • Generic identifier QC services (
      • Tooling to diagnose and roll up equivalents (underway)
      • Monarch Initiative / NIH Data Commons / NIH Data Translator
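The "diagnose and roll up equivalents" idea can be sketched for the ORCID variants shown on slide 16. This is a minimal illustration, not the tooling the slide refers to: the canonical form (`orcid:` plus the local ID) and the function names `normalize_orcid` and `roll_up` are assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Accepts a bare local ID, a prefixed form in any casing (with or without a
# space after the colon), or an http(s) URL form.
ORCID_RE = re.compile(
    r"""^\s*
        (?:https?://orcid\.org/|orcid\s*:\s*)?  # optional URL or prefix
        (\d{4}-\d{4}-\d{4}-\d{3}[\dX])          # local ID, last character may be X
        \s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def normalize_orcid(raw: str) -> Optional[str]:
    """Return the assumed canonical form, or None if the string is not an ORCID."""
    match = ORCID_RE.match(raw)
    if match is None:
        return None
    return "orcid:" + match.group(1).upper()


def roll_up(raw_ids: Iterable[str]) -> Dict[Optional[str], List[str]]:
    """Group raw strings by canonical form; unparseable strings land under None."""
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for raw in raw_ids:
        groups[normalize_orcid(raw)].append(raw)
    return dict(groups)


# The five spellings from the slide, plus a URL form.
VARIANTS = [
    "0000-0002-1825-0097",
    "ORCiD:0000-0002-1825-0097",
    "orcid:0000-0002-1825-0097",
    "ORCiD: 0000-0002-1825-0097",
    "orcid: 0000-0002-1825-0097",
    "https://orcid.org/0000-0002-1825-0097",
]
```

Calling `roll_up(VARIANTS)` places all six spellings under one key. A repository could run the same check as QC on the records it cross-references, flagging anything that lands under `None`.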
  18. Challenge 2: Coordinated mirroring of exact record copies within consortium members. Example: 2gc4 is a 16 chain structure with sequence from Paracoccus denitrificans, mirrored at PDBsum, Proteopedia, PDB Europe, RCSB PDB, and PDBj.
  19. Mitigation 2: Use machine-readable documentation about distributions. Find sources (eg. of machine-readable documentation about all possible endpoints. Note this doesn’t work with ad-hoc deposition of a single record in multiple places (e.g. same preprint servers, institutional repositories, etc.)
  20. Challenge 3: Knowledge evolution and implications for over-eager collapse. This evolution makes it hard to compare diagnoses made at different times. Example: DSM change. Analogous examples: organizational membership evolution and partonomy.
  21. Mitigation 3: Capture provenance, versioning, and context. Make use of it in analysis. [Figure (doi:10.1001/jamapediatrics.2014.1893): prevalence under five scenarios: 1, no changes in reporting practices and no calendar effect (baseline prevalence); 2, only a calendar effect; 3, a calendar effect and a diagnostic change; 4, a calendar effect and the inclusion of outpatients; 5, a calendar effect, a diagnostic change, and the inclusion of outpatients (total observed prevalence).]
  22. Challenge 4: Closely-related real-world entities whose distinction matters only to some. Example: Ala, D-Alanine, L-Alanine. ChEBI IDs shown: CHEBI:16977, CHEBI:16449, CHEBI:15570.
  23. Mitigation 4: Create systems that let users choose “lenses” for fuzzy or exact matching (Exact / Fuzzy). Example implementations: OpenPhacts (drugs), Monarch Initiative (phenotypes).
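A "lens" can be thought of as a function applied to each ID before comparison. The sketch below is a simplified illustration and not how OpenPhacts or Monarch implement lenses. The names `exact_lens`, `ignore_stereo_lens`, and `equivalent` are hypothetical, and the pairing of the three ChEBI IDs from slide 22 to alanine, D-alanine, and L-alanine is an assumption to verify against ChEBI.

```python
from typing import Callable

# Assumed pairing (verify against ChEBI): each stereoisomer points to the
# stereo-agnostic parent entry.
STEREO_PARENT = {
    "CHEBI:15570": "CHEBI:16449",  # assumed: D-alanine -> alanine
    "CHEBI:16977": "CHEBI:16449",  # assumed: L-alanine -> alanine
}

Lens = Callable[[str], str]


def exact_lens(curie: str) -> str:
    """Chemist's view: every ID stands only for itself."""
    return curie


def ignore_stereo_lens(curie: str) -> str:
    """Fuzzier view: collapse stereoisomers onto their parent entry."""
    return STEREO_PARENT.get(curie, curie)


def equivalent(a: str, b: str, lens: Lens = exact_lens) -> bool:
    """Two IDs are equivalent under a lens if the lens maps them to the same ID."""
    return lens(a) == lens(b)
```

Under `exact_lens` the D- and L- forms stay distinct. Under `ignore_stereo_lens` they match each other and the parent, so the caller, not the data provider, decides how much rigor a given question needs.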
  24. Challenge 5: Partially recapitulated records that reference the same ID. Example: OMIM:130000 appears as “EHLERS-DANLOS SYNDROME, CLASSIC TYPE, 1; EDSCL1” and as “Ehlers-Danlos syndrome, classic type”. [Truncated link fragments on the slide: 007522; yndrome_classic_type]
  25. Mitigation 5: Use authoritative IDs as redirection/query entry points into related data. Use JSON-LD context files to document in/out paths. [Figure: incoming and outgoing paths for OMIM, documented in _context.jsonld files; PrefixCommons.]
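A JSON-LD context used this way is essentially a prefix-to-namespace map. The sketch below shows how such a map lets a consumer move between short-form CURIEs and resolvable IRIs. The context content and the helper names `load_prefix_map`, `expand`, and `contract` are assumptions for illustration; they are not copied from PrefixCommons or any published context file.

```python
import json
from typing import Dict, Optional

# Hypothetical context file content; namespace IRIs are assumptions.
CONTEXT_JSONLD = """
{
  "@context": {
    "OMIM": "https://omim.org/entry/",
    "HP": "http://purl.obolibrary.org/obo/HP_"
  }
}
"""


def load_prefix_map(text: str) -> Dict[str, str]:
    """Read the prefix -> namespace map out of a JSON-LD context document."""
    return json.loads(text)["@context"]


def expand(curie: str, prefix_map: Dict[str, str]) -> str:
    """Outgoing path: turn 'OMIM:130000' into a full IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise KeyError(f"no namespace documented for {curie!r}")
    return prefix_map[prefix] + local_id


def contract(iri: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Incoming path: turn a full IRI back into a CURIE (longest namespace wins)."""
    candidates = [(p, ns) for p, ns in prefix_map.items() if iri.startswith(ns)]
    if not candidates:
        return None
    prefix, namespace = max(candidates, key=lambda item: len(item[1]))
    return f"{prefix}:{iri[len(namespace):]}"


PREFIX_MAP = load_prefix_map(CONTEXT_JSONLD)
```

With this map, `expand("OMIM:130000", PREFIX_MAP)` and `contract(...)` round-trip, which is what makes an authoritative ID usable as a query entry point regardless of which form a downstream record used.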
  26. Challenge 6: Records that [seem to] be about the same real-world concept (or maybe not-so-real). Too little equivalence: missing connections. Too much equivalence: false positives.
  27. Challenge 6 cont’d: Post-hoc harmonization
  28. Challenge 6 cont’d: Fuzzy match on xrefs/content. How are these 11 records for “Ehlers Danlos Syndrome” related to each other? Narrow synonym? Broad? Exact? Child? Parent? Bayesian models like k-BOOM can help (Mungall, doi:10.1101/048843).
  29. Hemolytic anemia mappings across resources. Each vocabulary is different, and they map to each other inconsistently, leading to poor interoperability and computability. [Figure legend: OMIM (brown); MESH (grey); ORDO/Orphanet (yellow); SubClassOf (solid line); Xref (dashed grey line).]
  30. Mitigation 6: Use algorithms to determine probability of equivalence. Bayesian OWL Ontology Merging (kBOOM), Mungall et al. Determine probability of equivalence based upon:
      1. Synonyms
      2. XREFs (ideally semantically typed)
      3. Graph structure
      4. (Partial) string matching of labels
      5. Prior weighting (e.g. if you know a source has specific or poor xref’ing curation strategies)
      Example applied to diseases:
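To make the idea of combining evidence types concrete, here is a deliberately simplified pairwise sketch using naive Bayes log-odds. It is not kBOOM, which reasons jointly over whole ontologies rather than treating each pair and each piece of evidence independently. The likelihood ratios, the default prior, and the `source_weight` parameter are all invented for illustration.

```python
import math
from typing import Iterable

# Hypothetical likelihood ratios: how much more likely each kind of evidence is
# when two records ARE equivalent than when they are not.
LIKELIHOOD_RATIOS = {
    "shared_synonym": 20.0,
    "xref": 10.0,
    "shared_parent": 3.0,       # stand-in for graph-structure evidence
    "partial_label_match": 2.0,
}


def equivalence_probability(
    evidence: Iterable[str],
    prior: float = 0.01,
    source_weight: float = 1.0,
) -> float:
    """Combine evidence in log-odds space, assuming the evidence is independent.

    source_weight below 1.0 discounts evidence from a source known to have
    loose xref curation.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for kind in evidence:
        log_odds += source_weight * math.log(LIKELIHOOD_RATIOS[kind])
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With no evidence the function returns the prior. Each additional piece of evidence raises the probability, and lowering `source_weight` shrinks the effect of the same evidence, which mirrors the "prior weighting" item in the list above.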
  31. ID21C: Tangible, actionable community best practice on identifiers for data integration doi:10.1371/journal.pbio.2001414 ( Please comment!
  32. Goldilocks approach to (ID) standards: from NON-adoption to nano adoption. Only the information useful for action.
  33. If we get equivalence right, what does that make possible? Example: UDP:2542 with Impaired platelet aggregation (HP:0003540) and Thrombocytopenia (HP:0001873), matched to MGI:3764834 with Abnormal platelet activation (MP:0006298) and Thrombocytopenia (MP:0003179). Genetics in Medicine 18, 608–617 (2016), doi:10.1038/gim.2015.137
  34. FAIR TLC: Traceable, Licensed, Connected (
      • Traceable
        • Attributable
        • Versioned
        • High-level typing of records for desired audience (e.g. is record for a gene? A disease? An instrumentation readout?)
      • Licensed
        • Pick a standard license; encode in file header (using a URL like if possible)
        • PIDs for license types?
      • Connected
        • Syntactic and semantic harmonization
        • Use standardized metadata, vocabs, ontologies
        • Document your identifier scheme and follow the ID21C guidelines or other best practices to avoid syntactic variation
        • Capture cross references but ensure that you semantically qualify them
        • Tell others how to link to you
        • APIs please; data dumps are not enough, but if you do nothing else:
          • Tell us what types of IDs to expect and how they are related to one another
          • Support bulk identifier operations in APIs, especially for xref’d IDs
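Two of the "Connected" recommendations, semantically qualified cross references and bulk identifier operations, can be sketched together. This is a hypothetical in-memory model, not any provider's API. The class names, the `source` and `version` values, and the choice of SKOS mapping predicates for each pair are assumptions; the ID pairs are taken from slides 12 and 33 only as illustrative data, not as curated mappings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence


@dataclass(frozen=True)
class Mapping:
    subject: str
    predicate: str   # e.g. skos:exactMatch, skos:closeMatch, skos:broadMatch
    object: str
    source: str      # who asserted the mapping (traceable, attributable)
    version: str     # which release it came from (versioned)


class XrefIndex:
    """Answers 'what is this ID equivalent to?' for many IDs in one call."""

    def __init__(self, mappings: Iterable[Mapping]) -> None:
        self._by_subject: Dict[str, List[Mapping]] = defaultdict(list)
        for mapping in mappings:
            self._by_subject[mapping.subject].append(mapping)

    def bulk_lookup(
        self,
        curies: Sequence[str],
        predicates: Sequence[str] = ("skos:exactMatch",),
    ) -> Dict[str, List[str]]:
        """Return mapped IDs per input ID, restricted to the chosen predicates."""
        return {
            curie: [
                m.object
                for m in self._by_subject.get(curie, [])
                if m.predicate in predicates
            ]
            for curie in curies
        }


# Illustrative data only; predicates, source, and version are invented.
MAPPINGS = [
    Mapping("medgen:87607", "skos:exactMatch", "HP:0004322", "example-curator", "2018-01"),
    Mapping("HP:0001873", "skos:closeMatch", "MP:0003179", "example-curator", "2018-01"),
]
INDEX = XrefIndex(MAPPINGS)
```

Because every mapping carries a predicate, a caller can ask for exact matches only or widen the request to include close matches, and because lookups are bulk, an integrator does not need one request per ID.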
  35. Don’t hoard your data like One-“ID” Willie; help it travel well via better equivalency support. With thanks to: Julie McMurry, Chris Mungall, Mathias Wawer and whole team; John Kunze, Greg Janee, Sarala Wimalaratne, Nick Juty, others. Funding: 5 R24 OD011883; OT3 TR002019-01S2; 1 OT OD025464-01; 1 OT3 HL142479-01; U24TR002306

Editor's notes

  1. Data that is not Reusable is FAI, not FAIR
  3. They don’t know anything about what info is available or how to crawl it
  4. To efficiently query datasets, the agent needs to know, at a high level of abstraction, what is in them: what semantic types, what IDs, what relationship types
  5. Add IDs Relationships
  6. The classic G+E=P. But the = has a lot that can be applied to aid the linking.
  7. When looking across the data jungle… We do expect to find radically different ways that providers model knowledge, … but we are constantly annoyed by how differently they identify it and how hard it is to find it. That should be the easy part.
  9. Chemists care about chirality, biologists less so; even if things are well identified as being fuzzy matches, it helps to have filters over which different stakeholders can interrogate at different levels of rigor.
  10. Note the two subgraphs; little overlap in the upper areas
  11. The R part of FAIR is reusability, and most reusability depends upon data mash-up. Data mashup that enables science depends upon