2. Challenge: making representations of biological knowledge interoperable
[Figure: biological databases and knowledge bases: OMIM, MGI, HGNC, FlyBase, ClinVar, CTD, DrugBank, UniProt, BGeeDb, GO, SGD, RGD, PomBase, Monarch, WormBase, PharmGKB, Reactome, GWAS catalog, CHEMBL, ENSEMBL, BioGrid, KEGG, Panther, ZFIN, Xenbase, Animal QTLdb]
3. What do we mean by knowledge here?
● Data, sensu lato : collection of values in some organized form
○ Data, sensu stricto: Output of a data collection process
■ Instrumentation or observation; raw or processed; not altered by curation
■ Serves role as evidence
■ E.g. read count in RNAseq experiment OR examination of KO mouse
○ Metadata: Data about data (or, more typically, about datasets)
■ May be curated at source, post-hoc, manually or automatically
■ E.g. details about an RNAseq experiment (factors, instrumentation, sample prep)
○ Knowledge: Propositional assertions inferred from data
■ Something you need evidence for
■ E.g.
● gene G is expressed in tissue T under condition C
● Knocking out G gives rise to phenotype P with high penetrance
● Many bio-”databases” are actually “knowledge bases” (by this definition)
● Usual caveats:
○ Other definitions available, divisions can be murky, this is a guide rather than dogma, etc
5. Haven’t we been here before?
http://www.mged.org/Meetings/presentations/OMG/sld019.htm
7. Complexity and fluidity of biological knowledge vs schema rigidity
// hypothetical strawman schema
class Gene {
String: name
String: function
String: phenotype
Protein: product
Int: start
Int: end
String: chromosome
}
Bad assumptions (function, phenotype):
- Genes actually have multiple functions
- String representation rather than vocab
Bad assumptions (start, end, chromosome):
- Different builds?
- Should be inherited from generic seq feature
Bad assumptions (product):
- Genes can have multiple products
- Products not necessarily proteins
- What about transcript, exon, ...
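One way to relax the strawman, sketched in Python under stated assumptions: the class and field names below (SeqFeature, GeneProduct, function_terms and so on) are hypothetical and chosen only to mirror the objections above, and the "EX:" identifiers are placeholders rather than real vocabulary terms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SeqFeature:
    """Generic located feature; coordinates are only meaningful for a given build."""
    id: str
    build: str
    chromosome: str
    start: int
    end: int


@dataclass
class GeneProduct:
    """A product of a gene; 'kind' allows for non-protein products."""
    id: str
    kind: str  # e.g. "protein" or "ncRNA"


@dataclass
class Gene(SeqFeature):
    """Gene inherits location from SeqFeature instead of redefining it."""
    name: str = ""
    # Lists of vocabulary term identifiers instead of single free-text strings.
    function_terms: List[str] = field(default_factory=list)
    phenotype_terms: List[str] = field(default_factory=list)
    # Zero or more products, not exactly one protein.
    products: List[GeneProduct] = field(default_factory=list)


# Placeholder identifiers, for illustration only.
gene = Gene(
    id="EX:gene1", build="EX:build1", chromosome="1", start=100, end=900,
    name="G",
    function_terms=["EX:function1", "EX:function2"],
    products=[GeneProduct("EX:product1", "protein"),
              GeneProduct("EX:product2", "ncRNA")],
)
```

Even this version hard-codes choices (one location per gene, a fixed set of fields) that other use cases would reject, which is the rigidity problem the slide points at.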
8. The backwards evolution of schema languages
● 80s: ER, SQL DDL
○ Basis in FOL, formal algebra/calculus
● 90s: OO, UML, Description Logics
○ Rich polymorphism
● 00s: XML, SOAP
○ Can’t even...
● 10s: JSON and JSON-Schema
○ No polymorphism
○ Limited typing
○ Tree-based
○ Geared towards web-apps, not rich modeling
9. What works: Open-ended knowledge representation using RDF Graphs plus OWL
● RDF: minimal representation model for representing simple facts as edges
● OWL: encodes semantics about RDF graphs
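A minimal sketch of the "facts as edges" idea, using plain Python tuples rather than an RDF library. The "ex:" identifiers are placeholders, and the inference shown (following rdfs:subClassOf edges) is only a small, RDFS-style fragment of what an OWL reasoner does.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical example facts; every "ex:" name is a placeholder.
triples: Set[Triple] = {
    ("ex:geneG", "rdf:type", "ex:ProteinCodingGene"),
    ("ex:ProteinCodingGene", "rdfs:subClassOf", "ex:Gene"),
    ("ex:Gene", "rdfs:subClassOf", "ex:SequenceFeature"),
    ("ex:geneG", "ex:expressed_in", "ex:tissueT"),
}


def superclasses(cls: str, graph: Set[Triple]) -> Set[str]:
    """All classes reachable from cls by following rdfs:subClassOf edges."""
    found: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in graph:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                stack.append(o)
    return found


def inferred_types(node: str, graph: Set[Triple]) -> Set[str]:
    """Asserted rdf:type values plus everything implied by the class hierarchy."""
    asserted = {o for s, p, o in graph if s == node and p == "rdf:type"}
    inferred = set(asserted)
    for cls in asserted:
        inferred |= superclasses(cls, graph)
    return inferred
```

Adding a new kind of fact is just adding another edge; no table or class definition has to change first.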
10. Success of OWL: Bio-Ontologies
● One datamodel (OWL), covers rich variety of interconnected biology
● APIs, SPARQL, ...
http://obofoundry.org/ontology/uberon.html
11. Analogous approach in biological databases
● GMOD Chado
● Graph-like database layered over RDBMS
● Allowed flexibility and extensibility
● Large uptake by small MODs
Mungall, C. J., Emmert, D. B., et al. (2007) A. Bioinformatics, 23(13),
i337-346. http://doi.org/10.1093/bioinformatics/btm189
https://github.com/GMOD/Chado
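The "graph-like database layered over RDBMS" idea can be sketched with SQLite. The table names below echo Chado's feature, feature_relationship and cvterm tables, but this is a heavily simplified illustration with assumed columns and made-up rows, not the actual Chado DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Vocabulary terms type both the nodes and the edges.
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Nodes: any kind of feature, typed by a vocabulary term.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
-- Edges: typed relationships between features.
CREATE TABLE feature_relationship (
    subject_id INTEGER NOT NULL REFERENCES feature(feature_id),
    object_id INTEGER NOT NULL REFERENCES feature(feature_id),
    type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
""")

# Made-up example rows.
conn.executemany(
    "INSERT INTO cvterm (cvterm_id, name) VALUES (?, ?)",
    [(1, "gene"), (2, "mRNA"), (3, "exon"), (4, "part_of")],
)
conn.executemany(
    "INSERT INTO feature (feature_id, uniquename, type_id) VALUES (?, ?, ?)",
    [(1, "geneG", 1), (2, "geneG-RA", 2), (3, "geneG-RA-exon1", 3)],
)
conn.executemany(
    "INSERT INTO feature_relationship (subject_id, object_id, type_id) VALUES (?, ?, ?)",
    [(2, 1, 4), (3, 2, 4)],
)


def parts_of(uniquename):
    """Return (uniquename, type name) for features that are direct part_of the given feature."""
    return conn.execute(
        """
        SELECT s.uniquename, t.name
        FROM feature_relationship r
        JOIN feature s ON s.feature_id = r.subject_id
        JOIN feature o ON o.feature_id = r.object_id
        JOIN cvterm t ON t.cvterm_id = s.type_id
        JOIN cvterm rel ON rel.cvterm_id = r.type_id
        WHERE o.uniquename = ? AND rel.name = 'part_of'
        ORDER BY s.uniquename
        """,
        (uniquename,),
    ).fetchall()
```

A new feature type or relationship type is a new row in cvterm rather than a schema migration, which is where the flexibility comes from.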
12. Knowledge Graphs, the most pluripotent representation of data, are no longer as exotic or experimental as they were 10 years ago. Google, Facebook, Amazon, LinkedIn and others all use them to some degree.
13. Challenge: too much flexibility
● With flexible, schema-free, graph-based representations, there are multiple ways of modeling the same thing
● OWL provides semantic open-world biological constraints
○ All genes are located_on exactly 1 chromosome
● Software often needs more rigid closed-world information model constraints
○ Information System A: gene can be located on multiple contigs/scaffolds
○ Information System B: locational info not relevant
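The difference between the two kinds of constraint can be sketched as follows. The records and function names are hypothetical; the open-world function only paraphrases how an "exactly 1" restriction is usually read in OWL (no unique-name assumption, missing values are unknown rather than absent) and is not a reasoner.

```python
from typing import Dict, List

# Hypothetical location records; identifiers are placeholders.
locations: Dict[str, List[str]] = {
    "geneA": ["chr1"],
    "geneB": [],
    "geneC": ["contig7", "contig9"],
}


def closed_world_check(values: List[str], min_count: int = 1, max_count: int = 1) -> bool:
    """Information-model reading: the record is complete, so count what is there."""
    return min_count <= len(set(values)) <= max_count


def open_world_reading(values: List[str]) -> str:
    """Open-world reading of 'exactly 1': absence of a value is not a violation."""
    distinct = set(values)
    if len(distinct) == 0:
        return "consistent: a location exists but is not recorded"
    if len(distinct) == 1:
        return "consistent"
    # Two names may denote one thing unless they are declared different.
    return "consistent only if these names denote the same thing"


# Information System A relaxes the upper bound to allow several contigs.
system_a_ok = closed_world_check(locations["geneC"], min_count=1, max_count=5)
# Information System B ignores location altogether.
system_b_ok = closed_world_check(locations["geneB"], min_count=0, max_count=0)
```

The same data passes or fails depending on which system's closed-world rule is applied, while the open-world axiom stays the same for everyone.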
14. BioLink Model Approach
● Define a powerful underlying metamodel
○ Mix aspects of closed-world UML and open-world OWL
○ Build for extensibility
○ Define exports: UML, SQL DDL, GraphQL, JSON-Schema, Java, ...
● Define core biological types (E)
○ Gene, disease, anatomical entity, ...
○ Cede detailed typology to ontologies
● Define core properties (R)
○ Id, name, synonym
○ Part-of, interacts-with, gives-rise-to
● Define taxonomy of relationships (extension of R)
○ Gene-gene-interaction, gene-tissue-expression
● Extensibility through use-case specific profiles
https://biolink.github.io/biolink-model
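To make the "metamodel plus exports" idea concrete, here is a toy sketch. ClassDef, MODEL and the two functions are hypothetical and are not the BioLink metamodel or its generators; they only illustrate how a model with inheritance might be flattened into a JSON-Schema-like structure, losing the class hierarchy along the way.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClassDef:
    """Toy metamodel element: a class with an optional parent and its own slots."""
    name: str
    is_a: Optional[str] = None
    slots: List[str] = field(default_factory=list)


# Hypothetical mini-model with core properties and one core type.
MODEL: Dict[str, ClassDef] = {
    c.name: c
    for c in [
        ClassDef("named thing", slots=["id", "name", "synonym"]),
        ClassDef("gene", is_a="named thing", slots=["part of"]),
    ]
}


def all_slots(class_name: str, model: Dict[str, ClassDef]) -> List[str]:
    """Slots of a class, inherited ones first."""
    cls = model[class_name]
    inherited = all_slots(cls.is_a, model) if cls.is_a else []
    return inherited + [s for s in cls.slots if s not in inherited]


def to_json_schema(class_name: str, model: Dict[str, ClassDef]) -> dict:
    """Flatten inheritance into a single object schema (the is_a link is lost)."""
    properties = {
        slot.replace(" ", "_"): {"type": "string"}
        for slot in all_slots(class_name, model)
    }
    return {
        "title": class_name,
        "type": "object",
        "properties": properties,
        "required": ["id"],
    }


gene_schema = to_json_schema("gene", MODEL)
```

The flattening step is one reason an export to a target without polymorphism cannot round-trip back to the source model.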
15. Browsing the model
● YAML source
● Autogenerated website docs: https://biolink.github.io/biolink-model
● OWL export
○ Protege
○ Bioportal
● JSON-Schema (lossy unless working in JSON-LD)
● GraphQL (lossy)
● UML Diagrams (lossy)
22. Profiles
● Different projects require different views of the data
○ E.g. omission/inclusion of different fields
○ Denormalizations
○ Inlining vs referencing
● Metamodel supports remixing and mixins
● One core conceptual model
● Different serializations for different profiles
● Well-defined transforms
● Caveat: this part is not well documented yet
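As a rough illustration of what "different serializations for different profiles" could look like, the sketch below renders one core set of records in two ways: by reference and inlined, optionally omitting fields. The data, function names and dictionary layout are assumptions made for this example and do not describe the model's actual profile mechanism.

```python
# One hypothetical core dataset; all identifiers are placeholders.
core = {
    "genes": {"ex:geneG": {"id": "ex:geneG", "name": "G"}},
    "tissues": {"ex:tissueT": {"id": "ex:tissueT", "name": "T"}},
    "associations": [
        {"subject": "ex:geneG", "predicate": "expressed_in", "object": "ex:tissueT"}
    ],
}


def referencing_profile(model):
    """Normalized view: associations point at nodes by identifier."""
    return [dict(a) for a in model["associations"]]


def inlining_profile(model, omit=()):
    """Denormalized view: node records are copied into each association."""
    nodes = {**model["genes"], **model["tissues"]}

    def inline(node_id):
        # Drop any fields this profile chooses to omit.
        return {k: v for k, v in nodes[node_id].items() if k not in omit}

    return [
        {
            "subject": inline(a["subject"]),
            "predicate": a["predicate"],
            "object": inline(a["object"]),
        }
        for a in model["associations"]
    ]
```

Both views are derived from the same core records by a fixed function, which is the sense in which the transforms are well defined.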
23. How do I use it? How do I get data?
● Data model is serialization neutral
○ Plus: Flexible
○ Minus: Additional layer of abstraction
● RDF/Turtle serialization
○ http://data.monarchinitiative.org/ttl/
○ Turtle conforms to association patterns
● Property graphs
○ http://neo4j.monarchinitiative.org/
● JSON
○ Challenge: lack of polymorphism
○ Available via generic model or specific models
○ API http://api.monarchinitiative.org/api/
○ Preview: https://data.monarchinitiative.org/json/
○ BDBags of JSON coming soon
24. What NOT to use the biolink-model for
● Raw data
● Metadata about a dataset
● ..
● However..
○ Underlying metamodel may be useful in providing flexible representations of these
○ Currently aligning with FHIR metamodel
25. How does this relate to KC7?
● One view: DC is about data sensu stricto, and metadata
○ Search = lightweight ontology (syns + subsumption) + metadata datamodels
○ “Knowledge bases” have their own specialized search interfaces developed by specialists
○ No role for a standard KM in DC
● Counterview
○ We’re not trying to compete with bio-KBs
○ We want to leverage knowledge to enhance data search
■ Analogous to how Google KG enhances Google search
○ Example:
■ Find TOPMed studies relevant to my disease
● Exploit KG linkages between disease-phenotype, phenotype-variable, phenotype-gene
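The search example can be sketched as a short walk over knowledge-graph edges. Everything here is hypothetical: the edge list, the predicate names (has_phenotype, measured_by, collected_in) and the "ex:" identifiers are placeholders for illustration, not real TOPMed or Monarch data.

```python
# Hypothetical knowledge-graph edges as (subject, predicate, object).
edges = [
    ("ex:diseaseD", "has_phenotype", "ex:phenotypeP"),
    ("ex:phenotypeP", "measured_by", "ex:variableV"),
    ("ex:variableV", "collected_in", "ex:study1"),
    ("ex:phenotypeQ", "measured_by", "ex:variableW"),
    ("ex:variableW", "collected_in", "ex:study2"),
]


def follow(start_nodes, predicate, graph):
    """Objects reachable from any start node over one edge with the given predicate."""
    return {o for s, p, o in graph if p == predicate and s in start_nodes}


def studies_for_disease(disease, graph):
    """disease -> phenotypes -> study variables -> studies."""
    phenotypes = follow({disease}, "has_phenotype", graph)
    variables = follow(phenotypes, "measured_by", graph)
    return follow(variables, "collected_in", graph)
```

A study whose metadata never mentions the disease by name can still be found, because the link runs through the phenotype its variables measure.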