Building a repository of biomedical ontologies with Neo4j
1. Building a repository of biomedical
ontologies with Neo4j
Simon Jupp
Samples, Phenotypes and Ontologies Team
European Bioinformatics Institute
Cambridge, UK.
2. Outline
• Why we care about ontologies in biology
• Why we need a repository of ontologies
• Building a new Ontology Lookup Service (OLS) at the
EBI
• Index OWL ontologies in Neo4j
• OLS Infrastructure
• Challenges with Neo4j
• Neo4j and Linked Open Data
3. What is EMBL-EBI?
• Part of the European
Molecular Biology
Laboratory
• International, non-profit
research institute
• Europe’s hub for
biological data services
and research
• Based in Hinxton,
Cambridge
4. Data resources at EMBL-EBI
• Genes, genomes & variation: European Nucleotide Archive, 1000 Genomes, Ensembl, Ensembl Genomes, European Genome-phenome Archive, Metagenomics portal
• Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
6. We have a lot of data silos
• A lot of public data
• Heterogeneous semantics, formats, identifiers
• EBI and other institutes invest heavily in cross-linking
resources
8. One Identity for each entity
• Mouse or Mus or mice = NCBITaxon_10088
• …but not all mice are equal
9. Building ontologies
• Put things into categories
• Helps organise the data
• Allows us to generalise over data
• Capture the relations between things
• Anatomical parts
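[Figure: example class hierarchy. Biopolymer has subclasses Nucleic Acid and Polypeptide; Polypeptide has subclass Enzyme; Nucleic Acid has subclasses DNA and RNA; RNA has subclasses tRNA, mRNA and smRNA.]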
10. Web Ontology Language (OWL)
• W3C standard vocabulary for describing
ontologies
• OWL is based on a description logic
• We can use it to describe sets of things based
on their properties
• A subclassOf B implies that all things of type A are
also things of type B
• “heart” part-of “Cardiovascular System”
• Powerful knowledge representation
‘mitochondrial chromosome’ ‘equivalent to’
chromosome and ‘part of’ some mitochondrion
11. Using a DL reasoner to infer classification
[Figure: a relatively flat asserted view is passed through an OWL reasoner to give an inferred polyhierarchy.]
12. Ontologies for life sciences
Genotype Phenotype
Sequence
Proteins
Gene products Transcript
Pathways
Cell type
BRENDA tissue /
enzyme source
Development
Anatomy
Phenotype
Plasmodium
life cycle
-Sequence types
and features
-Genetic Context
- Molecule role
- Molecular Function
- Biological process
- Cellular component
-Protein covalent bond
-Protein domain
-UniProt taxonomy
-Pathway ontology
-Event (INOH pathway
ontology)
-Systems Biology
-Protein-protein
interaction
-Arabidopsis development
-Cereal plant development
-Plant growth and developmental stage
-C. elegans development
-Drosophila development
-Human developmental anatomy, abstract
version
-Human developmental anatomy, timed version
-Mosquito gross anatomy
-Mouse adult gross anatomy
-Mouse gross anatomy and development
-C. elegans gross anatomy
-Arabidopsis gross anatomy
-Cereal plant gross anatomy
-Drosophila gross anatomy
-Dictyostelium discoideum anatomy
-Fungal gross anatomy FAO
-Plant structure
-Maize gross anatomy
-Medaka fish anatomy and development
-Zebrafish anatomy and development
-NCI Thesaurus
-Mouse pathology
-Human disease
-Cereal plant trait
-PATO (attribute and value)
-Mammalian phenotype
-Habronattus courtship
-Loggerhead nesting
-Animal natural history and life history
eVOC (Expressed
Sequence Annotation
for Humans)
Ontologies for life sciences
13. We do a lot of tagging
CL:CL_0000071 (blood vessel endothelial cell)
obo:CHEBI_39867 (valproic acid)
NCBITaxon:NCBITaxon_9606 (Homo sapiens)
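A tag like the ones above is only useful once it resolves to a full identifier. The sketch below expands a "prefix:local_name" tag into an IRI. It assumes that all three prefixes on this slide abbreviate the OBO Foundry PURL namespace; the `PREFIXES` table and the `expand_tag` helper are illustrative names, not part of OLS.

```python
# Assumed prefix table: every prefix on the slide is taken to abbreviate the
# OBO Foundry PURL namespace. A real deployment would load this from config.
OBO_BASE = "http://purl.obolibrary.org/obo/"
PREFIXES = {"CL": OBO_BASE, "obo": OBO_BASE, "NCBITaxon": OBO_BASE}


def expand_tag(tag: str) -> str:
    """Expand a 'prefix:local_name' tag into a full IRI."""
    prefix, sep, local_name = tag.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in tag: {tag!r}")
    return PREFIXES[prefix] + local_name
```

For example, `expand_tag("obo:CHEBI_39867")` gives the IRI for valproic acid under that assumption, and an unknown prefix raises a `ValueError` rather than guessing.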
15. Summary so far…
• Ontologies provide a “semantic glue” for integrating
biological data
• There are a lot of ontologies about
• The biological community needs ontology infrastructure
and services
• Ontologies can be complex
• Ontologies can be big
• Ontologies can change
16. Ontologies as Graphs
• OWL ontologies aren’t graphs, but…
… can be represented as an RDF graph
… people want to use them as graphs
• Plenty of RDF databases around
• But incomplete w.r.t. OWL semantics
• SPARQL is an acquired taste
17. Ontology repository use-cases
• Search for ontology terms
• labels, synonyms, descriptions
• Querying the structure
• Get parent/child terms
• Querying transitive closure
• Get ancestor/descendant terms
• Querying across relations
• Partonomy or development stages
• A graph database and search index should satisfy
these requirements
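The structural use-cases above can be sketched on a toy in-memory graph. This is not how OLS implements them (it uses Neo4j); the edge list, the relation names and the helper functions are all made up for illustration.

```python
from collections import defaultdict

# Toy edges (child, relation, parent). The terms and relations are
# illustrative only and are not taken from any real ontology release.
EDGES = [
    ("heart", "subclass_of", "organ"),
    ("organ", "subclass_of", "anatomical structure"),
    ("heart", "part_of", "cardiovascular system"),
    ("cardiovascular system", "subclass_of", "anatomical system"),
]


def build_index(edges):
    """Index outgoing edges: child -> list of (relation, parent)."""
    index = defaultdict(list)
    for child, relation, parent in edges:
        index[child].append((relation, parent))
    return index


def parents(index, term, relations=("subclass_of",)):
    """Direct parents of a term over the chosen relations."""
    return {p for rel, p in index.get(term, []) if rel in relations}


def ancestors(index, term, relations=("subclass_of",)):
    """Transitive closure of parents(); iterative, so cycles are safe."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents(index, stack.pop(), relations):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Passing `relations=("subclass_of", "part_of")` to `ancestors` covers the "querying across relations" case, since the closure then follows partonomy edges as well as the class hierarchy.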
18. The old Ontology Lookup Service
• EBI has been hosting a repository of over 100 biomedical
ontologies for the past 10 years
• SOAP services for programmatic access
• Up to 25 million requests per month (mostly API).
http://www.ebi.ac.uk/ontology-lookup
20. Why we need a new OLS
• Old codebase (10+ years old in
places)
• Must be updated to work with OWL (not
just OBO)
• Uses an Oracle RDBMS and SQL
for querying ontology structure
(suboptimal)
• Ditch SOAP/XML in favour of
REST/JSON
21. OLS 3.0
• Rebuilt from scratch
• Polls ontologies by URL
• Server-side checksum for detecting changes in files
• Uses Java OWL API for loading (still supports OBO)
• Infer relations with reasoner
• RESTful API built with Spring Data
• Multiple indexes for scalable querying
• SOLR server – text queries
• Embedded Neo4j – graph queries (drives REST API)
• Virtuoso server – SPARQL for advanced users
22. OLS 3 beta is now live
• http://www.ebi.ac.uk/ols/beta/
• 140 ontologies
• Neo4j version 2.2
• Runs in embedded mode
• Inside Tomcat container
• 7 million nodes
• 11 million edges
• ~10 GB on disk
• Generic ontology infrastructure
• Can load any OWL or SKOS file
• Built with standard technologies
• Solr, Neo4j, Spring IO, Thymeleaf,
Bootstrap, jQuery
• Includes stand-alone Spring-Boot app for
loading ontologies into Neo4j
• Open-source project
https://github.com/EBISPOT/OLS
23. REST API
• Search across any field in one or more ontologies (SOLR)
• /search
• Get ontology and term meta data (Neo4j)
• /ontologies
• /ontologies/{name}
• /ontologies/{name}/terms
• /ontologies/{name}/terms/{termid}
• Get related terms and navigate ontology structure (Neo4j)
• /ontologies/{name}/terms/{termid}/parents
• /ontologies/{name}/terms/{termid}/children
• /ontologies/{name}/terms/{termid}/descendants
• /ontologies/{name}/terms/{termid}/ancestors
• /ontologies/{name}/terms/{termid}/{relation} e.g. part_of
• Get JSON for common visualisation libraries (Neo4j)
• /ontologies/{name}/terms/{termid}/tree
• /ontologies/{name}/terms/{termid}/graph
http://www.ebi.ac.uk/ols/beta/api
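A minimal sketch of building URLs for the resource hierarchy above. The base URL is the one on this slide (the beta service may no longer be online), `term_url` is a hypothetical helper, and percent-encoding the term id once is an assumption: the deployed service may expect a different encoding for full IRIs.

```python
from urllib.parse import quote

# Base URL as shown on the slide; the beta service may no longer be online.
API_BASE = "http://www.ebi.ac.uk/ols/beta/api"


def term_url(ontology: str, term_id: str, relation: str = "") -> str:
    """Build a term URL, optionally followed by a sub-resource such as 'children'.

    Assumption: every path segment is percent-encoded once, with no safe
    characters, so a full IRI can be used as the term id.
    """
    parts = ["ontologies", quote(ontology, safe=""), "terms", quote(term_id, safe="")]
    if relation:
        parts.append(quote(relation, safe=""))
    return API_BASE + "/" + "/".join(parts)
```

Calling `term_url("uberon", "UBERON_0000948", "children")` yields a URL matching the `/ontologies/{name}/terms/{termid}/children` pattern; the ontology name and term id here are example values only.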
24. OWL to Neo4j schema
Label every node by type (e.g. class, property or individual) and ontology id
Label every relation by name
Include an additional index for “special relations” like partonomy and subsets
25. Nightly Neo4j build process
• Nightly crawl of all >140 registered ontologies
• Download each file and create a checksum
• If the file is new, drop the ontology from the Neo4j index
• Use the Java OWL API and a reasoner to classify the ontology (get the inferred classification)
• Use the Neo4j BatchInserter to update the Neo4j index
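The change-detection part of the nightly build can be sketched as below. The real pipeline is Java (OWL API plus the Neo4j BatchInserter); here the `reload` callback is a hypothetical stand-in for "drop, classify, re-insert", and SHA-256 is an assumed choice of checksum.

```python
import hashlib


def file_checksum(data: bytes) -> str:
    """Hex digest of a downloaded ontology file (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def has_changed(name: str, data: bytes, known: dict) -> bool:
    """True if there is no stored checksum for `name` or it differs."""
    return known.get(name) != file_checksum(data)


def nightly_update(downloads: dict, known: dict, reload) -> list:
    """Reload each changed ontology; remember its checksum only on success.

    `reload(name, data)` is a hypothetical callback standing in for dropping
    the ontology from the index, classifying it and inserting it again.
    """
    reloaded = []
    for name, data in downloads.items():
        if has_changed(name, data, known):
            reload(name, data)  # an exception here leaves the old checksum in place
            known[name] = file_checksum(data)
            reloaded.append(name)
    return reloaded
```

Storing the checksum only after a successful reload means a failed build is retried on the next run instead of being silently skipped.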
26. OLS 3.0 Infrastructure
2 x Load balanced Tomcat servers
Two data centers
Data center 1 (8 GB VM), Data center 2 (8 GB VM)
27. Why Neo4j?
• Our primary use-case required a graph store
• OWL mapping to RDF graph is complex (lots of blank
nodes)
• We wanted Spring Data and Spring Data Rest
• Less code for us to maintain
• Didn’t want to write our own DAO using SPARQL
• (We’ve tried this on another project)
• We wanted something that we could rely on with
community behind it
• Neo4j was quick to pick up
• 1 day GraphAware course 4 months ago
• Working pilot for new OLS + Neo4j 1 month later
28. Powerful yet simple queries
• Get the transitive closure for “heart” following parent and
partonomy relations from the UBERON anatomy ontology
MATCH path = (n:Class)-[r:SUBCLASSOF|RelatedTree*]
->(parent)<-[r2:SUBCLASSOF|RelatedTree]-(sibling:Class)
WHERE n.ontology_name = {0} AND n.iri = {1}
RETURN path
29. Generating visualisations
MATCH path = (n:Class)-[r:SUBCLASSOF|Related]-(parent)
WHERE n.ontology_name = {0} AND n.iri = {1}
UNWIND nodes(path) AS p
UNWIND rels(path) AS r1
RETURN {nodes: collect(distinct {iri: p.iri, label: p.label}), edges: collect
(distinct {source: startNode(r1).iri, target: endNode(r1).iri, label:
r1.label, uri: r1.uri})} AS result
Generating common JSON representations directly from Cypher is very powerful
30. Challenges
• Wanted to utilise Spring for our REST API
• We had a REST resource hierarchy that we wanted
api/ontologies/{name}/terms/{termid}/parents
api/ontologies/{name}/terms/{termid}/children
• Too hard to get this to work using an object model
and SDN alone
• No matter what we tried, we always ended up sending Neo4j
into a spin
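import java.util.Set;
import org.neo4j.graphdb.Direction;
import org.springframework.data.annotation.TypeAlias;
import org.springframework.data.neo4j.annotation.Fetch;
import org.springframework.data.neo4j.annotation.NodeEntity;
import org.springframework.data.neo4j.annotation.RelatedToVia;

// The object model we tried: eagerly fetched parent and child sets
@NodeEntity
@TypeAlias(value = "Class")
public class Term {
    @RelatedToVia(direction = Direction.OUTGOING, type = "SUBCLASSOF")
    @Fetch Set<Term> parents;
    @RelatedToVia(direction = Direction.INCOMING, type = "SUBCLASSOF")
    @Fetch Set<Term> children;
}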
31. …but it was easy enough to achieve what we
wanted with some Spring magic
Repository interface with custom Cypher
Define our own controllers
Custom Resource Assemblers for HAL links
32. Challenges
• We need dynamic fields
• Neo4j is driving the REST API
• Each ontology term has metadata where we don’t know the
field names up front (e.g. ‘created by’ or ‘comment’)
• To get the right set of dependencies we currently use
SDN 3.4.0
• Dynamic fields not supported in SDN 4.0
• We are forced to run in embedded mode
• Is this true?
• Scaling tips for running inside Tomcat, please
33. Challenges
• Full index rebuild takes up to 20 hours
• Most nights the update runs in ~2 hours
• We have one master Neo4j db
• If an ontology needs updating we take it out and then reload
• Built on a machine with 128 GB memory + SSD
• There’s always a chance we might trash the entire index
• We’d like to build an index for each ontology
independently.
• Have a final stage where we merge all the successfully built
indexes
• Other suggestions?
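One way the "build each ontology independently, then merge" idea could look, as a sketch only. `build_one` is a hypothetical callable that returns an index for one ontology, and plain dicts stand in for the Neo4j stores, so this shows the control flow and not an actual store merge.

```python
def build_all(ontologies, build_one):
    """Build every ontology independently and collect the failures.

    `build_one` is a hypothetical callable returning that ontology's index.
    """
    built, failed = {}, {}
    for name in ontologies:
        try:
            built[name] = build_one(name)
        except Exception as error:  # one bad file must not abort the run
            failed[name] = str(error)
    return built, failed


def merge(master, built):
    """Return a new master with only the successfully built indexes swapped in."""
    merged = dict(master)
    merged.update(built)
    return merged
```

Because `merge` returns a new mapping and failed builds never reach it, the previous version of a broken ontology stays in place and the existing master is never modified mid-build.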
34. Things we’d like to do
• Extract subsets from a graph
• Some nodes are tagged as being in a subset
• Helps to give a broad overview of an annotated dataset
• May require us to infer relations
[Figure: master graph and extracted subset graph]
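A sketch of what "may require us to infer relations" could mean in practice: when untagged terms sit between two subset members, the extracted graph needs a direct edge that skips them. The `parents` mapping and `extract_subset` are illustrative; the example in the usage note reuses class names from the hierarchy on slide 9.

```python
def extract_subset(parents, subset):
    """Return inferred edges (child, ancestor) between subset members.

    `parents` maps each term to its direct parents. For every subset term we
    walk upwards and stop along each path at the first ancestor that is also
    in the subset, so untagged intermediate terms are skipped over.
    """
    edges = set()
    for term in subset:
        seen, stack = set(), list(parents.get(term, ()))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in subset:
                edges.add((term, node))
            else:
                stack.extend(parents.get(node, ()))
    return edges
```

With tRNA → RNA → Nucleic Acid → Biopolymer as the master graph and a subset of tRNA, Nucleic Acid and Biopolymer, the untagged RNA node is skipped and tRNA is linked straight to Nucleic Acid.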
36. Recap
• The EBI Ontology Lookup Service provides access to the
ontologies for biological researchers and database
curators
• Main priority is providing a scalable API for external services
to develop against
• Pilot of Neo4j quickly turned into our primary index for
driving the REST API
• There is no one-size-fits-all solution for the backend; there is
always some compromise
• So we make the most of frameworks like Spring Data Solr
and Spring Data Neo4j to make creating multiple indexes
simpler
• Neo4j has been easy to get to grips with and has scaled well for
our setup with pretty much out-of-the-box configuration
37. A word on Linked Data
• We have many years' experience working with RDF and
Semantic Web technologies
• The EBI RDF platform: EBI data that has been converted to
RDF (billions of triples)
• The ontologies and the data in one big federated graph
• http://www.ebi.ac.uk/rdf - powerful data integration platform
• Semantic Web technologies have struggled to get
mainstream adoption
• Reasons: Hype, Complexity, Baggage, Poor
implementations
• Remain relevant in the life sciences
• A lot of public data out there that needs to be integrated
38. Life sciences rely on Linked Open Data
• Linked data is a rebranding of the Semantic Web
• Core principles address our data integration needs
• Use URIs to identify things
• Type things with ontology terms
• Make sure URIs resolve (self describing documents)
• Link documents together
• We would see some major wins if Neo4j were more linked-data
friendly
• This doesn’t have to mean supporting SPARQL
• A general feeling of tension between Neo4j and the RDF
community
39. Final thoughts – Neo4j and JSON-LD?
• A lot of frameworks now make it trivial to produce good
APIs
• What’s currently missing is how to integrate data from two
or more independent APIs
• Hard to crawl independent datasets for connections without
a human to interpret semantics
• Still a need to express a schema alongside the data
• W3C standards like RDF/RDFS/SKOS/OWL provide the
basic vocabularies and semantics for expressing data
schemas
• JSON-LD is bridging the gap from JSON to RDF
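To make the JSON-LD point concrete, here is a minimal sketch of wrapping a plain JSON term record as a JSON-LD document. The input shape (`iri`, `label`, `parents`) is hypothetical and is not the OLS response format; the `@context`, `@id` and `@type` keywords and the RDFS/OWL namespace IRIs are standard.

```python
import json

# Well-known W3C namespace IRIs; which keys get mapped is an assumption.
CONTEXT = {
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "subClassOf": {
        "@id": "http://www.w3.org/2000/01/rdf-schema#subClassOf",
        "@type": "@id",  # values of subClassOf are IRIs, not plain strings
    },
}


def to_json_ld(term: dict) -> dict:
    """Wrap a plain term record (hypothetical shape) as a JSON-LD document."""
    doc = {
        "@context": CONTEXT,
        "@id": term["iri"],
        "@type": "http://www.w3.org/2002/07/owl#Class",
        "label": term["label"],
    }
    if term.get("parents"):
        doc["subClassOf"] = list(term["parents"])
    return doc


def to_json_ld_text(term: dict) -> str:
    """Serialise the JSON-LD document as indented JSON text."""
    return json.dumps(to_json_ld(term), indent=2)
```

The payload stays ordinary JSON for existing clients, while the context tells a linked data consumer that `label` means `rdfs:label` and that the term is typed as an OWL class.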
40. Be open
• We are committed to making life science data public and
freely available
• Likewise the tools and software we develop to work with
the data are open
• We always strive to use products that are open and freely
available
• We can only use Neo4j while it continues to be made
available in this model
• Vendor lock-in for our products is very bad for us
• Graph databases have great potential for biology
• But we need open standards for these databases
41. Acknowledgements
• Samples, Phenotypes and Ontologies Team - Tony
Burdett, James Malone, Dani Welter, Catherine Leroy,
Sira Sarntivijai, Ilinca Tudose, Helen Parkinson
• Matt Pearce – Flax (BioSOLR project)
• Michal Bachman and GraphAware team
• Funding
• European Molecular Biology Laboratory (EMBL)
• European Union projects: DIACHRON, BioMedBridges and
CORBEL