Knetminer Backend Training, Nov 2018
1. Behind the scenes of KnetMiner
Marco Brandizi
marco.brandizi@rothamsted.ac.uk
Bioinformatics Group Training, 27/11/2018
Find these slides at:
https://www.slideshare.net/mbrandizi
9. Why Changing?
• Graph databases have emerged
• Having expressive query languages (e.g., SPARQL, Cypher)
• Having low memory footprint (and possibly scalability over clusters/clouds)
• More stable APIs and implementations
• Data Standards, Machine-Readable Data, FAIR Principles, etc etc etc
• Useful in Input: standardised data, less custom ETL to do, useful tools and techniques (e.g., SPARQL CONSTRUCT, scripting with JSON)
• Useful in Output: applications based on APIs/micro-services, query languages, machine readable & standardised data.
• New apps can be either ours or 3rd parties
• Ondex issues
• Getting old (and older with Java >8)
• All data must be in memory
• Not exactly high quality code
11. The Cypher Query/DML Language
Proteins->Reactions->Pathways:
// chain of paths, node selection via property (exploits indices)
MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction) - [:part_of] ->
(pway:Path{ title: 'apoptosis' })
// further conditions, not always so performant
WHERE prot.name =~ '(?i)^DNA.+'
// Usual projection and post-selection operators
RETURN prot.name, pway
// Relations can have properties
ORDER BY csby.pvalue
LIMIT 1000
Proteins->Reactions->Pathways:
// Single-path (or same-direction branching) easy to write
MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
- [:part_of*1..3] -> (pway:Path)
RETURN ID(prot), ID(pway) LIMIT 1000
// Very compact forms available, depending on the data
MATCH (prot:Protein) -- (pway:Path) RETURN pway
14. Exercise 1: Try Cypher
• Go to http://babvs48.rothamsted.ac.uk:7476/browser
• Use neo4j/test as credentials
• Try the query:
• MATCH (prot:Protein) - [prot2react:cs_by|pd_by] - (react:Reaction)
- [react2path:part_of] -> (pway:Path)
WHERE pway.prefName CONTAINS 'acyl carrier protein metabolism'
RETURN * LIMIT 10
• And explore the graphical result
• What do you think you’ve found?
• What do you have in () and in []?
• What’s the meaning of the ‘|’ operator?
• cs_by and pd_by are shortcuts for ‘consumed by’ and ‘produced by’
• What’s the difference between -[]- and -[]-> ?
• More help about Cypher at: https://neo4j.com/developer/cypher-query-language
15. Exercise 1: Solution
• You should see something like the figure
• Which shows the ACP pathway at the centre, a member
reaction and proteins consumed/produced by the latter
• (name:Label) matches nodes (label is synonym of type),
[name:Type] matches relations
• [r:R1|R2] matches relations of either type R1 or R2
• (src:Label1)-[r:R]->(dst:Label2) matches relations of type R
going from nodes of type Label1 to nodes of type Label2
• (n1)-[:R]-(n2) matches both directions, i.e. both (n1)-[:R]->(n2) and (n2)-[:R]->(n1)
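The matching rules above can be mimicked on a toy edge list. This is only a minimal sketch in plain Python, not how Neo4j evaluates patterns; the node identifiers (P1, R1, ...) and the helper name match are made up for illustration.

```python
# Toy edge list: (source, relation type, target). All identifiers are hypothetical.
EDGES = [
    ("P1", "cs_by", "R1"),
    ("P2", "pd_by", "R1"),
    ("R1", "part_of", "PW1"),
]

def match(edges, rel_types, directed=True):
    """Return (src, type, dst) bindings for edges whose type is in rel_types.

    rel_types plays the role of [r:R1|R2]: any of the listed types matches.
    With directed=False every hit is also returned reversed, which mimics the
    undirected pattern (n1)-[r]-(n2).
    """
    hits = [(s, t, d) for s, t, d in edges if t in rel_types]
    if not directed:
        hits += [(d, t, s) for s, t, d in edges if t in rel_types]
    return hits

# Directed: only the stored direction. Undirected: both readings of each edge.
directed_hits = match(EDGES, {"cs_by", "pd_by"})
undirected_hits = match(EDGES, {"cs_by", "pd_by"}, directed=False)
```

The undirected call returns twice as many bindings, because each stored edge can be read from either end.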
16. Exercise 2: Write Your Own Cypher
• Using the same browser, find:
• genes,
• which encode proteins,
• which are mentioned by articles that contain ‘ZmPEAMT1’ in the title
• Hints
• Use the node labels: Gene, Protein, Publication
• Use the relation types: enc (meaning ‘encodes’), pub_in (meaning ‘published in’, or ‘mentioned in’)
• Use the attribute AbstractHeader (meaning ‘publication title’)
• Use the filter operator CONTAINS, as in the previous exercise
• More info about the KnetMiner node/relation types on the left column in the Neo4j browser, and on the
following slides
17. Exercise 2: Solution
• MATCH (gene:Gene)-[enc:enc]->(prot:Protein)-[xref:pub_in]->(article:Publication)
WHERE article.AbstractHeader CONTAINS 'ZmPEAMT1'
RETURN * LIMIT 10
• Your solution might be a variant of this
19. But how to Encode Data? The Semantic Web Way
@prefix bkr: <http://www.ondex.org/bioknet/resources/> .
@prefix bk: <http://www.ondex.org/bioknet/terms/> .
@prefix bka: <http://www.ondex.org/bioknet/terms/attributes/>.
bkr:TOB1 a bk:Protein ;
  bk:participates_in <http://www.wikipathways.org/id1> ;
  bk:prefName "TOB1" ;
  bk:published_in bkr:23236473 .
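To make the role of the @prefix lines concrete, the sketch below expands prefixed names into full URIs. It is a simplified illustration in plain Python, not a Turtle parser; the function name expand is hypothetical, and the rdf: namespace is added here because 'a' is the standard Turtle shorthand for rdf:type.

```python
# Prefix map taken from the @prefix declarations, plus the standard rdf: namespace.
PREFIXES = {
    "bkr": "http://www.ondex.org/bioknet/resources/",
    "bk": "http://www.ondex.org/bioknet/terms/",
    "bka": "http://www.ondex.org/bioknet/terms/attributes/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def expand(term):
    """Expand a prefixed name such as 'bkr:TOB1' into a full URI.

    'a' is expanded to rdf:type; anything without a known prefix
    (full URIs, plain literals) is returned unchanged.
    """
    if term == "a":
        return PREFIXES["rdf"] + "type"
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term

# The first statement of the example, as a fully expanded triple.
triple = tuple(expand(t) for t in ("bkr:TOB1", "a", "bk:Protein"))
```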
25. And more
Neo4J, Cypher DBs, Graph DBs vs Semantic Web/Triple Stores
• Data exchange format
  • Graph DBs:
    - No official one, just Cypher; support for GraphML, RDF
    +/- Focus on backing applications
  • Semantic Web:
    + Focus on data sharing standards
• Data model
  • Graph DBs:
    + Relations with properties
    - Metadata/schemas/ontologies management
  • Semantic Web:
    - Relations cannot have properties (reification required)
    + Metadata/schemas/ontologies as first citizen and standardised OWL
• Performance
  • Graph DBs:
    + Complex graph traversals
  • Semantic Web:
    + Comparable in most cases
• Query language
  • Graph DBs:
    + Cypher is easier (eg, compact, implicit elems)?
    - Expressivity issues (unions)
    - No standard QL (but efforts in progress, eg, OpenCypher)
  • Semantic Web:
    - SPARQL is harder? (URIs, namespaces, verbosity)
    + SPARQL more expressive
• Standardisation, openness
  • Graph DBs:
    +/- (TinkerPop is open, Neo4J isn’t)
    + Commercial support
    + More alive and up-to-date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
  • Semantic Web:
    + Natively open, many open implementations
    - Instability and many short-lived prototypes
    - Advancements seem to be slowing down
    + Some nice open and commercial browser (LODEStar,
• Scalability, big data
  • Graph DBs:
    +/- Commercial support to clustering/clouds for Neo4J
    + Open support in TinkerPop
  • Semantic Web:
    + Load balancing/cluster solutions, commercial cloud support (eg GraphDB)
    + SPARQL over TinkerPop (via SAIL interface)
27. Why Should I Bother?
• As data consumer
• Querying data via Cypher (or SPARQL)
• In particular, define new semantic motifs to find gene-related entities
• Knowing our BioKNO ontology/schema (TODO)
• In future, querying data via API/Cypher, getting back JSON/BioKNO
• As data producer (for KnetMiner)
• Scripting with RDF/SPARQL/etc to integrate data sources (and produce KnetMiner data sets)
• Querying multiple SPARQL endpoints to produce data sets and/or integrate our KnetMiner data with other RDF/SPARQL sources
28. Exercise 3: Playing with RDF
• Study Bio-KNO examples at https://github.com/Rothamsted/bioknet-onto
• What is the meaning of ‘a’? What are the classes (ie, types) used in example 1?
• Which property types (ie, relations) link proteins, pathways and protein accessions?
• According to example 2, is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
• In example 3, why do we need more than: “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
• How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
• Hint, use bk:is_annotated_by
• How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
• Hint, use the attribute bka:EVIDENCE and bka:Score
• Possibly use further documentation:
• A quick tutorial about RDF and Turtle syntax: https://ai.ia.agh.edu.pl/wiki/_media/pl:dydaktyka:semweb:quick-tutorial-rdf-turtle.pdf
• BioKNO Ontology Reference:
• http://www.marcobrandizi.info/files/bkn-owldoc/bioknet/index.html (core)
• http://www.marcobrandizi.info/files/bkn-owldoc/bk_ondex/index.html (entities used in KnetMiner/Ondex)
29. Exercise 3: Solution
• ‘a’ is a shortcut for the URI rdf:type, which is the standard property to state that an entity is an instance of a class
• So, you can find the classes used in the example by looking at the target of the ‘a’ predicate: bk:Path, bk:Protein, bk:Accession
• is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
• The question aims at highlighting a feature of graph data, that is: automatic reasoning
• ‘CCR4-NOT core complex’ is only explicitly stated as being part of ‘CCR4-NOT complex’ (follow the bk:part_of relation and the URIs it refers to)
• So, using only the declared data in the example, a computer cannot ‘know’ that CCR4-NOT core complex is also part of ‘intracellular part’
• However, graph systems are able to work with rules like: ?x bk:part_of ?y, ?y bk:is_a ?z => ?x bk:part_of ?z
• This rule can be applied to ?x := obo:GO_0030014, ?y := obo:GO_0030015, ?z := obo:GO_0044424
• and logically infer that obo:GO_0030014 bk:part_of obo:GO_0044424
• These additional statements can be used in queries, eg: searching for all things that are part of intracellular part would return CCR4-NOT core complex in the results, even if this is not explicitly declared in the original data
• The rationale for this conclusion is that anything that is part of a CCR4-NOT complex is also part of an intracellular part, because every CCR4-NOT complex is also an intracellular part (as per is_a)
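The rule ?x part_of ?y, ?y is_a ?z => ?x part_of ?z can be sketched as a small forward-chaining loop. This is an illustrative toy in plain Python, not how a triple store or reasoner is implemented; the function name infer_part_of is hypothetical, and human-readable labels are used in place of the GO URIs.

```python
def infer_part_of(triples):
    """Apply '?x part_of ?y, ?y is_a ?z => ?x part_of ?z' until nothing new appears."""
    facts = set(triples)
    while True:
        # Join every part_of statement with every is_a statement on the shared ?y.
        new = {
            (x, "part_of", z)
            for (x, p1, y) in facts if p1 == "part_of"
            for (y2, p2, z) in facts if p2 == "is_a" and y2 == y
        } - facts
        if not new:
            return facts
        facts |= new

# Only these two statements are declared explicitly.
declared = [
    ("CCR4-NOT core complex", "part_of", "CCR4-NOT complex"),
    ("CCR4-NOT complex", "is_a", "intracellular part"),
]
inferred = infer_part_of(declared)

# Query: everything that is part of 'intracellular part', declared or inferred.
parts = {x for (x, p, z) in inferred if p == "part_of" and z == "intracellular part"}
```

After inference, the query for parts of 'intracellular part' also finds the core complex, although that statement was never declared.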
• In the example 3, why do we need more than: “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
• Because you need to provide a context for the usually binary relation, ie, you need to tell what its confidence score is and the evidence to justify the statement
• Compare this with the Neo4j equivalent
• How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
• bkr:TOB1 bk:is_annotated_by obo:GO_0003714.
• How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
• You need to add:
bkr:citation_TOB1_15489334 a bk:Relation ;
  bk:relTypeRef bk:is_annotated_by ;
  bk:relFrom bkr:TOB1 ;
  bk:relTo obo:GO_0003714 ;
  bka:Score 0.05 ;
  bka:EVIDENCE "text mining tool" .
• bka:EVIDENCE is an attribute, and it’s an alternative simplified form to represent evidence in KnetMiner (just a string, rather than a resource having multiple attributes).
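The reification pattern used in this solution can be sketched as a small helper that emits the Turtle block for a relation plus its attributes. This is an illustration only: the helper name reify and the relation identifier bkr:annotation_TOB1_GO_0003714 are hypothetical, while the bk:/bka: terms are the ones used on this slide.

```python
def reify(rel_id, rel_type, src, dst, **attributes):
    """Build a Turtle block describing a relation as a bk:Relation resource.

    Each keyword argument becomes a bka: attribute; strings are quoted,
    other values (e.g. numbers) are written as they are.
    """
    lines = [
        f"{rel_id} a bk:Relation ;",
        f"  bk:relTypeRef {rel_type} ;",
        f"  bk:relFrom {src} ;",
        f"  bk:relTo {dst}",
    ]
    for name, value in attributes.items():
        literal = f'"{value}"' if isinstance(value, str) else str(value)
        lines[-1] += " ;"  # the previous line continues the same subject
        lines.append(f"  bka:{name} {literal}")
    lines[-1] += " ."      # the last line closes the statement
    return "\n".join(lines)

turtle = reify(
    "bkr:annotation_TOB1_GO_0003714",  # hypothetical identifier
    "bk:is_annotated_by",
    "bkr:TOB1",
    "obo:GO_0003714",
    Score=0.05,
    EVIDENCE="text mining tool",
)
```

In Neo4j the same information would sit directly on the relationship as properties, which is the contrast the comparison bullet points at.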
30. Exercise 4: Data Integration based on RDF
• Study the example at https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human, which builds a KnetMiner network in RDF format (following the BioKNO ontology)
• using two tools: the SPARQL CONSTRUCT query form (https://www.futurelearn.com/courses/linked-data/0/steps/16104) to perform RDF-to-RDF transformations
• and SPARQL CONSTRUCT coupled with the TARQL tool (http://tarql.github.io/) to transform CSV/table data into RDF
• Look at the transformation https://github.com/Rothamsted/bioknet-onto/blob/master/examples/bmp_reg_human/cvt_bpax.sparql, which transforms the BioPAX RDF data into our BioKNO
• What is happening? Look at it before the next question
• Sketch a schema of the BioPAX graph that is matched by the WHERE clause and the one built by the
CONSTRUCT block. Is the new graph smaller or bigger?
• How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
• You can play with the data generated in this example at http://marcobrandizi.info:8890/sparql
• See example queries at: https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human/queries
31. Exercise 4: Solution
• The CONSTRUCT statement (which is part of the SPARQL query language) takes chains of protein/reaction/pathway expressed in the BioPAX format (note the use of the bp: namespace) and builds chains of protein/pathway in BioKNO format.
• So, it maps one format to another (an alternative would be to do so in data queries, see queries/pw_commons_fed.sparql)
• and generates a simplified representation (many KnetMiner data sets do so, since the data explorations we aim at serving don’t need certain details)
• How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
• In the CONSTRUCT block you’d have:
?comp bk:participates_in ?path.
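The idea of a CONSTRUCT transformation (match a pattern in the source graph, emit new triples from the bindings) can be mimicked on plain tuples. This is a toy sketch, not the actual cvt_bpax.sparql query: the source predicates ex:takes_part_in and ex:step_of and the identifiers P1, R1, PW1 are placeholders, not real BioPAX terms.

```python
def construct(source):
    """Mimic a CONSTRUCT over a set of (subject, predicate, object) triples.

    WHERE part:     ?prot ex:takes_part_in ?react . ?react ex:step_of ?path
    CONSTRUCT part: ?prot  bk:participates_in ?path .
                    ?react bk:participates_in ?path .   (the exercise's addition)
    """
    target = set()
    for prot, p1, react in source:
        if p1 != "ex:takes_part_in":
            continue
        for react2, p2, path in source:
            if p2 == "ex:step_of" and react2 == react:
                # Shortcut protein -> pathway: the simplified chain
                target.add((prot, "bk:participates_in", path))
                # Extra statement linking the reaction itself to the pathway
                target.add((react, "bk:participates_in", path))
    return target

source_graph = {
    ("P1", "ex:takes_part_in", "R1"),
    ("R1", "ex:step_of", "PW1"),
}
bioknet_graph = construct(source_graph)
```

The output graph contains only the constructed statements; anything in the source that the pattern does not mention is simply not carried over.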
32. Thanks!
• Even more material:
• On graph databases, standards, KnetMiner new backend:
• https://www.slideshare.net/mbrandizi/behind-the-scenes-of-knetminer-towards-standardised-and-interoperable-knowledge-graphs
• https://doi.org/10.1515/jib-2018-0023
• On Semantic Web, Linked Data, RDF, SPARQL, etc:
• https://prezi.com/hbxhz0kesfnn/sod-2014-presentations-summary
• https://goo.gl/bfF1hu
• https://www.nature.com/articles/nbt1139
• https://www.researchgate.net/publication/221024668_Ontologies_Come_of_Age
• http://mowl-power.cs.man.ac.uk/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf