Knetminer Backend Training, Nov 2018

Behind the scenes of KnetMiner
Marco Brandizi
marco.brandizi@rothamsted.ac.uk
Bioinformatics Group Training, 27/11/2018
Find these slides at:
https://www.slideshare.net/mbrandizi

<concept>
  <id>1</id>
  <pid>Q75WV3</pid>
  <description/>
  <elementOf>
    <idRef>UNIPROTKB-SwissProt</idRef>
  </elementOf>
  <ofType>
    <idRef>Protein</idRef>
  </ofType>
  <evidences>
    <evidence>
      <idRef>IMPD</idRef>
    </evidence>
  </evidences>
  <conames>
    <concept_name>
      <name>Probable trehalose-phosphate phosphatase 1</name>
      <isPreferred>true</isPreferred>
    </concept_name>
    …
<cc>
  <id>Protein</id>
  <fullname>Protein</fullname>
  <description>
    A protein is comprised of one or more Polypeptides
    and potentially other molecules.
  </description>
  <specialisationOf>
    <idRef>MolCmplx</idRef>
  </specialisationOf>
</cc>
<relation>
  <fromConcept>1</fromConcept>
  <toConcept>3</toConcept>
  <ofType>
    <idRef>participates_in</idRef>
  </ofType>
  <evidences>
    <evidence>
      <idRef>ECO:0000316</idRef>
    </evidence>
  </evidences>
  <relgds/>
</relation>
<concept>
  <id>3</id>
  <pid>GO:0009651</pid>
  <description>response to salt stress</description>
  <ofType><idRef>BioProc</idRef></ofType>
  <coaccessions>
    <concept_accession>
      <accession>GO:0009651</accession>
      <elementOf><idRef>GO</idRef></elementOf>
      <ambiguous>false</ambiguous>
    </concept_accession>
  </coaccessions>
</concept>
The OXL format
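
As a rough illustration of how the concept/relation structure above can be read programmatically, here is a minimal Python sketch using only the standard library. The `<graph>` wrapper element and the trimmed-down fragment are assumptions made for the example; a real OXL file has more wrapping and more fields than shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment modelled on the slide above.
# The <graph> wrapper is an assumption, not the real OXL root element.
OXL_SAMPLE = """
<graph>
  <concept>
    <id>1</id>
    <pid>Q75WV3</pid>
    <ofType><idRef>Protein</idRef></ofType>
  </concept>
  <concept>
    <id>3</id>
    <pid>GO:0009651</pid>
    <ofType><idRef>BioProc</idRef></ofType>
  </concept>
  <relation>
    <fromConcept>1</fromConcept>
    <toConcept>3</toConcept>
    <ofType><idRef>participates_in</idRef></ofType>
  </relation>
</graph>
"""


def load_graph(xml_text):
    """Return (concepts, relations) extracted from an OXL-like XML string."""
    root = ET.fromstring(xml_text)
    # Map the numeric concept id to (accession, concept type)
    concepts = {
        c.findtext("id"): (c.findtext("pid"), c.findtext("ofType/idRef"))
        for c in root.iter("concept")
    }
    # Each relation links two concept ids and has a type
    relations = [
        (r.findtext("fromConcept"), r.findtext("ofType/idRef"), r.findtext("toConcept"))
        for r in root.iter("relation")
    ]
    return concepts, relations


concepts, relations = load_graph(OXL_SAMPLE)
for src, rel_type, dst in relations:
    print(concepts[src][0], rel_type, concepts[dst][0])
```

Relations refer to concepts by their numeric `<id>`, so the concepts are indexed first and then resolved when printing each relation.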

But it Needs some Pre-Processing Too

Why Change?
https://funnyjunk.com/Reinvent+the+wheel/funny-pictures/5665443/

Why Change?
• Graph databases have emerged
  • Expressive query languages (e.g., SPARQL, Cypher)
  • Low memory footprint (and possibly scalability over clusters/clouds)
  • More stable APIs and implementations
• Data standards, machine-readable data, FAIR principles, etc.
  • Useful in input: standardised data, less custom ELT to do, useful tools and techniques (e.g., SPARQL CONSTRUCT, scripting with JSON)
  • Useful in output: applications based on APIs/micro-services, query languages, machine-readable and standardised data
    • New apps can be either ours or third parties'
• Ondex issues
  • Getting old (and older with Java >8)
  • All data must be in memory
  • Not exactly high-quality code

The Cypher Query/DML Language
Proteins->Reactions->Pathways:
// chain of paths, node selection via property (exploits indices)
MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction) - [:part_of] ->
  (pway:Path { title: 'apoptosis' })
// further conditions, not always so performant
WHERE prot.name =~ '(?i)^DNA.+'
// usual projection and post-selection operators
RETURN prot.name, pway
// relations can have properties
ORDER BY csby.pvalue
LIMIT 1000

Proteins->Reactions->Pathways:
// single-path (or same-direction branching) easy to write
MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
  - [:part_of*1..3] -> (pway:Path)
RETURN ID(prot), ID(pway) LIMIT 1000

// very compact forms available, depending on the data
MATCH (prot:Protein) -- (pway:Path) RETURN pway
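
To make the `[:part_of*1..3]` pattern above more concrete, the following Python sketch mimics a bounded, variable-length traversal over a toy edge list. The node names are hypothetical, and the sketch only collects reachable end nodes; Cypher itself returns one row per matching path, so this is an approximation of the idea rather than of the engine.

```python
# Toy edges as (source, relation type, target); names are hypothetical
EDGES = [
    ("React1", "part_of", "PathA"),
    ("PathA", "part_of", "PathB"),
    ("PathB", "part_of", "PathC"),
    ("PathC", "part_of", "PathD"),
]


def reachable(edges, start, rel_type, min_hops, max_hops):
    """Nodes reachable from start in min_hops..max_hops steps over rel_type edges."""
    frontier = {start}
    found = set()
    for hop in range(1, max_hops + 1):
        # Follow one more edge of the requested type from every frontier node
        frontier = {
            dst for (src, rel, dst) in edges
            if rel == rel_type and src in frontier
        }
        if hop >= min_hops:
            found |= frontier
    return found


# Roughly what (:Reaction) - [:part_of*1..3] -> (pway) reaches from React1
print(sorted(reachable(EDGES, "React1", "part_of", 1, 3)))
```

PathD is four hops away from React1, so it falls outside the 1..3 range and is not returned.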

Cypher as Semantic Motif Language

Exercise 1: Try Cypher
• Go to http://babvs48.rothamsted.ac.uk:7476/browser
• Use neo4j/test as credentials
• Try the query:
  MATCH (prot:Protein) - [prot2react:cs_by|pd_by] - (react:Reaction)
    - [react2path:part_of] -> (pway:Path)
  WHERE pway.prefName CONTAINS 'acyl carrier protein metabolism'
  RETURN * LIMIT 10
• And explore the graphical result
• What do you think you’ve found?
• What do you have in () and in []?
• What’s the meaning of the ‘|’ operator?
  • cs_by and pd_by are shortcuts for ‘consumed by’ and ‘produced by’
• What’s the difference between -[]-> and -[]- ?
• More help about Cypher at: https://neo4j.com/developer/cypher-query-language

Exercise 1: Solution
• You should see something like the figure
  • It shows the ACP pathway at the centre, a member reaction and the proteins consumed/produced by the latter
• (name:Label) matches nodes (a label is a synonym of type), [name:Type] matches relations
• [r:R1|R2] matches relations of either type R1 or R2
• (src:Label1)-[r:R]->(dst:Label2) matches relations of type R going from nodes of type Label1 to nodes of type Label2
• (n1)-[:R]-(n2) matches both directions, so both n1->r1->n2 and n2->r2->n1
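
The difference between the directed and undirected patterns can be sketched in Python over a toy edge list. The node names below are hypothetical and the matcher is a deliberate simplification (no labels, no self-loops), not how Neo4j evaluates patterns.

```python
# Toy edges as (source, relation type, target); names are hypothetical
EDGES = [
    ("ProtA", "cs_by", "React1"),
    ("ProtB", "pd_by", "React1"),
    ("React1", "part_of", "Path1"),
]


def match(edges, rel_types, directed):
    """Return (n1, rel, n2) bindings for (n1)-[:R1|R2]->(n2) or (n1)-[:R1|R2]-(n2)."""
    results = []
    for src, rel, dst in edges:
        if rel not in rel_types:  # the '|' operator: any of the listed types
            continue
        results.append((src, rel, dst))
        if not directed:
            # An undirected pattern also binds the edge the other way round
            results.append((dst, rel, src))
    return results


print(match(EDGES, {"cs_by", "pd_by"}, directed=True))
print(match(EDGES, {"cs_by", "pd_by"}, directed=False))
```

With the undirected pattern each edge is matched twice, once per orientation, which is why adding node labels such as `(prot:Protein)` and `(react:Reaction)` is a common way to pin down which end is which.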

Exercise 2: Write Your Own Cypher
• Using the same browser, find:
  • genes,
  • which encode proteins,
  • which are mentioned in articles that contain ‘ZmPEAMT1’ in the title
• Hints
  • Use the node labels: Gene, Protein, Publication
  • Use the relation types: enc (meaning ‘encodes’), pub_in (meaning ‘published in’, or ‘mentioned in’)
  • Use the attribute AbstractHeader (meaning ‘publication title’)
  • Use the filter operator CONTAINS, as in the previous exercise
  • More info about the KnetMiner node/relation types is in the left column of the Neo4j browser, and on the following slides

Exercise 2: Solution
• MATCH (gene:Gene)-[enc:enc]->(prot:Protein)-[xref:pub_in]->(article:Publication)
  WHERE article.AbstractHeader CONTAINS 'ZmPEAMT1'
  RETURN * LIMIT 10
• Your solution might be a variant of this

But how to Encode Data? The Semantic Web Way

@prefix bkr: <http://www.ondex.org/bioknet/resources/> .
@prefix bk: <http://www.ondex.org/bioknet/terms/> .
@prefix bka: <http://www.ondex.org/bioknet/terms/attributes/> .
bkr:TOB1 a bk:Protein ;
  bk:participates_in <http://www.wikipathways.org/id1> ;
  bk:prefName "TOB1" ;
  bk:published_in bkr:23236473 .

select distinct ?prot ?comp
where {
  ?prot a bk:Protein;
    rdfs:label ?protLabel.
  filter ( contains ( ?protLabel, 'TOB1' ) ).
  ?enz bk:activated_by ?prot.
  ?enz bk:activated_by ?comp.
  ?comp rdfs:label ?compLabel.
}
LIMIT 1000
Querying KnetMiner with SPARQL

select distinct ?prot ?pway
where {
  { # Branch 1
    ?prot bk:pd_by|bk:cs_by ?react.
    ?prot a bk:Protein.
    ?react a bk:Reaction.
    ?react bk:part_of ?pway.
    ?pway a bk:Path.
  }
  union { # Branch 2
    ?prot ^bk:ac_by|bk:is_a ?enz.
    ?prot a bk:Protein.
    ?enz a bk:Enzyme.
    { # Branch 2.1
      ?enz bk:ac_by|bk:in_by ?comp.
      ?comp a bk:Compound.
      ?comp bk:cs_by|bk:pd_by ?trns.
      ?trns a bk:Transport
    } union {
      # Branch 2.2
      ?enz ^bk:ca_by ?trns.
      ?comp a bk:Compound.
      ?trns a bk:Transport
    }
    ?trns bk:part_of ?pway.
    ?pway a bk:Path.
  }
} LIMIT 1000
Querying KnetMiner with SPARQL

And more
Data exchange format
  Neo4j, Cypher DBs, graph DBs:
    - No official one, just Cypher; support for GraphML, RDF
    +/- Focus on backing applications
  Semantic Web/triple stores:
    + Focus on data sharing standards

Data model
  Neo4j, Cypher DBs, graph DBs:
    + Relations with properties
    - Metadata/schemas/ontologies management
  Semantic Web/triple stores:
    - Relations cannot have properties (reification required)
    + Metadata/schemas/ontologies as first-class citizens, and standardised (OWL)

Performance
  Neo4j, Cypher DBs, graph DBs:
    + Complex graph traversals
  Semantic Web/triple stores:
    + Comparable in most cases

Query language
  Neo4j, Cypher DBs, graph DBs:
    + Cypher is easier (e.g., compact, implicit elements)?
    - Expressivity issues (unions)
    - No standard query language (but efforts in progress, e.g., OpenCypher)
  Semantic Web/triple stores:
    - SPARQL is harder? (URIs, namespaces, verbosity)
    + SPARQL is more expressive

Standardisation, openness
  Neo4j, Cypher DBs, graph DBs:
    +/- (TinkerPop is open, Neo4j isn’t)
    + Commercial support
    + More alive and up to date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
  Semantic Web/triple stores:
    + Natively open, many open implementations
    - Instability and many short-lived prototypes
    - Advancements seem to be slowing down
    + Some nice open and commercial browsers (LODEStar, …)

Scalability, big data
  Neo4j, Cypher DBs, graph DBs:
    +/- Commercial support for clustering/clouds in Neo4j
    + Open support in TinkerPop
  Semantic Web/triple stores:
    + Load balancing/cluster solutions, commercial cloud support (e.g., GraphDB)
    + SPARQL over TinkerPop (via SAIL interface)
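
The "Data model" row above can be illustrated with a small Python sketch. In a property graph, a relation is one edge carrying its own properties; in RDF, the same relation with attributes has to be reified into a resource described by several triples. The `bk:Relation`, `bk:relTypeRef`, `bk:relFrom` and `bk:relTo` terms are the ones used in the Exercise 3 solution; the relation identifier and the tuple-based triple representation are assumptions made for this example.

```python
# Property-graph style: one edge with its own properties
edge = {
    "from": "bkr:TOB1",
    "type": "bk:is_annotated_by",
    "to": "obo:GO_0003714",
    "props": {"bka:Score": 0.05, "bka:EVIDENCE": "text mining tool"},
}


def reify(rel_id, edge):
    """Turn a property-graph edge into reified RDF-style (s, p, o) triples."""
    triples = [
        (rel_id, "rdf:type", "bk:Relation"),
        (rel_id, "bk:relTypeRef", edge["type"]),
        (rel_id, "bk:relFrom", edge["from"]),
        (rel_id, "bk:relTo", edge["to"]),
    ]
    # Each edge property becomes one more triple about the relation resource
    triples += [(rel_id, key, value) for key, value in sorted(edge["props"].items())]
    return triples


# The relation identifier below is hypothetical
for triple in reify("bkr:annotation_TOB1_GO_0003714", edge):
    print(triple)
```

One edge with two properties becomes six triples, which is the verbosity cost of reification that the comparison refers to.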

Why Should I Bother?
• As data consumer
  • Querying data via Cypher (or SPARQL)
    • In particular, define new semantic motifs to find gene-related entities
  • Knowing our BioKNO ontology/schema (TODO)
  • In future, querying data via API/Cypher, getting back JSON/BioKNO
• As data producer (for KnetMiner)
  • Scripting with RDF/SPARQL/etc. to integrate data sources (and produce KnetMiner data sets)
  • Querying multiple SPARQL endpoints to produce data sets and/or integrate our KnetMiner data with other RDF/SPARQL sources

Exercise 3: Playing with RDF
• Study the BioKNO examples at https://github.com/Rothamsted/bioknet-onto
  • What is the meaning of ‘a’? What are the classes (i.e., types) used in example 1?
  • Which property types (i.e., relations) link proteins, pathways and protein accessions?
  • According to example 2, is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
  • In example 3, why do we need more than “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
  • How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
    • Hint: use bk:is_annotated_by
  • How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
    • Hint: use the attributes bka:EVIDENCE and bka:Score
• Possibly use further documentation:
  • A quick tutorial about RDF and Turtle syntax: https://ai.ia.agh.edu.pl/wiki/_media/pl:dydaktyka:semweb:quick-tutorial-rdf-turtle.pdf
  • BioKNO Ontology Reference:
    • http://www.marcobrandizi.info/files/bkn-owldoc/bioknet/index.html (core)
    • http://www.marcobrandizi.info/files/bkn-owldoc/bk_ondex/index.html (entities used in KnetMiner/Ondex)

Exercise 3: Solution
• ‘a’ is a shortcut for the URI rdf:type, which is the standard property to state that an entity is an instance of a class
  • So, you can find the classes used in the example by looking at the targets of the ‘a’ predicate: bk:Path, bk:Protein, bk:Accession
• Is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
  • The question aims at highlighting a feature of graph data, that is: automatic reasoning
  • ‘CCR4-NOT core complex’ is only explicitly stated as being part of ‘CCR4-NOT complex’ (follow the bk:part_of relation and the URIs it refers to)
  • So, using only the declared data in the example, a computer cannot ‘know’ that CCR4-NOT core complex is also part of ‘intracellular part’
  • However, graph systems are able to work with rules like: ?x bk:part_of ?y, ?y bk:is_a ?z => ?x bk:part_of ?z
  • This rule can be applied to ?x := obo:GO_0030014, ?y := obo:GO_0030015, ?z := obo:GO_0044424
  • and logically infer that obo:GO_0030014 bk:part_of obo:GO_0044424
  • These additional statements can be used in queries, e.g., searching for all things that are part of intracellular part would return CCR4-NOT core complex in the results, even if this is not explicitly declared in the original data
  • The rationale for this conclusion is that anything that is part of a CCR4-NOT complex is also part of something that is an intracellular part, because every CCR4-NOT complex is also an intracellular part (as per bk:is_a)
• In example 3, why do we need more than “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
  • Because you need to provide a context for the usually binary relation, i.e., you need to tell what its confidence score is and the evidence that justifies the statement
  • Compare this with the Neo4j equivalent
• How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
  • bkr:TOB1 bk:is_annotated_by obo:GO_0003714.
• How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
  • You need to add:
    bkr:citation_TOB1_15489334 a bk:Relation ;
      bk:relTypeRef bk:is_annotated_by ;
      bk:relFrom bkr:TOB1 ;
      bk:relTo obo:GO_0003714 ;
      bka:Score 0.05 ;
      bka:EVIDENCE "text mining tool" .
  • bka:EVIDENCE is an attribute, and it’s an alternative simplified form to represent evidence in KnetMiner (just a string, rather than a resource having multiple attributes).
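
The part_of/is_a rule discussed in this solution can be tried out with a few lines of Python. This is only a sketch of the idea: triples are plain tuples, the rule is hard-coded, and the two input statements use the variable bindings quoted in the slide; real RDF reasoners are far more general.

```python
# Declared statements, using the bindings for ?x, ?y and ?z quoted above
DECLARED = {
    ("obo:GO_0030014", "bk:part_of", "obo:GO_0030015"),
    ("obo:GO_0030015", "bk:is_a", "obo:GO_0044424"),
}


def infer_part_of(triples):
    """Apply ?x bk:part_of ?y, ?y bk:is_a ?z => ?x bk:part_of ?z until nothing new appears."""
    facts = set(triples)
    while True:
        derived = {
            (x, "bk:part_of", z)
            for (x, p1, y) in facts if p1 == "bk:part_of"
            for (y2, p2, z) in facts if p2 == "bk:is_a" and y2 == y
        }
        if derived <= facts:  # fixpoint: the rule adds nothing new
            return facts
        facts |= derived


inferred = infer_part_of(DECLARED) - DECLARED
print(inferred)
```

A query for everything that is `bk:part_of obo:GO_0044424` finds a result only in the inferred set, which is the point the exercise makes about reasoning over graph data.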

Exercise 4: Data Integration based on RDF
• Study the example at https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human, which builds a KnetMiner network in RDF format (following the BioKNO ontology)
  • using two tools: the SPARQL CONSTRUCT clause (https://www.futurelearn.com/courses/linked-data/0/steps/16104) to perform RDF-to-RDF transformations
  • and SPARQL CONSTRUCT coupled with the TARQL tool (http://tarql.github.io/) to transform CSV/table data into RDF
• Look at the transformation https://github.com/Rothamsted/bioknet-onto/blob/master/examples/bmp_reg_human/cvt_bpax.sparql, which transforms the BioPAX RDF data into our BioKNO
  • What is happening? Look at it before the next question
  • Sketch a schema of the BioPAX graph that is matched by the WHERE clause and the one built by the CONSTRUCT block. Is the new graph smaller or bigger?
  • How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
• You can play with the data generated in this example at http://marcobrandizi.info:8890/sparql
  • See example queries at: https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human/queries

Exercise 4: Solution
• The CONSTRUCT statement (which is part of the SPARQL query language) takes chains of protein/reaction/pathway expressed in the BioPAX format (note the use of the bp: namespace) and builds chains of protein/pathway in BioKNO format.
  • So, it maps one format to another (an alternative would be to do so in data queries, see queries/pw_commons_fed.sparql)
  • and generates a simplified representation (many KnetMiner data sets do so, since the data explorations we aim at serving don’t need certain details)
• How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
  • In the CONSTRUCT block you’d have:
    ?comp bk:participates_in ?path.
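
The match-then-build behaviour of CONSTRUCT can be imitated in Python to see why the resulting graph is simpler. This sketch does not reproduce cvt_bpax.sparql: the matched pattern (`bp:pathwayComponent`, `bp:left`, `bp:right`) and the toy resource names are assumptions chosen for illustration.

```python
# Toy BioPAX-like triples; resource names are hypothetical
BP_TRIPLES = [
    ("ex:pathway1", "bp:pathwayComponent", "ex:reaction1"),
    ("ex:reaction1", "bp:left", "ex:proteinA"),
    ("ex:reaction1", "bp:right", "ex:proteinB"),
]


def construct(bp_triples):
    """Imitate: CONSTRUCT { ?prot bk:participates_in ?pway }
    WHERE { ?pway bp:pathwayComponent ?react . ?react bp:left|bp:right ?prot }"""
    built = set()
    for pway, p1, react in bp_triples:
        if p1 != "bp:pathwayComponent":
            continue
        # Join on the reaction to reach the proteins on either side of it
        for react2, p2, prot in bp_triples:
            if react2 == react and p2 in ("bp:left", "bp:right"):
                built.add((prot, "bk:participates_in", pway))
    return built


for triple in sorted(construct(BP_TRIPLES)):
    print(triple)
```

Here three input triples collapse into two, because the intermediate reaction node is dropped; adding the reaction-to-pathway link asked for in the exercise amounts to emitting one more triple inside the outer loop.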

Thanks!
• Even more material:
  • On graph databases, standards, KnetMiner new backend:
    • https://www.slideshare.net/mbrandizi/behind-the-scenes-of-knetminer-towards-standardised-and-interoperable-knowledge-graphs
    • https://doi.org/10.1515/jib-2018-0023
  • On Semantic Web, Linked Data, RDF, SPARQL, etc.:
    • https://prezi.com/hbxhz0kesfnn/sod-2014-presentations-summary
    • https://goo.gl/bfF1hu
    • https://www.nature.com/articles/nbt1139
    • https://www.researchgate.net/publication/221024668_Ontologies_Come_of_Age
    • http://mowl-power.cs.man.ac.uk/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf
