2. Talk Overview
• How to do Neo4j bioinformatics on a local computer with public data
• Three use cases:
o for genome analyses
o for carbohydrate-active enzymes
o for antibiotic resistance in CARD
• Conclusions
[Figure: Genomes, Genes, Phenotypes. Example phenotypes: degrade cellulose, resist cephalosporin, transport Fe2+]
3. About me – Sixing Huang
• Studied biology and bioinformatics in Bremen.
• Worked as a bioinformatics data scientist at DSMZ Braunschweig.
• Now a bioinformatics scientist at MGI Shenzhen.
• First contact with Neo4j in 2019.
• Now use Neo4j for knowledge management and as a genome browser and
database; write about my Neo4j journey on medium.com.
4. Genomes have structures
• Genes are more than a bag of words.
• Neo4j can model genome structures.
[Figure: Gene1, Gene2, Gene3 in genome order]
5. Neo4j as a genome browser
[Figure: EMBL file -> Import -> Neo4j genome browser]
6. Gene CAZy clusters and annotations
MATCH p=(f0:Gene)-[:NEXT*5]->(f1:Gene)-[:NEXT*5]->(f2:Gene)
WHERE f1.name =~ '.+GH16[^a-zA-Z\\d\\s:]*'
RETURN p
Neo4j Commander
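The query above can be read as: find every gene whose name ends in a GH16 annotation and return it with five neighbours on each side. A minimal in-memory sketch of that logic follows. It assumes the character class in the pattern was meant to exclude letters, digits, whitespace and colons; the function name and the gene names are hypothetical, and a plain list stands in for the NEXT chain.

```python
import re

# Assumption: the class excludes letters, digits, whitespace and colons,
# so "... GH16" matches but "... GH161" or "... GH16 family" do not.
GH16_PATTERN = re.compile(r".+GH16[^a-zA-Z\d\s:]*")


def gh16_neighbourhoods(gene_names, flank=5):
    """Return windows of `flank` genes on each side of every GH16 gene.

    `gene_names` is a list in genome order, standing in for a chain of
    (:Gene)-[:NEXT]->(:Gene) nodes. Genes too close to either end are
    skipped, because the path pattern needs exactly `flank` hops both ways.
    """
    windows = []
    for i, name in enumerate(gene_names):
        # fullmatch mirrors Cypher's =~, which must match the whole string
        if GH16_PATTERN.fullmatch(name) is None:
            continue
        if i - flank < 0 or i + flank >= len(gene_names):
            continue
        windows.append(gene_names[i - flank : i + flank + 1])
    return windows


# Hypothetical gene names, for illustration only
genes = (
    [f"gene{n}" for n in range(1, 6)]
    + ["laminarinase GH16"]
    + [f"gene{n}" for n in range(7, 12)]
)
print(gh16_neighbourhoods(genes))
```

In the graph itself the window comes from the two `[:NEXT*5]` hops, so no index arithmetic is needed there.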
7. Bacterial genome data
Data in biological studies
Rows: genomes; columns: genes
Genome1 3 4 1 0 1 0 1 3 2 1 3 2 1
Genome2 2 1 3 1 3 0 1 3 2 1 4 1 2
Neo4j can compare core and pan genomes effectively.
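As a rough illustration of the core versus pan genome comparison, the sketch below treats the two rows from the slide as per-gene counts (an assumption; the slide does not label the values) and uses set operations. The gene labels g1 to g13 are hypothetical.

```python
# Values copied from the slide; reading them as per-gene counts is an
# assumption, and the gene labels g1..g13 are hypothetical.
counts = {
    "Genome1": [3, 4, 1, 0, 1, 0, 1, 3, 2, 1, 3, 2, 1],
    "Genome2": [2, 1, 3, 1, 3, 0, 1, 3, 2, 1, 4, 1, 2],
}
gene_ids = [f"g{i}" for i in range(1, 14)]


def present_genes(row):
    """Genes with a non-zero count in one genome."""
    return {gene for gene, n in zip(gene_ids, row) if n > 0}


gene_sets = [present_genes(row) for row in counts.values()]

# Core genome: genes present in every genome.
core_genome = set.intersection(*gene_sets)
# Pan genome: genes present in at least one genome.
pan_genome = set.union(*gene_sets)

print(len(core_genome), len(pan_genome))
print(sorted(pan_genome - core_genome))  # genes missing from some genome
```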
8. Heavy use of orthology
Taxonomy
Phylum
Class1 Class2
Order1 Order2 Order3
Genome1 Genome2 Genome3 Genome4
KEGG gene annotation
Metabolism
  Carbohydrate metabolism
    Glycolysis / Gluconeogenesis
      alcohol dehydrogenase (K00001)
      hexokinase (K00844)
  Lipid metabolism
    Fatty acid degradation
      acyl-CoA dehydrogenase (K06445)
Neo4j can model orthology intuitively.
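One way to picture the orthology hierarchy is as child-to-parent links that can be walked upwards, which is roughly what a variable-length path does in the graph. The sketch below is only an illustration with a plain dictionary; the function names are hypothetical, and placing hexokinase (K00844) under glycolysis is an assumption about the slide layout.

```python
from collections import Counter

# Child -> parent links read off the KEGG tree on the slide. Placing
# K00844 (hexokinase) under glycolysis is an assumption about the layout.
PARENT = {
    "K00001": "Glycolysis / Gluconeogenesis",
    "K00844": "Glycolysis / Gluconeogenesis",
    "K06445": "Fatty acid degradation",
    "Glycolysis / Gluconeogenesis": "Carbohydrate metabolism",
    "Fatty acid degradation": "Lipid metabolism",
    "Carbohydrate metabolism": "Metabolism",
    "Lipid metabolism": "Metabolism",
}


def lineage(node):
    """Walk from a KO up to the root, returning every node on the way."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def count_by_ancestor(kos, depth):
    """Count KOs per ancestor `depth` steps above the KO (1 = pathway)."""
    return Counter(lineage(ko)[depth] for ko in kos)


print(lineage("K06445"))
print(count_by_ancestor(["K00001", "K00844", "K06445"], depth=2))
```

The same roll-up works for the taxonomy tree by swapping in genome-to-order-to-class links.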
10. Chromobacterium sp. ATCC 53434
KO in sister genomes = filter
KO in ATCC 53434 - filter = unique KO in ATCC 53434
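The filtering step amounts to a set difference: pool the KO of all sister genomes, then subtract that pool from the KO of the target genome. A minimal sketch, assuming each genome's KO are available as a Python set; the function name, sister names and sister KO sets are hypothetical, and the KO identifiers are borrowed from other slides in this deck.

```python
def unique_kos(target, sisters):
    """KOs in `target` that occur in none of the `sisters` genomes."""
    # The filter is the union of every sister genome's KO set.
    sister_filter = set().union(*sisters.values())
    return target - sister_filter


# Hypothetical toy data; real genomes carry thousands of KOs.
target = {"K00001", "K00844", "K04783", "K13255"}
sisters = {
    "sister_A": {"K00001", "K00844"},
    "sister_B": {"K00844", "K06445"},
}
print(sorted(unique_kos(target, sisters)))
```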
11. Unique KO in ATCC 53434
KO       Annotation
K04783   yersiniabactin salicyl-AMP ligase [EC:6.3.2.-]
K04784   yersiniabactin nonribosomal peptide synthetase
K12241   pyochelin biosynthesis protein PchG
K12242   pyochelin biosynthesis protein PchC
K13255   ferric iron reductase protein FhuF
K23227   ferric hydroxamate transport system substrate-binding protein
K23228   ferric hydroxamate transport system permease protein
K10829   ferric hydroxamate transport system ATP-binding protein [EC:7.2.2.16]
Row groups marked on the slide: siderophore (yersiniabactin), siderophore (pyochelin), ferric hydroxamate transport
Hypothesis: ATCC 53434 has a unique repertoire of iron-related transport proteins.
12. Phylogeny made easy
Compute the shared KO
Order by number of shared KO
Name Shared KO
Chromobacterium vaccinii 1869
Chromobacterium sp. IIBBL 112-1 1867
Chromobacterium rhizoryzae 1777
Chromobacterium haemolyticum 1776
Chromobacterium sp. 257-1 1723
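A ranking like the one in this table can be sketched in two steps: intersect the query genome's KO set with each relative's set, then sort by the size of the overlap. The example below is a minimal illustration with hypothetical names and toy KO sets, not the data behind the table.

```python
def rank_by_shared_kos(query, others):
    """Return (name, shared KO count) pairs, largest overlap first."""
    shared = {name: len(query & kos) for name, kos in others.items()}
    return sorted(shared.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data for illustration only.
query = {"K00001", "K00844", "K04783", "K06445"}
others = {
    "relative_A": {"K00001", "K00844", "K06445"},
    "relative_B": {"K00001"},
    "relative_C": {"K00001", "K00844"},
}
for name, n_shared in rank_by_shared_kos(query, others):
    print(name, n_shared)
```

The number of shared KO is a crude similarity measure, so the ordering is a quick proxy for relatedness rather than a full phylogenetic analysis.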
13. Neo4j for Carbohydrate-active EnZYmes:
Reannotation of Formosa agariphila KMM 3901
CAZy Annotation
PL28 ulvan lyase
PL37 ulvan lyase
GH28 polygalacturonase
GH78 alpha-L-rhamnosidase
GH105 unsaturated rhamnogalacturonyl hydrolase
GH86 beta-agarase
GH168 endo-alpha-(1,3)-L-fucanase
Unique CAZy not in sister genomes
[Figure: Visualization. Labels: degrade ulvan, degrade pectin, degrade sulfated polysaccharide]
Unique agarolytic life strategy
18. Conclusions
• Neo4j can serve as an all-in-one genome browser, a biodata
warehouse and a data mining tool.
• It can deliver insights more quickly than a relational database with SQL.
• Built-in machine learning can predict new connections and properties.
• GraphQL serves data to non-Neo4j users.