Bio4j

Bio4j presentation given at the Workshop: 'Graph Databases in Life Sciences'

Transcript of "Bio4j"

  1. Bio4j: A pioneer graph-based database for the integration of biological Big Data. www.ohnosequences.com www.bio4j.com
  2. Who am I? I am currently working as a bioinformatics consultant/developer/researcher at Oh no sequences! Oh no what!? We are the R&D group at Era7 Bioinformatics. We like bioinformatics, cloud computing, NGS, category theory, bacterial genomics… well, lots of things. What about Era7 Bioinformatics? Era7 Bioinformatics is a bioinformatics company specialized in sequence analysis, knowledge management and sequencing data interpretation. Our area of expertise revolves around biological sequence analysis, particularly Next Generation Sequencing data management and analysis.
  3. A bit of background… In bioinformatics we have highly interconnected, overlapping knowledge spread throughout different DBs.
  4. However, in most cases all this data is modeled in relational databases, sometimes even just as plain CSV files. As the amount and diversity of data grows, domain models become crazily complicated!
  5. With a relational paradigm, the double implication Entity ⇔ Table does not go both ways.
       - You get 'auxiliary' tables that have no relationship with the small piece of reality you are modeling.
       - You need 'artificial' IDs only for connecting entities (and these are mixed with IDs that somehow live in reality).
       - Entity-relationship models are cool, but in the end you always have to deal with 'raw' tables plus SQL.
       - Integrating new knowledge into already existing databases is hard, and sometimes not even possible without changing the domain model.
  6. Life in general and biology in particular are probably not 100% like a graph… but one thing's for sure: they are not a set of tables!
  7. What's Bio4j? Bio4j is a bioinformatics graph-based DB including most of the data available in:
       - UniProt KB (Swiss-Prot + TrEMBL)
       - NCBI Taxonomy
       - Gene Ontology (GO)
       - RefSeq
       - UniRef (50, 90, 100)
       - Enzyme DB
  8. What's Bio4j? It provides a completely new and powerful framework for querying and managing protein-related information. Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure.
  9. What's Bio4j? Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database". Thanks to both the underlying Neo4j DB and the API provided, Bio4j is also very scalable, allowing anyone to easily incorporate their own data and make the most of it.
  10. What's Bio4j? Everything in Bio4j is open source, released under AGPLv3!
  11. Bio4j in numbers. The current version (0.8) includes:
       - Relationships: 717,484,649
       - Nodes: 92,667,745
       - Relationship types: 144
       - Node types: 42
      We're approaching 1 billion relationships! :)
  12. Let's dig a bit into the Bio4j structure… Data sources and their relationships (diagram).
  13. Bio4j domain model (diagram).
  14. Bio4j modules. Bio4j includes different data sources, but you may not always be interested in having all of them. That's why the importing process is modular and customizable, allowing you to import just the data you are interested in.
  15. Bio4j modules. Keep in mind, however, that you must be coherent when choosing the data sources to include in your database; that is to say, you cannot import, for example, protein interactions without having first included the proteins! ;) Here's a schema showing the dependencies for the importing process (diagram).
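The dependency rule on the modules slide can be sketched as a small topological sort. This is only an illustration, not the actual Bio4j importer: the class `ImportOrderSketch`, the module names and the dependency map are all hypothetical, and the only dependency taken from the slide is that protein interactions require proteins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: module names and dependency edges are assumptions,
// not the real Bio4j importer configuration.
final class ImportOrderSketch {

    // Returns the selected modules ordered so that every module comes after
    // its dependencies; throws if a required dependency was not selected.
    static List<String> importOrder(Map<String, List<String>> dependsOn, Set<String> selected) {
        List<String> order = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        for (String module : selected) {
            visit(module, dependsOn, selected, done, new LinkedHashSet<>(), order);
        }
        return order;
    }

    // Depth-first visit: dependencies are appended to the order before the module itself.
    private static void visit(String module, Map<String, List<String>> dependsOn,
                              Set<String> selected, Set<String> done,
                              Set<String> inProgress, List<String> order) {
        if (done.contains(module)) {
            return;
        }
        if (!selected.contains(module)) {
            throw new IllegalArgumentException("Missing required module: " + module);
        }
        if (!inProgress.add(module)) {
            throw new IllegalStateException("Cyclic dependency at: " + module);
        }
        for (String dependency : dependsOn.getOrDefault(module, List.of())) {
            visit(dependency, dependsOn, selected, done, inProgress, order);
        }
        inProgress.remove(module);
        done.add(module);
        order.add(module);
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("proteins", List.of());
        dependsOn.put("protein_interactions", List.of("proteins"));

        Set<String> selected = new LinkedHashSet<>(List.of("protein_interactions", "proteins"));
        System.out.println(importOrder(dependsOn, selected));
    }
}
```

Selecting `protein_interactions` without `proteins` makes `importOrder` throw, which mirrors the coherence requirement described on the slide.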
  16. The Graph DB model: representation. Core abstractions:
       - Nodes
       - Relationships between nodes
       - Properties on both
  17. How are things modeled? Couldn't be simpler!
       - Entities → Nodes
       - Associations / Relationships → Edges
  18. Some examples of nodes would be: GO term, Protein, Genome Element. And an example of a relationship: Protein --PROTEIN_GO_ANNOTATION--> GO term.
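The three core abstractions (nodes, relationships, properties on both) can be illustrated with a tiny in-memory model. This is a sketch only and is not the Neo4j or Bio4j API: the classes `Node` and `Relationship`, the helper methods and the GO identifier value are hypothetical; only the relationship type `PROTEIN_GO_ANNOTATION` and the accession `P12345` come from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of the property graph model.
// This is NOT the Neo4j or Bio4j API; all class names here are hypothetical.
final class PropertyGraphSketch {

    static final class Node {
        final String type;
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();

        Node(String type) {
            this.type = type;
        }
    }

    static final class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Creates a directed relationship and registers it on its start node.
    static Relationship connect(Node start, String type, Node end) {
        Relationship relationship = new Relationship(type, start, end);
        start.outgoing.add(relationship);
        return relationship;
    }

    // Follows outgoing relationships of the given type and returns their end nodes.
    static List<Node> follow(Node start, String relationshipType) {
        List<Node> ends = new ArrayList<>();
        for (Relationship relationship : start.outgoing) {
            if (relationship.type.equals(relationshipType)) {
                ends.add(relationship.end);
            }
        }
        return ends;
    }

    public static void main(String[] args) {
        Node protein = new Node("Protein");
        protein.properties.put("accession", "P12345");
        Node goTerm = new Node("GO term");
        goTerm.properties.put("id", "GO:0000000"); // placeholder identifier

        connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);
        for (Node annotated : follow(protein, "PROTEIN_GO_ANNOTATION")) {
            System.out.println(annotated.type + " " + annotated.properties);
        }
    }
}
```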
  19. We have developed a tool intended to serve both as a reference manual and as a first contact with the Bio4j domain model: Bio4jExplorer. Bio4jExplorer allows you to:
       - Navigate through all nodes and relationships
       - Access the javadocs of any node or relationship
       - Graphically explore the neighborhood of a node/relationship
       - Look up the indexes that may serve as an entry point for a node
       - Check incoming/outgoing relationships of a specific node
       - Check start/end nodes of a specific relationship
  20. Entry points and indexing. There are two kinds of entry points for the graph:
       - Auxiliary relationships going from the reference node, e.g. CELLULAR_COMPONENT (leads to the root of the GO cellular component sub-ontology) and MAIN_DATASET (leads to both main datasets, Swiss-Prot and TrEMBL).
       - Node indexing. There are two types of node indexes: exact (only exact values are considered hits) and fulltext (regular expressions can be used).
  21. Retrieving protein info (Bio4j Java API)

          // Create the manager and the node retriever
          Bio4jManager manager = new Bio4jManager("/mybio4jdb");
          NodeRetriever nR = new NodeRetriever(manager);

          // Look up a protein by its accession
          ProteinNode protein = nR.getProteinNodeByAccession("P12345");

          // Getting more related info...
          List<InterproNode> interpros = protein.getInterpro();
          OrganismNode organism = protein.getOrganism();
          List<GoTermNode> goAnnotations = protein.getGOAnnotations();
          List<ArticleNode> articles = protein.getArticleCitations();
          for (ArticleNode article : articles) {
              System.out.println(article.getPubmedId());
          }

          // Don't forget to close the manager
          manager.shutDown();

  22. Querying Bio4j with Cypher

      Getting a keyword by its ID:

          START k=node:keyword_id_index(keyword_id_index = "KW-0181")
          return k.name, k.id

      Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:

          START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
          MATCH d <-[r:PROTEIN_DATASET]- p,
                circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
          return p.accession, p2.accession, p3.accession

      Check this blog post for more info, and our Bio4j Cypher cheat sheet.
  23. Mining Bio4j data: finding topological patterns in Protein-Protein Interaction networks.
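To make the "circuits of length 3" pattern from the Cypher slide concrete, here is a minimal in-memory sketch of the same idea. It does not query Bio4j: the class `TriangleSketch`, the adjacency map and the placeholder accessions `A`, `B`, `C` are hypothetical, and restricting the start proteins to a given set stands in for the Swiss-Prot condition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory illustration of finding directed simple cycles of length 3.
// The interaction data is made up; this does not use Neo4j or Bio4j.
final class TriangleSketch {

    // Returns every triple (p, p2, p3) of distinct proteins with
    // p -> p2 -> p3 -> p, where p belongs to startProteins.
    static List<List<String>> circuitsOfLength3(Map<String, Set<String>> interactsWith,
                                                Set<String> startProteins) {
        List<List<String>> circuits = new ArrayList<>();
        for (String p : startProteins) {
            for (String p2 : interactsWith.getOrDefault(p, Set.of())) {
                if (p2.equals(p)) {
                    continue; // skip self-interactions
                }
                for (String p3 : interactsWith.getOrDefault(p2, Set.of())) {
                    if (p3.equals(p) || p3.equals(p2)) {
                        continue; // keep the cycle simple
                    }
                    if (interactsWith.getOrDefault(p3, Set.of()).contains(p)) {
                        circuits.add(List.of(p, p2, p3));
                    }
                }
            }
        }
        return circuits;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> interactsWith = new LinkedHashMap<>();
        interactsWith.put("A", new LinkedHashSet<>(List.of("B")));
        interactsWith.put("B", new LinkedHashSet<>(List.of("C", "D")));
        interactsWith.put("C", new LinkedHashSet<>(List.of("A")));

        System.out.println(circuitsOfLength3(interactsWith, Set.of("A")));
    }
}
```

As with the Cypher query, a cycle is reported once per qualifying start protein, so the same triangle can appear in several rotations when more than one of its proteins is in the start set.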
  24. A graph traversal language

      Get a protein by its accession number and return its full name:

          gremlin> g.idx(protein_accession_index)[[protein_accession_index:P12345]].full_name
          ==> Aspartate aminotransferase, mitochondrial

      Get proteins (accessions) associated with an InterPro motif (limited to 4 results):

          gremlin> g.idx(interpro_id_index)[[interpro_id_index:IPR023306]].inE(PROTEIN_INTERPRO).outV.accession[0..3]
          ==> E2GK26
          ==> G3PMS4
          ==> G3Q865
          ==> G3PIL8

      Check our Bio4j Gremlin cheat sheet.
  25. REST Server. You can also query/navigate through Bio4j with the Neo4j REST API! The default representation is JSON, both for responses and for data sent with POST/PUT requests.

      Get a protein by its accession number (Q9UR66):

          http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66

      Get outgoing relationships for protein Q9UR66:

          http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
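The index lookup URL from the REST slide can be assembled and requested from Java along these lines. This is a sketch under assumptions: the class `Bio4jRestSketch` and its methods are hypothetical, `http://localhost:7474` stands in for the slide's `server_url`, and the request only succeeds against a running Bio4j REST server. The HTTP calls use the standard `java.net.http` client (Java 11+).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper around the URL pattern shown on the slide.
final class Bio4jRestSketch {

    // Builds /db/data/index/node/{index}/{key}/{value} for the protein accession index.
    // Accessions are assumed to be plain alphanumeric, so no URL encoding is applied.
    static URI proteinByAccession(String baseUrl, String accession) {
        String index = "protein_accession_index";
        return URI.create(baseUrl + "/db/data/index/node/" + index + "/" + index + "/" + accession);
    }

    // Sends a GET request asking for JSON and returns the raw response body.
    static String fetchJson(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Assumed base URL; replace it with the address of your own server.
        URI uri = proteinByAccession("http://localhost:7474", "Q9UR66");
        System.out.println(uri);
        // fetchJson(uri) would return the JSON representation if a server is reachable.
    }
}
```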
  26. Visualizations (1): REST Server Data Browser. Navigate through Bio4j data in real time!
  27. Visualizations (2): Bio4j GO Tools.
  28. Visualizations (3): Bio4j + Gephi. Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.
  29. Bio4j + Cloud. We use AWS (Amazon Web Services) everywhere we can around Bio4j, which gives us the following benefits:
       - Interoperability and data distribution. Releases are available as public EBS snapshots, so AWS users can create Bio4j DB volumes that are 100% ready and attach them to their instances in just a few seconds.
       - CloudFormation templates: Basic Bio4j DB Instance and Bio4j REST Server Instance.
       - Backup and storage using S3 (Simple Storage Service). We use S3 both for backup (indirectly, through the EBS snapshots) and for storage (directly, storing RefSeq sequences as independent S3 files).
  30. Why would I use Bio4j?
       - Massive access to protein/genome/taxonomy… related information
       - Integration of your own DBs/resources around common information
       - Development of services tailored to your needs, built around Bio4j
       - Network analysis
       - Visualizations
       - Besides many others I cannot think of myself…
      If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)
  31. OK, but why start all this? Were you that bored…?! It all started around our need for massive access to protein GO (Gene Ontology) annotations. At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning:
       - I went crazy 'deciphering' how to extract UniProt protein annotations from the official GO table schema.
       - UniProt and official GO protein annotations were not always consistent.
       - Populating my own DB took really long because of all the joins and subqueries needed to get and store the protein annotations.
      Soon enough we also needed massive access to basic protein information.
  32. These processes had to be automated for BG7, our bacterial genome annotation system specifically designed for NGS data (PLOS ONE 2012, in press). The available UniProt web services were too limited:
       - Slow
       - Limited number of queries
       - Too little information available
      So I downloaded the whole UniProt DB in XML format (Swiss-Prot + TrEMBL) and started to have some fun with it!
  33. We got used to having massive direct access to all this protein-related information… So why not add other resources that we needed quite often in most projects, and which were now becoming a sort of bottleneck compared to those already included in Bio4j? Then we incorporated:
       - Isoform sequences
       - Protein interactions and features
       - UniRef 50, 90 and 100
       - RefSeq
       - NCBI Taxonomy
       - Enzyme Expasy DB
  34. Bio4j + MG7 + 48 BLAST XML files (~1 GB each). Some numbers:
       - 157,639,502 nodes
       - 742,615,705 relationships
       - 632,832,045 properties
       - 148 relationship types
       - 44 node types
      And it works just fine!
  35. MG7 domain model (diagram).
  36. What's MG7? MG7 lets you choose different parameters to set the thresholds for filtering the BLAST hits:
       i. E-value
       ii. Identity and query coverage
      It allows exporting the results of the analysis to different data formats such as:
       - XML
       - CSV
       - GEXF (Graph Exchange XML Format)
      It also provides heat maps and graph visualizations, together with a user-friendly interface, MG7Viewer, that gives access to the alignment responsible for each functional or taxonomic read assignment and displays the frequencies in the taxonomic tree.
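The threshold-based filtering of BLAST hits can be sketched as a simple predicate over hit records. This is not MG7 code: the class `BlastHitFilterSketch`, the `Hit` fields, the example values and the choice to apply all three thresholds together are assumptions made for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of threshold filtering; MG7's real data model may differ.
final class BlastHitFilterSketch {

    static final class Hit {
        final String subjectId;
        final double eValue;
        final double identityPercent;
        final double queryCoveragePercent;

        Hit(String subjectId, double eValue, double identityPercent, double queryCoveragePercent) {
            this.subjectId = subjectId;
            this.eValue = eValue;
            this.identityPercent = identityPercent;
            this.queryCoveragePercent = queryCoveragePercent;
        }
    }

    // Keeps hits whose E-value is at most maxEValue and whose identity and
    // query coverage are at least the given minimum percentages.
    static List<Hit> filter(List<Hit> hits, double maxEValue,
                            double minIdentityPercent, double minQueryCoveragePercent) {
        return hits.stream()
                .filter(hit -> hit.eValue <= maxEValue)
                .filter(hit -> hit.identityPercent >= minIdentityPercent)
                .filter(hit -> hit.queryCoveragePercent >= minQueryCoveragePercent)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("hit_1", 1e-30, 98.0, 95.0),
                new Hit("hit_2", 1e-3, 99.0, 99.0),   // E-value too high
                new Hit("hit_3", 1e-40, 60.0, 90.0)); // identity too low

        for (Hit hit : filter(hits, 1e-10, 90.0, 80.0)) {
            System.out.println(hit.subjectId);
        }
    }
}
```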
  37. Heat-map Viz.
  38. Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j.
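The lowest-common-ancestor idea can be sketched on a plain child-to-parent map. This is an illustration, not the Bio4j implementation: the class `LowestCommonAncestorSketch` is hypothetical, the taxonomy is a heavily simplified toy lineage with intermediate ranks omitted, and the input is assumed to be a tree (no cycles).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy lowest-common-ancestor computation over a child -> parent map.
final class LowestCommonAncestorSketch {

    // Path from a node up to the root, with the node itself first.
    static List<String> pathToRoot(Map<String, String> parentOf, String node) {
        List<String> path = new ArrayList<>();
        for (String current = node; current != null; current = parentOf.get(current)) {
            path.add(current);
        }
        return path;
    }

    // Intersects the root paths of all nodes; the first surviving entry of
    // the first node's path is the deepest ancestor shared by every node.
    static String lowestCommonAncestor(Map<String, String> parentOf, List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("At least one node is required");
        }
        List<String> candidates = pathToRoot(parentOf, nodes.get(0));
        for (String node : nodes.subList(1, nodes.size())) {
            Set<String> ancestors = new HashSet<>(pathToRoot(parentOf, node));
            candidates.retainAll(ancestors);
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("Nodes do not share a common root");
        }
        return candidates.get(0);
    }

    public static void main(String[] args) {
        // Simplified toy lineage; the root has no entry in the map.
        Map<String, String> parentOf = Map.of(
                "Escherichia", "Proteobacteria",
                "Salmonella", "Proteobacteria",
                "Bacillus", "Firmicutes",
                "Proteobacteria", "Bacteria",
                "Firmicutes", "Bacteria",
                "Bacteria", "root");

        System.out.println(lowestCommonAncestor(parentOf, List.of("Escherichia", "Salmonella")));
        System.out.println(lowestCommonAncestor(parentOf, List.of("Escherichia", "Salmonella", "Bacillus")));
    }
}
```

Because a node counts as its own ancestor here, the result for a set containing both a taxon and one of its descendants is the taxon itself.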
  39. Future directions:
       - Improvements in modules
       - Integration of even more massive data
       - Application to cancer genomics
       - Gene flux tool (a new tool for bacterial comparative genomics)
       - Pathways tool: data from MetaCyc is going to be included in Bio4j. This data would make it possible to dissect the metabolic pathways in which a genome element, organism or community (metagenomic samples) is involved.
       - Data visualization, network analysis and much more…
  40. Community. Bio4j has a fast-growing internet presence:
       - Twitter: check @bio4j for updates
       - Blog: go to http://blog.bio4j.com
       - Mailing list: ask any question you may have on our list
       - LinkedIn: check the Bio4j group
       - GitHub issues: don't be shy! Open a new issue if you think something's going wrong.
  41. That's it! Thanks for your time ;)