Traditional microbial genome sequencing relies on clonal cultures, but the new era of genomics faces a new challenge: metagenomic analysis. Within the next few years, metagenomics will probably be used in clinical diagnostic settings. It has the potential to revolutionize pathogen detection in public health laboratories by allowing the simultaneous detection of all microorganisms in a clinical sample. For viruses, an unbiased high-throughput sequencing approach is useful for directly detecting pathogenic viruses without prior genetic information. The use of metagenomics for virus discovery in clinical samples has opened new opportunities for understanding the aetiology of unexplained illness. For bacteria, it is worth remembering that only a small fraction of the phylogenetic diversity of Bacteria and Archaea is represented by cultivated organisms. Hence, metagenomics will probably serve to identify new pathogens and new infections caused by microbial consortia. In chronic infections, metagenomics will provide information about the relevance of biofilms and other bacterial organizations that may be important in such infections. As an example, metagenomics applied to Mycobacterium infections has revealed multiple, previously undetected strains in the same patient. Microbiome analysis has been one of the most important applications of metagenomics.
Two major strategies have been applied in recent years for bacterial metagenomics: 16S and shotgun metagenomics. 16S metagenomics reports microbial diversity and the relative abundance of species and taxa. Shotgun metagenomics is a much more data-intensive approach that can describe the functional profile of the genes present in the sample and can even yield assembled genomes if the sample is not very complex.
Metagenomics has brought new challenges to bioinformatics. Cloud computing can address the problem of massive data analysis by providing scalable, real-time, on-demand computing for metagenomics data analysis. However, cloud computing infrastructure is not easy to manage, and publicly available software solutions are needed to extend the use of the cloud to the analysis of huge metagenomics data sets.
MG7 is a new system for the analysis of metagenomics reads. It uses cloud computing to parallelize the BLAST similarity searches on which the inference of function and the assignment of taxonomic origin are based. A distinctive feature of MG7 is its use of a non-relational database: MG7 uses a graph database to store the results of the analysis and to facilitate querying and access to the data, organized in the hierarchical structure of the taxonomy tree. MG7 is an open source project licensed under the AGPLv3 license.
Metagenomics and cloud computing, London, January 2014, Era7 Bioinformatics
1. A New Era in Diagnostic Microbiology Pathogen Genomics. Whole Genome Sequencing
15 January 2014. The Royal College of Pathologists.
A New Cloud Computing System for Massive
Analysis of Reads from Metagenomics Samples
http://ohnosequences.com
www.era7bioinformatics.com
2. A New Cloud Computing System for Massive Analysis of Reads from Metagenomics Samples
- A bit of context:
• What is Era7
• What is Oh no sequences! Research group
• Research lines / Research projects
- Clonal cultures versus Metagenomics
- Microbiome
- Microbiome in health and disease
- Metagenomics in a clinical sample
- 16S and shotgun metagenomics
- Metagenomics for detection of viruses
- Metagenomics for detection of bacteria
- The metagenomics bioinformatics challenge:
• High computational cost
• Binning for reducing computation
• Reducing reference database
- MG7
• Cloud computing
• MG7 algorithms and pipeline
• Lowest Common Ancestor assignment
• MG7 uses Graph databases
• MG7 uses NCBI taxonomy tree
MG7 for metagenomics analysis
3. A bit of context
4. What is Era7 Bioinformatics
5.
• Research driven SME
• Open Source
• Cloud Computing
• Next Generation Sequencing
6.
• Bacterial Genomics projects
• Comparative Genomics
• Metagenomics
• Microbiome
• RNA-seq (and Dual RNA-seq)
• Cancer Genomics
• Big Data management and integration
7. What is Era7 Oh no sequences!
8. A New Cloud Computing System for Massive Analysis of Reads from Metagenomics Samples
Research Lines:
• Algorithms for assembly
• Methods for bacterial genome annotation
• New Cloud Computing Architectures
• Graph Databases for Biological data
• Comparative genomics and bacterial evolution
• Genome Plasticity
• Big Data integration and visualization
• Host Immune System and infection
Software Research Projects:
• BG7
• Bio4j
• Nextmicro
• Statika
• Nispero
• MG7
(All of them are Open Source AGPLv3 projects)
MG7 for metagenomics analysis
9. Traditional microbial genome sequencing relies on clonal cultures, but the new era of genomics faces a new challenge: metagenomic analysis
10. Microbiome analysis is made possible by metagenomics approaches.
• Health and Disease
• Therapeutic Interventions
• Transplant
• Immune system
11. Microbiome in Health and Disease
• Inflammatory Bowel Disease
• Diabetes
• Obesity
• Cardiovascular Disease
• Colon Cancer
12. Modifying the Microbiome
• Prebiotics
• Probiotics
• Microbiome Transplant (Clostridium difficile)
13. For bacteria, it is worth remembering that only a small fraction of the phylogenetic diversity of Bacteria and Archaea is represented by cultivated organisms
14. Metagenomics has the potential to revolutionize pathogen detection in public health laboratories by allowing the simultaneous detection of all microorganisms in a clinical sample
15. Metagenomic analysis after PCR amplification of different gene regions
Shotgun Metagenomics
16. Metagenomic analysis after PCR amplification of different gene regions:
• 16S rRNA
• Gyrase
• Ribosomal proteins
• Elongation Factors
• RNA Polymerase
• ……….
16S metagenomics tells us about microbial diversity and the relative abundance of species and taxa
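To make the idea of relative abundance concrete, here is a minimal Python sketch. It assumes every read has already been assigned a taxon label; the function name `relative_abundance` and the example labels are illustrative and do not come from the slides.

```python
from collections import Counter


def relative_abundance(assignments):
    """Return the fraction of reads assigned to each taxon.

    `assignments` is an iterable with one taxon label per read
    (hypothetical input format chosen for this sketch).
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


# Toy example: four reads assigned to three genera.
example = ["Bacteroides", "Bacteroides", "Prevotella", "Escherichia"]
abundances = relative_abundance(example)
```

In this toy example half of the reads belong to Bacteroides. Real 16S abundance estimates would also need to consider factors such as 16S copy number, which this sketch ignores.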
17. Shotgun Metagenomics
Shotgun metagenomics is a much more data-intensive approach that can describe the functional profile of the genes present in the sample and can even yield assembled genomes if the sample is not very complex
18. Technology
• 454 in the past
• Illumina today (approaches based on overlapping paired reads)
• Preprocessing steps are very important
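One common preprocessing step with overlapping paired reads is to merge each pair into a single longer sequence. The following Python sketch shows the idea under strong simplifying assumptions: the overlap must match exactly and quality scores are ignored, whereas real merging tools tolerate mismatches. The function names and the `min_overlap` default are hypothetical and are not taken from MG7.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair whose ends overlap exactly.

    The reverse read is first reverse-complemented so that both reads
    are on the same strand. The longest suffix of `forward` that equals
    a prefix of the reverse-complemented read is used as the overlap.
    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` bases is found.
    """
    rc = reverse_complement(reverse)
    for overlap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None


# Toy example: a 25-base fragment sequenced from both ends with 18-base reads.
fragment = "ACGTTGCAAGGCTTAACCGGTTAGC"
forward_read = fragment[:18]
reverse_read = reverse_complement(fragment[7:])
merged = merge_pair(forward_read, reverse_read)
```

Here the two reads overlap by 11 bases, so the merged sequence reconstructs the original fragment.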
19. For viruses:
An unbiased high-throughput sequencing approach is useful for directly detecting pathogenic viruses without prior genetic information.
The use of metagenomics for virus discovery in clinical samples has opened new opportunities for understanding the aetiology of unexplained illness
20. For Bacteria:
Metagenomics will probably serve to identify new pathogens and new infections caused by microbial consortia.
In chronic infections, metagenomics will provide information about the relevance of biofilms and other bacterial organizations that may be important in such infections.
Microbiome analysis has been one of the most important applications of metagenomics.
21. For Bacteria:
As an example, metagenomics applied to Mycobacterium infections has revealed multiple, previously undetected strains in the same patient
22. The Bioinformatics challenge
Metagenomics has a high computational cost
1. One approach is to reduce the need for computation
2. The other is to be more efficient
23. The Bioinformatics challenge
Metagenomics has a high computational cost
1. Reducing the computation
• Binning (clustering) the reads (16S and shotgun). Operational Taxonomic Units (OTUs) in 16S
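Binning reduces computation because only one representative per cluster needs to be compared with the reference database. The Python sketch below illustrates a greedy clustering of reads into OTUs. It is a toy under stated assumptions: identity is computed position by position on equal-length sequences rather than from an alignment, and the 97% default reflects a conventional 16S cutoff rather than anything specified in the slides. The names `identity` and `greedy_otus` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences.

    Sequences of different length are treated as unrelated (0.0); a real
    pipeline would use an alignment instead of this simplification.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Greedily cluster reads into OTUs.

    Each read joins the first OTU whose seed sequence it matches with
    identity >= threshold; otherwise it becomes the seed of a new OTU.
    Returns a list of (seed, members) tuples.
    """
    otus = []
    for read in reads:
        for seed, members in otus:
            if identity(seed, read) >= threshold:
                members.append(read)
                break
        else:
            otus.append((read, [read]))
    return otus


# Toy example with 10-base reads and a relaxed threshold.
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
clusters = greedy_otus(reads, threshold=0.8)
```

With this input the first two reads share one OTU and the third read forms its own, so only two representative sequences would need to be searched.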
24. The Bioinformatics challenge
Metagenomics has a high computational cost
1. Reducing the computation
• Reducing the size of the reference database: it is common to use only the complete bacterial genomes (shotgun)
25. The Bioinformatics challenge
Metagenomics has a high computational cost
2. The other is to be more efficient: MG7
26. The Bioinformatics challenge
Cloud computing can address the problem of massive data analysis by providing scalable, real-time, on-demand computing for metagenomics data analysis.
However, cloud computing infrastructure is not easy to manage, and publicly available software solutions are needed to extend the use of the cloud to the analysis of huge metagenomics data sets.
27. MG7
• Based on Cloud Computing (AWS)
• Parallel computation
• Each read is compared with the complete database:
  • No binning, all the reads
  • All the known sequences (nt database) for shotgun
• NCBI taxonomy
• Graph database for analyzing the assignment results
28. MG7
Based on Cloud Computing (AWS)
• EC2
• S3
• SQS
• SNS
• ……
29. MG7
Based on Cloud Computing (AWS) parallel computation
• A Cloud Master machine creates tasks and sets up queues
• A set (hundreds, possibly thousands) of cloud instances (usually micro EC2 instances) is launched
• After the parallel computation, results are modeled in a graph database. This allows further analysis
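The master and worker pattern described above can be sketched locally in Python. This is only an analogy under explicit assumptions: an in-process `queue.Queue` stands in for a cloud queue such as SQS, threads stand in for EC2 instances, and `worker_fn` is a placeholder for the per-chunk BLAST step. None of these names come from MG7 itself.

```python
import queue
import threading


def run_in_parallel(tasks, worker_fn, n_workers=4):
    """Distribute tasks to workers through a shared queue.

    The "master" fills the task queue; each worker repeatedly takes a
    task, processes it with `worker_fn`, and posts the result. Tasks
    must be hashable because they are used as keys of the result dict.
    """
    task_queue = queue.Queue()
    results = queue.Queue()

    # Master: create the tasks and put them on the queue.
    for task in tasks:
        task_queue.put(task)

    def worker():
        # Worker: consume tasks until the queue is empty.
        while True:
            try:
                task = task_queue.get_nowait()
            except queue.Empty:
                return
            results.put((task, worker_fn(task)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Collect the results once every worker has finished.
    collected = {}
    while not results.empty():
        task, value = results.get()
        collected[task] = value
    return collected


# Toy example: each task is a chunk of read identifiers, and the
# stand-in "analysis" simply counts the reads in the chunk.
chunks = [("read1", "read2"), ("read3",), ("read4", "read5", "read6")]
chunk_sizes = run_in_parallel(chunks, len, n_workers=2)
```

Because the workers only share the queues, adding more of them does not change the code, which is the property that makes this pattern attractive for scaling out.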
30. https://github.com/pablopareja/MG7/wiki
31. Data Model for the Graph Database (Neo4j)
https://github.com/pablopareja/MG7/wiki
32. MG7
Based on Cloud Computing (AWS)
• Storage, another challenge. AWS Cloud is very useful:
  • S3 for immediate access
  • Glacier for archiving
33. MG7
Each read is compared with the complete database:
• Direct assignment: Best BLAST Hit. It can be done by:
  • E-value
  • Similarity % and length of the hit
• Lowest Common Ancestor
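A best-hit assignment of this kind can be sketched in Python as follows. The sketch assumes BLAST+ tabular output with the default 12 columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The thresholds and the function name `best_hits` are hypothetical choices for illustration, not MG7 settings.

```python
def best_hits(blast_lines, max_evalue=1e-5, min_identity=90.0, min_length=100):
    """Pick the best hit per read from BLAST tabular lines.

    Hits are first filtered by E-value, percent identity, and alignment
    length; among the remaining hits of each read, the one with the
    highest bit score is kept. Returns {query_id: subject_id}.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue or identity < min_identity or length < min_length:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Toy example: read1 has two acceptable hits, read2 fails the identity filter.
example_lines = [
    "read1\tsubjA\t99.0\t150\t1\t0\t1\t150\t1\t150\t1e-70\t270",
    "read1\tsubjB\t95.0\t150\t7\t0\t1\t150\t1\t150\t1e-60\t240",
    "read2\tsubjC\t80.0\t150\t30\t0\t1\t150\t1\t150\t1e-20\t100",
]
assignments = best_hits(example_lines)
```

In the toy input, read1 is assigned to subjA because that hit has the higher bit score, and read2 is left unassigned.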
34. MG7
Lowest Common Ancestor
First step:
We start from a set of nodes of arbitrary size (4 in this example), which are spread through the taxonomy tree
35. MG7
Lowest Common Ancestor
Second step:
We then fetch the first node from the set and calculate its whole ancestor list up to the main root of the taxonomy.
36. MG7
Lowest Common Ancestor
Third step:
Now that we have the list, we take the second node of the set and check whether it is contained in the list. If not, we keep going up through its ancestors until we find a marked node (one that is in the list). Once it has been found, we get rid of the elements that precede it in the list (if any), so that they are not taken into account in the next iterations of the algorithm.
37. MG7
Lowest Common Ancestor
Fourth step:
We keep going through our node set, and node C also removes some elements of the list…
38. MG7
Lowest Common Ancestor
Fifth step:
Finally we reach the last node of our set, but no element is removed from our list as a result.
39. MG7
Lowest Common Ancestor
Here we have our lowest common ancestor!
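The five steps above can be sketched in Python as follows. This is a minimal illustration, not MG7's actual code: the taxonomy is represented as a plain child-to-parent dictionary, the root is marked with a parent of `None` (an assumption of this sketch), and the small tree is a simplified toy lineage.

```python
def ancestor_path(node, parent):
    """Return the list [node, parent, grandparent, ..., root]."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """Compute the lowest common ancestor of a non-empty list of nodes.

    The ancestor list of the first node is the initial candidate list.
    For every other node we climb towards the root until we reach an
    element of the list, then drop all list elements before it.
    The first remaining element is the lowest common ancestor.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    candidates = ancestor_path(nodes[0], parent)
    for node in nodes[1:]:
        current = node
        while current not in candidates:
            current = parent.get(current)
            if current is None:
                raise ValueError("nodes do not share a common root")
        # Discard the candidates that lie below the node just found.
        candidates = candidates[candidates.index(current):]
    return candidates[0]


# Simplified toy taxonomy (child -> parent); the root has parent None.
toy_parent = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Enterobacteriaceae": "Proteobacteria",
    "Escherichia": "Enterobacteriaceae",
    "Escherichia coli": "Escherichia",
    "Salmonella": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}

lca = lowest_common_ancestor(
    ["Escherichia coli", "Salmonella enterica", "Escherichia"], toy_parent
)
```

For the three hits in the example the result is Enterobacteriaceae: the second node trims the list down to that family, and the third node removes nothing further. Adding a hit to Bacillus subtilis would push the assignment up to Bacteria.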
40. MG7
All the known sequences (nt database) for shotgun
The nt database is the largest nucleotide database.
It contains nucleotide sequences from all organisms.
This is important to detect:
• Unexpected organisms
• Contamination
41. MG7
NCBI taxonomy
This taxonomy is probably the best and most comprehensive
A graph database is very appropriate to model a taxonomy tree
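One reason a tree-shaped model is convenient for assignment results is that counts can be rolled up along parent links. The Python sketch below shows this kind of query on an in-memory child-to-parent dictionary. It is only a stand-in for a graph database, and the function name `cumulative_counts`, the toy tree, and the read counts are all hypothetical.

```python
from collections import Counter


def cumulative_counts(direct_counts, parent):
    """Add each taxon's direct read count to itself and all its ancestors.

    `direct_counts` maps a taxon to the number of reads assigned directly
    to it; `parent` maps each taxon to its parent (None for the root).
    """
    totals = Counter()
    for taxon, n in direct_counts.items():
        node = taxon
        while node is not None:
            totals[node] += n
            node = parent.get(node)
    return dict(totals)


# Simplified toy taxonomy and direct assignment counts.
toy_tree = {
    "root": None,
    "Enterobacteriaceae": "root",
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella": "Enterobacteriaceae",
}
direct = {"Escherichia coli": 3, "Salmonella": 2}
totals = cumulative_counts(direct, toy_tree)
```

In the toy data, five reads accumulate at Enterobacteriaceae even though none were assigned to it directly, which is the kind of summary a traversal over the taxonomy tree makes straightforward.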
42. Thanks for your attention!
Marina Manrique
Eduardo Pareja-Tobes
Pablo Pareja-Tobes
Raquel Tobes
Eduardo Pareja
epareja@era7.com