Masahito Ohue, Marina Yamasawa, Kazuki Izawa, Yutaka Akiyama: Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU and MEGAN,
In Proceedings of the 19th annual IEEE International Conference on Bioinformatics and Bioengineering (IEEE BIBE 2019), 152-156, 2019. doi: 10.1109/BIBE.2019.00035
1. Parallelized Pipeline
for Whole Genome Shotgun Metagenomics
with GHOSTZ-GPU and MEGAN
(DAY-3) Oct 29, 2019
B4 - Bioinformatics Session 4 (Sequence)
Royal Olympic Hotel, Athens, GREECE
Masahito Ohue¹, Marina Yamasawa¹,², Kazuki Izawa¹, Yutaka Akiyama¹
1. Department of Computer Science, School of Computing,
Tokyo Institute of Technology, JAPAN
2. Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL),
National Institute of Advanced Industrial Science and Technology (AIST), JAPAN
Paper-ID 228
4. Metagenome Analysis
⢠Directly sequencing uncultured microbiomes
obtained from target environment and analyzing the
sequence data
â Finding novel genes from unculturable microorganism
â Elucidating composition of species/genes of environments
[Figure: examples of microbiomes: human body, sea, gut, soil, oral]
5. Home Microbiome Study Hospital Microbiome Project
Earth Microbiome Project Marine Phage Sequencing Project
National Metagenomic Project
6. 16S rRNA Metagenomics vs. WGS Metagenomics
16S rRNA Sequencing
Analyzes DNA from amplicon sequencing of prokaryotic 16S small subunit ribosomal RNA genes.
✓ Provides visuals of taxonomic classification
✓ Low cost
× Cannot search for functional genes

Whole Genome Shotgun (WGS) Sequencing
Analyzes the untargeted ('shotgun') sequencing of all ('meta-') microbial genomes present in a sample.
✓ Provides visuals of taxonomic classification and functional genes
× More costly
12. GHOSTZ Algorithm
[Figure: search flow of BLAST compared with GHOSTZ]
BLAST stages shown: database and query sequences; k-mer (neighborhood words); finite automaton; seed search (k-mer substring matches are found by using a finite automaton); gapless extension; gapped extension; results
GHOSTZ stages shown: database and query sequences; subsequence clustering; hash tables; seed search (distance calculation using cluster representatives); gapless extension; gapped extension; results
Suzuki S, Kakuta M, Ishida T, Akiyama Y. Faster sequence homology searches by clustering subsequences. Bioinformatics, 31(8), 1183–1190, 2015.
13. GHOSTZ/GHOSTZ-GPU Sensitivity
[Figure: homology search accuracy (sensitivity) for a marine sample, ERR315856 (Marine Microbiome, Tara Oceans) against KEGG GENES DB; tools compared include RAPSearch]
Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.
14. GHOSTZ/GHOSTZ-GPU Calculation Speed
[Figure: bar chart of computation time (sec.), axis from 0 to 12,000; bar values in extraction order, tool labels not preserved: 41,236; 2,644; 9,970; 2,794; 1,885; 1,502; 3,717; 1,034]
Dataset: SRR407548 (Soil) + SRS011098 (Oral) + ERR315856 (Marine) against KEGG GENES DB; 1,000,000 randomly selected DNA reads from each dataset.
CPU: 12 CPU threads, Xeon 5670, 2.93 GHz
GPU: Tesla K20X
Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.
15. Summarization Tool
Calculates the relative ratios of OTUs and gene functions from the output of BLAST or GHOSTZ-GPU
* MEGAN itself is a pipeline tool (based on DIAMOND)
Huson DH, et al. PLoS Comput Biol. (2016)
16. WGS Metagenomics Pipeline
⢠MetaWRAP
â Does not handle homology searches
⢠Preferably performs metagenome assembly
â Does not support multi-node parallelization
⢠MEGAN
â Uses DIAMOND
â DIAMOND does not support GPU acceleration
â Thus a high-speed analysis is not possible
⢠MiGAP
â Uses BLAST
â Thus also cannot perform highspeed analysis
15
Uritskiy GV, et al. Microbiome, 6(1), 158, 2018.
Huson DH, et al. PLoS Comput Biol, 12: e1004957, 2016.
Sugawara H, et al. Genome Inform, 2009.
17. WGS Metagenome Analysis Flow
Sequencer output: 20 billion reads per 2 days (Illumina NovaSeq 6000)
e.g. analysis of 100 million reads (150 bp)
(Database: KEGG GENES DB, 1.3 million seqs)
  – BLAST: 2,500 days (on a normal laptop PC)
  – 18 hrs (on a 28-core workstation)
  – 6 hrs (on a 28-core, 4-GPU workstation)
Further speedup is needed!
→ multi-node parallel computing
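As a rough back-of-envelope check on why one workstation is not enough, the figures above can be scaled up to a full sequencer run. This is only a sketch: it assumes analysis time grows linearly with the number of reads, which the slide does not state, and the variable names are illustrative.

```python
# Figures quoted on the slide.
reads_per_run = 20e9         # NovaSeq 6000 output per 2-day run
benchmark_reads = 100e6      # reads in the example analysis
gpu_workstation_hours = 6    # 28 cores and 4 GPUs, for 100 million reads

# Assumption (not from the slide): time scales linearly with read count.
scale = reads_per_run / benchmark_reads
hours_needed = gpu_workstation_hours * scale
days_needed = hours_needed / 24

print(f"{scale:.0f}x more reads -> about {days_needed:.0f} days on one workstation")
```

Under that assumption, the output of one 2-day sequencing run would occupy the GPU workstation for roughly 50 days, which is the gap that multi-node computing is meant to close.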
18. Purpose of This Study
⢠Developing new WGS metagenome analysis system,
GHOSTMEGAN
â Pipeline the sequence homology search and post-process
â Perform distributed computation on parallel computers
⢠Linking GHOSTZ-GPU and MEGAN
â GHOSTZ-GPU is the fastest sequence homology search
tool that supports multi-GPU computation
⢠Performance evaluation
â Evaluate using an actual WGS metagenome dataset
by parallel execution on a multi-node GPU cluster
17
22. (A) Dividing Query
[Diagram: Query (fasta file) → Divide fasta → fasta.1, fasta.2, …, fasta.n (n nodes)]
• The input file (query) for WGS metagenome analysis is a single huge fasta file
• The query file is divided into n files for n compute nodes
• The processing time for dividing the query is extremely small compared with the other steps
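The division step can be sketched as follows. The slides do not say how GHOSTMEGAN splits the query, so the round-robin strategy, the function name `split_fasta` and its parameters are assumptions made for illustration; only the `fasta.1 … fasta.n` naming follows the slide.

```python
def split_fasta(query_path, n, out_prefix):
    """Split a FASTA file into n files named <out_prefix>.1 ... <out_prefix>.n.

    Records are dealt out round-robin, so every record stays whole and the
    chunks differ in size by at most one record.
    """
    paths = [f"{out_prefix}.{i + 1}" for i in range(n)]
    outputs = [open(path, "w") for path in paths]
    try:
        record_index = -1
        with open(query_path) as query:
            for line in query:
                if line.startswith(">"):   # a header line starts a new record
                    record_index += 1
                if record_index >= 0:      # ignore anything before the first header
                    outputs[record_index % n].write(line)
    finally:
        for handle in outputs:
            handle.close()
    return paths

# Hypothetical usage: one chunk per compute node.
# chunk_paths = split_fasta("query.fasta", 8, "fasta")
```

Round-robin balances the number of reads per chunk, not the number of bases; for short reads of similar length the difference should be small.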
23. (B) Sequence Homology Search by GHOSTZ-GPU
[Diagram: fasta.1, fasta.2, …, fasta.n → GHOSTZ-GPU on each of n nodes → tsv.1, tsv.2, …, tsv.n]
• GHOSTZ-GPU is executed on each node for that node's divided query file
  – Thread-parallel computation using all CPU/GPU resources
• The genome DB is stored in the local storage of every node
• The output of GHOSTZ-GPU is a tab-delimited BLAST-format file
  – Results with E-value < 10^-5 are passed to the next step
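The E-value cutoff can be illustrated with a small filter over tab-delimited hits. This sketch assumes the standard 12-column BLAST tabular layout, in which the E-value is the 11th field; the slides do not say whether the cutoff is applied by GHOSTZ-GPU itself or by a separate step, and the sample rows below are made up.

```python
EVALUE_COLUMN = 10      # 0-based index of the E-value (assumed 12-column layout)
EVALUE_CUTOFF = 1e-5

def filter_hits(lines, cutoff=EVALUE_CUTOFF):
    """Yield the fields of each hit whose E-value is below the cutoff."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):   # skip blank and comment lines
            continue
        fields = line.split("\t")
        if float(fields[EVALUE_COLUMN]) < cutoff:
            yield fields

# Made-up example rows: query, subject, %identity, ..., E-value, bit score.
sample = [
    "read1\tgeneA\t90.0\t50\t5\t0\t1\t150\t10\t59\t1e-20\t95.0",
    "read2\tgeneB\t40.0\t30\t18\t0\t1\t90\t5\t34\t0.5\t25.0",
]
kept = list(filter_hits(sample))
print([fields[0] for fields in kept])
```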
24. (C) Analyzing by MEGAN
[Diagram: tsv.1, tsv.2, …, tsv.n → MEGAN blast2rma on each of n nodes → rma.1, rma.2, …, rma.n]
• The MEGAN blast2rma command is run (using CPUs only)
• The computation is performed independently for each read's search result, so the content of the rma files is not affected by dividing the queries
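Why dividing the queries leaves the per-read results unchanged can be shown with a toy stand-in. The function below simply keeps the best-scoring hit per read; it is not MEGAN's algorithm, and the hit data are made up. The point is only that any computation done read by read gives the same answer on the whole input as on its chunks, provided all hits of a read stay in the same chunk.

```python
def best_hit_per_read(hits):
    """Keep the highest-scoring (subject, score) pair for each read."""
    best = {}
    for read_id, subject, score in hits:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (subject, score)
    return best

# Made-up hits: (read, subject, bit score).
hits = [
    ("read1", "geneA", 95.0), ("read1", "geneB", 80.0),
    ("read2", "geneC", 60.0),
    ("read3", "geneA", 70.0), ("read3", "geneD", 72.0),
]

whole = best_hit_per_read(hits)

# Divide by read, as the query division does: hits of one read stay together.
chunk1 = [h for h in hits if h[0] in {"read1", "read3"}]
chunk2 = [h for h in hits if h[0] in {"read2"}]
merged = {**best_hit_per_read(chunk1), **best_hit_per_read(chunk2)}

print(merged == whole)
```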
25. (D) Integrating Results
[Diagram: rma.1, rma.2, …, rma.n → MEGAN compute-comparison → MEGAN extract-biome → Results (rma file)]
• After all MEGAN blast2rma processes have finished, the MEGAN compute-comparison command is run
  – Integrates multiple analysis results into a single file
• Then MEGAN extract-biome is used to summarize the overall results
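Conceptually, integration amounts to adding up the per-chunk assignments and then normalizing. The sketch below shows that idea with plain dictionaries; it does not read rma files or call MEGAN, and the taxon names and counts are made-up illustration values.

```python
from collections import Counter

def merge_counts(per_chunk_counts):
    """Sum read counts per category across all chunks."""
    total = Counter()
    for counts in per_chunk_counts:
        total.update(counts)
    return total

def relative_ratios(counts):
    """Convert absolute counts into fractions of all assigned reads."""
    n = sum(counts.values())
    return {name: value / n for name, value in counts.items()} if n else {}

# Made-up per-node results (reads assigned to each taxon).
node_results = [
    {"Bacteroides": 30, "Prevotella": 10},
    {"Bacteroides": 20, "Streptococcus": 40},
]
total = merge_counts(node_results)
ratios = relative_ratios(total)
print(ratios)
```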
26. GHOSTMEGAN Pipeline on Cluster System
To ensure usability, only one parameter file needs to be edited
GHOSTMEGAN pipeline
31. (1) Overall Pipeline Execution Time
[Figure: overall pipeline execution time versus number of nodes; labeled times: 15 hours, 33 min, 24 min, 20 min]
• The maximum speedup was ~45-fold (on 128 nodes)
• GHOSTZ-GPU was so fast that its calculation time saturated
32. (2) Parallel Efficiency (Scalability)
strong scaling = (speedup by n nodes against 1 node) / n
[Figure: strong scaling versus number of nodes; labeled values: 0.98, 0.93, 0.87, 0.60, 0.35]
• Linear speed improvement was obtained from 1 to 32 nodes (strong scaling = 0.87)
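The strong scaling formula can be checked against the execution times quoted on the overall execution time slide. This is a sketch under an assumption: the slide text ties only the ~45-fold figure to 128 nodes, so the pairing of 33 min and 24 min with 32 and 64 nodes is inferred here, not stated.

```python
def strong_scaling(t1, tn, n):
    """Parallel efficiency: (speedup on n nodes against 1 node) / n."""
    return (t1 / tn) / n

t1_min = 15 * 60   # 15 hours on 1 node, in minutes

# Assumption: node counts for 33 min and 24 min are inferred, not from the slide text.
timings_min = {32: 33, 64: 24, 128: 20}

for n, tn in timings_min.items():
    speedup = t1_min / tn
    print(n, round(speedup, 1), round(strong_scaling(t1_min, tn, n), 2))
```

With these rounded times the efficiencies come out at roughly 0.85, 0.59 and 0.35, close to the reported 0.87, 0.60 and 0.35; the small differences are what rounding the times to whole minutes would produce.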
33. Summary of the Results
⢠MEGAN scaling was good
⢠GHOSTZ-GPU scaling decreased at n > 32
â The query data was small
â Expect high efficiency for larger queries
⢠This time it was difficult because n = 1 had to be measured
⢠Strong scaling against n = 8 can be measured for the larger
query, for example
⢠MEGAN without GPU-implementation has room for
acceleration
â In order to cope with the increase queries, it is also
necessary to speed up by the GPUs other than the
homology search
32
34. Homology Search Results
We found only two reads with different homology search results out of 1 million reads in the parallel computing of GHOSTMEGAN for the dataset.

read a
  compute on 2 nodes: XP_025968818.1 LOW QUALITY PROTEIN: tigger transposable element-derived protein 1-like [Dromaius novaehollandiae]
  others: XP_019376199.1 PREDICTED: tigger transposable element-derived protein 1-like, partial [Gavialis gangeticus]
  – The “tigger transposable element-derived protein 1-like” gene is widely conserved
  – The result did not affect the WGS metagenome analysis

read b
  compute on 1 node: BAD18412.1 unnamed protein product [Homo sapiens]
  others: EHH57573.1 hypothetical protein EGM_07242, partial [Macaca fascicularis]
  – Both were annotated as function-unknown genes
  – The result also did not affect the metagenome analysis at this time
36. Conclusion
⢠GHOSTMEGAN pipeline was developed and evaluated
to achieve large-scale metagenomic analysis
â Homology search and other process were parallelized
â Executed on the TSUBAME 3.0 supercomputer with multiple GPUs
⢠GHOSTMEGAN achieved parallel computing on multiple
compute nodes
â Obtained linear speedup to 32 nodes
â 45-times faster calculation on 128 nodes
⢠GPU-accelerated MEGAN or other tools will be crucial
â GHOSTZ-GPU was significantly accelerated on multiple GPUs
â To prepare for further increases in data size in the future
35