Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU and MEGAN

(DAY-3) Oct 29, 2019
B4 - Bioinformatics Session 4 (Sequence)
Royal Olympic Hotel, Athens, GREECE
Masahito Ohue (1), Marina Yamasawa (1,2), Kazuki Izawa (1), Yutaka Akiyama (1)
1. Department of Computer Science, School of Computing,
Tokyo Institute of Technology, JAPAN
2. Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL),
National Institute of Advanced Industrial Science and Technology (AIST), JAPAN
Paper-ID 228

Agenda
• Introduction
– Metagenome
– 16S rRNA vs. whole genome shotgun (WGS) metagenomics
– Homology search, GHOSTZ-GPU
– WGS metagenome workflow
• GHOSTMEGAN Pipeline
• Computational Experiments
• Results and Discussion
• Conclusion

Metagenome Analysis
• Directly sequencing uncultured microbiomes
obtained from target environment and analyzing the
sequence data
– Finding novel genes from unculturable microorganism
– Elucidating composition of species/genes of environments
Examples of microbiome: human body, sea, gut, soil, oral

• Home Microbiome Study
• Hospital Microbiome Project
• Earth Microbiome Project
• Marine Phage Sequencing Project
• National Metagenomic Project

16S rRNA Metagenomics vs. WGS Metagenomics
16S rRNA Sequencing
Analyzes DNA from amplicon sequencing of prokaryotic 16S small subunit ribosomal RNA genes.
✓ Provides visuals of taxonomic classification
✓ Low cost
× Cannot search for functional genes
Whole Genome Shotgun (WGS) Sequencing
Analyzes the untargeted ('shotgun') sequencing of all ('meta-') microbial genomes present in a sample.
✓ Provides visuals of taxonomic classification and functional genes
× More costly

[Figure: (16S) Taxonomic Composition (Periodontal diseases); stacked bars for samples H1–H10 and P1–P6, vertical axis 0–100%]
(Izawa K, et al. unpublished work)

[Figure: (WGS) Functional Gene Category Composition; stacked bars for samples H1–H10 and P1–P6, vertical axis 0–100%]
(Izawa K, et al. unpublished work)

16S Analysis Workflow
(example) USEARCH mapping + QIIME summarization
Baichoo S, et al. BMC Bioinformatics, 19(1):457, 2018.

WGS Metagenome Analysis Flow
(example) homology search + summarization
[Figure: query reads (AATCG, GCACA) are searched against a database of sequences (labels: Escherichia coli, Daphnia pulex; ATGCGAAATCGCTA…, CGGCTCAGCGATCG…)]
Smith-Waterman? BLAST? Too slow!

Rough Comparison of Homology Search Tools
Tool (reference)                          Sensitivity    Speed ratio                                     GPU
BLAST (Altschul 1990)                     ✓ (best)       1                                               △
BLAT (Kent 2002)                          ×              50                                              ×
RAPSearch ver. 2.12 (Ye 2011, Zhao 2012)  ✓; △ (fast)    100; 1,600 (fast)                               ×
DIAMOND ver. 0.7.9 (Buchfink 2015)        △; × (fast)    1,000; 3,000 (fast)                             ×
GHOSTZ (Suzuki 2015)                      ✓              400                                             ×
GHOSTZ-GPU (Suzuki 2016)                  ✓              1,500 (1 GPU); 2,000 (2 GPUs); 2,500 (3 GPUs)   ✓

GHOSTZ Algorithm
BLAST: query sequences → K-mer (neighborhood words) → finite automaton → seed search (against the database) → gapless extension → gapped extension → results
– Search K-mer substring match by using finite automaton
GHOSTZ: database → subsequence clustering → hash table; query sequences → hash table → seed search → gapless extension → gapped extension → results
– Distance calculation using cluster representatives
Suzuki S, Kakuta M, Ishida T, Akiyama Y. Faster sequence homology searches by clustering subsequences. Bioinformatics, 31(8), 1183–1190, 2015.
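To make the "hash table" and "seed search" boxes more concrete, here is a deliberately simplified sketch of exact k-mer seeding with a hash table. It is not the GHOSTZ algorithm: subsequence clustering, the distance calculation on cluster representatives, and both extension stages are omitted. The function names, the choice k = 3, and the use of the toy nucleotide strings from the earlier analysis-flow slide are illustrative assumptions.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map every k-mer in the database to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(query: str, index, k: int) -> list[tuple[int, str, int]]:
    """Return (query offset, sequence id, database offset) for every exact k-mer hit."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        # .get() avoids creating empty entries for k-mers absent from the database
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((qpos, seq_id, dpos))
    return seeds


if __name__ == "__main__":
    # Toy database; real searches use protein databases such as KEGG GENES or NCBI nr.
    db = {"seq1": "ATGCGAAATCGCTA", "seq2": "CGGCTCAGCGATCG"}
    index = build_kmer_index(db, k=3)
    for seed in find_seeds("AATCG", index, k=3):
        print(seed)
```

Each seed would then be handed to the gapless and gapped extension stages, which decide whether the hit is reported.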

GHOSTZ/GHOSTZ-GPU Sensitivity
[Figure: homology search accuracy (sensitivity) for a marine sample, ERR315856 (Marine Microbiome, Tara Oceans) against KEGG GENES DB; plot label: RAPSearch]
Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.

GHOSTZ/GHOSTZ-GPU Calculation Speed
[Figure: bar chart of computation time (sec.), axis 0–12,000; bar labels as extracted: 41,236; 2,644; 9,970; 2,794; 1,885; 1,502; 3,717; 1,034]
Queries: SRR407548 (Soil) + SRS011098 (Oral) + ERR315856 (Marine) against KEGG GENES DB
1,000,000 randomly selected DNA reads from each dataset
CPU: 12 CPU threads, Xeon5670, 2.93 GHz
GPU: Tesla K20X
Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.

Summarization Tool
Calculates the relative ratios of OTUs and gene functions from the output of BLAST or GHOSTZ-GPU
* MEGAN itself is a pipeline tool (based on DIAMOND)
Huson DH, et al. PLoS Comput Biol. (2016)

WGS Metagenomics Pipeline
• MetaWRAP
– Does not handle homology searches
– Preferably performs metagenome assembly
– Does not support multi-node parallelization
• MEGAN
– Uses DIAMOND
– DIAMOND does not support GPU acceleration
– Thus a high-speed analysis is not possible
• MiGAP
– Uses BLAST
– Thus also cannot perform high-speed analysis
Uritskiy GV, et al. Microbiome, 6(1), 158, 2018.
Huson DH, et al. PLoS Comput Biol, 12: e1004957, 2016.
Sugawara H, et al. Genome Inform, 2009.

WGS Metagenome Analysis Flow
Output: 20 billion reads per 2 days (Illumina NovaSeq 6000)
e.g. analysis of 100 million reads (150 bp) (Database: KEGG GENES DB, 1.3 million seqs)
– BLAST: 2,500 days (on a normal laptop PC)
– 18 hrs (on a 28-core workstation)
– 6 hrs (on a 28-core, 4-GPU workstation)
Further speedup is needed!
▶ multi-node parallel computing
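The gap behind "Further speedup is needed!" can be made concrete with a little arithmetic. The sketch below uses only the figures quoted on this slide and assumes, as a simplification, that search time grows linearly with the number of reads; the function and constant names are illustrative.

```python
# Back-of-the-envelope estimate using the figures quoted on the slide.
# Assumption: search time scales linearly with the number of reads.

SEQUENCER_READS = 20e9       # reads produced per sequencing run
SEQUENCER_RUN_DAYS = 2       # duration of one sequencing run (days)
BENCHMARK_READS = 100e6      # reads in the benchmark analysis


def days_for_one_run(hours_per_benchmark: float) -> float:
    """Days needed to search one full sequencer run on a single machine."""
    batches = SEQUENCER_READS / BENCHMARK_READS   # number of benchmark-sized batches
    return batches * hours_per_benchmark / 24.0


def machines_to_keep_pace(hours_per_benchmark: float) -> float:
    """Machines needed so that the analysis finishes within one sequencing run."""
    return days_for_one_run(hours_per_benchmark) / SEQUENCER_RUN_DAYS


if __name__ == "__main__":
    for label, hours in [("28 cores + 4 GPUs", 6.0), ("28 cores", 18.0)]:
        print(f"{label}: {days_for_one_run(hours):.0f} days per run, "
              f"{machines_to_keep_pace(hours):.0f} machines to keep pace")
```

Under these assumptions, a single 4-GPU workstation would need about 50 days to search the output of one 2-day sequencing run, so roughly 25 such machines would be needed just to keep pace.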

Purpose of This Study
• Developing a new WGS metagenome analysis system, GHOSTMEGAN
– Pipelines the sequence homology search and post-processing
– Performs distributed computation on parallel computers
• Linking GHOSTZ-GPU and MEGAN
– GHOSTZ-GPU is the fastest sequence homology search
tool that supports multi-GPU computation
• Performance evaluation
– Evaluate using an actual WGS metagenome dataset
by parallel execution on a multi-node GPU cluster

Overview
• Simple workflow
• Focused on the cluster machine
(multi-GPUs x multi-nodes supercomputer)

GHOSTMEGAN Pipeline on Cluster System
(A) Dividing query: Query (fasta file) → Divide fasta → fasta.1, fasta.2, …, fasta.n
(B) Sequence homology search by GHOSTZ-GPU: fasta.1, fasta.2, …, fasta.n → GHOSTZ-GPU → tsv.1, tsv.2, …, tsv.n
(C) Analyzing by MEGAN: tsv.1, tsv.2, …, tsv.n → MEGAN → rma.1, rma.2, …, rma.n
(D) Integrating results: rma.1, rma.2, …, rma.n → Concat rma → Results (rma file)
Each divided file is processed on a single node in steps (B) and (C).

(A) Dividing Query
Query (fasta file) → Divide fasta → fasta.1, fasta.2, …, fasta.n (n nodes)
• The input file (query) for WGS metagenome analysis is a single huge fasta file
• The query file is divided into n files for n compute nodes
• The processing time for dividing the query is extremely small compared with the other steps
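A minimal sketch of such a division step is shown below. The slides do not say how GHOSTMEGAN assigns reads to files, so the round-robin assignment, the function name split_fasta, and the output naming (prefix.1 … prefix.n, mirroring fasta.1 … fasta.n) are assumptions made for illustration.

```python
from pathlib import Path


def split_fasta(query_path: str, n: int, out_prefix: str) -> list[Path]:
    """Distribute FASTA records round-robin into n files named <out_prefix>.1 .. <out_prefix>.n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    out_paths = [Path(f"{out_prefix}.{i}") for i in range(1, n + 1)]
    handles = [path.open("w") for path in out_paths]
    try:
        record_index = -1
        with open(query_path) as src:
            for line in src:
                if line.startswith(">"):
                    record_index += 1      # a header line starts a new record
                if record_index < 0:
                    continue               # skip anything before the first header
                # all lines of a record go to the same output file
                handles[record_index % n].write(line)
    finally:
        for handle in handles:
            handle.close()
    return out_paths
```

For example, split_fasta("query.fasta", 4, "fasta") would write fasta.1 through fasta.4. The file is streamed line by line, which keeps memory use small even for very large queries.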

(B) Sequence Homology Search by GHOSTZ-GPU
fasta.1, fasta.2, …, fasta.n → GHOSTZ-GPU (n nodes) → tsv.1, tsv.2, …, tsv.n
• GHOSTZ-GPU is executed on each node for its divided query file
– Thread-parallel computation using all CPU/GPU resources
• The genome DB is stored in the local storage of every node
• The output of GHOSTZ-GPU is a tab-delimited BLAST format file
– Results with E-value < 10^-5 are provided to the next step
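The E-value cutoff can be illustrated with a small filter over tab-delimited hits. This is only a sketch: it assumes the standard 12-column BLAST tabular layout, in which the E-value is the 11th field, and the slides do not state at which point in the pipeline the threshold is actually applied. The function and constant names are hypothetical.

```python
from typing import Iterable, Iterator

EVALUE_COLUMN = 10      # 0-based index of the E-value in 12-column BLAST tabular output (assumed layout)
EVALUE_CUTOFF = 1e-5    # threshold quoted on the slide


def filter_hits(lines: Iterable[str], cutoff: float = EVALUE_CUTOFF) -> Iterator[list[str]]:
    """Yield the fields of each hit whose E-value is strictly below the cutoff."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                        # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[EVALUE_COLUMN]) < cutoff:
            yield fields


if __name__ == "__main__":
    # Two made-up hits: one significant, one not.
    example = [
        "read1\tprotA\t90.0\t30\t3\t0\t1\t90\t1\t30\t1e-10\t60.0",
        "read2\tprotB\t50.0\t30\t15\t0\t1\t90\t1\t30\t0.5\t20.0",
    ]
    for hit in filter_hits(example):
        print(hit[0], hit[1], hit[EVALUE_COLUMN])
```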

(C) Analyzing by MEGAN
tsv.1, tsv.2, …, tsv.n → MEGAN blast2rma (n nodes) → rma.1, rma.2, …, rma.n
• The MEGAN blast2rma command is run (using CPUs only)
• The computation is performed independently for each read's search result, so the content of the rma files is not affected by dividing the query

(D) Integrating Results
rma.1, rma.2, …, rma.n → MEGAN compute-comparison → MEGAN extract-biome → Results (rma file)
• After all MEGAN blast2rma processes have finished, the MEGAN compute-comparison command is run
– It integrates multiple analysis results into a single file
• Then MEGAN extract-biome is used to summarize the whole set of results
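Because each read is classified independently, per-file summaries can in principle be combined by simple addition. The sketch below illustrates that idea with plain counters. It is not how MEGAN's compute-comparison or extract-biome are implemented, and the function names and taxon labels are made up for illustration.

```python
from collections import Counter


def summarize(assignments: list[str]) -> Counter:
    """Count how many reads were assigned to each class (for example, a taxon)."""
    return Counter(assignments)


def merge_summaries(summaries: list[Counter]) -> Counter:
    """Combine per-file counts into a single summary by addition."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


if __name__ == "__main__":
    # Hypothetical per-read assignments for one undivided query.
    reads = ["Streptococcus", "Prevotella", "Streptococcus",
             "Veillonella", "Prevotella", "Streptococcus"]
    whole = summarize(reads)
    # The same reads divided into n = 3 parts, summarized separately, then merged.
    parts = [reads[i::3] for i in range(3)]
    merged = merge_summaries([summarize(part) for part in parts])
    print(merged == whole)
```

The merged counts equal the counts of the undivided query regardless of how the reads were distributed, which is the property the pipeline relies on when it divides the query.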

GHOSTMEGAN Pipeline on Cluster System
To ensure usability, only one parameter file needs to be edited

Hardware Specification
TSUBAME 3.0 compute node specification (f_node)
CPU            Intel Xeon E5-2680 v4 (2.4 GHz, 14 cores) × 2
GPU            NVIDIA Tesla P100 NVLink (16 GB) × 4
RAM            256 GiB
Local storage  Intel SSD DC P3500 (2 TB)
Network        Intel Omni-Path 100 Gb/s × 4
Job scheduler  Univa Grid Engine 8.5.4C104 11
• TSUBAME 3.0
– 25th-ranked supercomputer (Top500, 8.1 Petaflops, Jun 2019)
– 15,120 CPU cores
– 2,160 NVIDIA P100 GPUs
We ran GHOSTMEGAN on n nodes in parallel, using n = 1, 2, 4, 8, 16, 32, 64, and 128 as the query division number, and compared the execution times and speedup rates.

Software and Dataset
Homology search: GHOSTZ-GPU ver. 1.1.0
https://github.com/akiyamalab/ghostz-gpu
$ ghostz-gpu aln -d [DB] -b 1 -q d -a 1 -g 3 -i [query]
Post process: MEGAN ver. 6.12.6
http://megan.informatik.uni-tuebingen.de
$ blast2rma --in [GHOSTZ output] --out [MEGAN rma file] --format BlastTab
Dataset:
• Query sequences: human oral WGS metagenome reads
– Duran-Pinedo AE, et al. ISME J, 8(8), 1659–1672, 2014.
– The query was a random sample of 1,000,000 reads (100 bp) from periodontally healthy individual samples (145 MB)
• Database: NCBI nr
– 166,109,435 seqs (101 GB)
– ftp://ftp.ncbi.nih.gov/blast/db/ (accessed August 18, 2018)

(1) Overall Pipeline Execution Time
[Figure: overall pipeline execution time versus number of nodes; labeled values: 15 hours, 33 min, 24 min, 20 min]
• The maximum acceleration was ~45-times (on 128 nodes)
• GHOSTZ-GPU was so fast that its calculation time saturated

(2) Parallel Efficiency (Scalability)
strong scaling = (speedup by n nodes against 1 node) / n
[Figure: strong scaling versus number of nodes; labeled values: 0.98, 0.93, 0.87, 0.60, 0.35]
• Linear speedup was obtained from 1 to 32 nodes (strong scaling = 0.87)
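The strong scaling definition can be checked with a short calculation. The sketch below assumes that the "15 hours" and "20 min" labels on the execution-time slide are the 1-node and 128-node times; that reading is consistent with the reported ~45-times speedup but is not stated explicitly on the slides. The function names are illustrative.

```python
def speedup(t1_minutes: float, tn_minutes: float) -> float:
    """Speedup of an n-node run relative to the 1-node run."""
    return t1_minutes / tn_minutes


def strong_scaling(t1_minutes: float, tn_minutes: float, n: int) -> float:
    """Strong scaling (parallel efficiency): speedup divided by the number of nodes."""
    return speedup(t1_minutes, tn_minutes) / n


if __name__ == "__main__":
    t1 = 15 * 60    # assumed 1-node time: 15 hours, in minutes
    t128 = 20       # assumed 128-node time: 20 minutes
    print(f"speedup on 128 nodes: {speedup(t1, t128):.0f}x")
    print(f"strong scaling on 128 nodes: {strong_scaling(t1, t128, 128):.2f}")
```

With these rounded times the speedup is 45 and the strong scaling is about 0.35, which agrees with the lowest value labeled in the figure.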

Summary of the Results
• MEGAN scaling was good
• GHOSTZ-GPU scaling decreased at n > 32
– The query data was small
– Higher efficiency is expected for larger queries
– A larger query was difficult to use this time because the n = 1 baseline had to be measured
– For a larger query, strong scaling could be measured against n = 8, for example
• MEGAN, which has no GPU implementation, has room for acceleration
– To cope with increasing query sizes, steps other than the homology search also need to be accelerated with GPUs

Homology Search Results
We found only two reads with different homology search results out of 1 million reads in the parallel computing of GHOSTMEGAN for the dataset
read a
– compute on 2 nodes: XP_025968818.1 LOW QUALITY PROTEIN: tigger transposable element-derived protein 1-like [Dromaius novaehollandiae]
– others: XP_019376199.1 PREDICTED: tigger transposable element-derived protein 1-like, partial [Gavialis gangeticus]
✓ "tigger transposable element-derived protein 1-like" gene is widely conserved
✓ The result did not affect the WGS metagenome analysis
read b
– compute on 1 node: BAD18412.1 unnamed protein product [Homo sapiens]
– others: EHH57573.1 hypothetical protein EGM_07242, partial [Macaca fascicularis]
✓ Both were annotated as function-unknown genes
✓ The result also did not affect the metagenome analysis at this time

Conclusion
• GHOSTMEGAN pipeline was developed and evaluated
to achieve large-scale metagenomic analysis
– Homology search and other processes were parallelized
– Executed on the TSUBAME 3.0 supercomputer with multiple GPUs
• GHOSTMEGAN achieved parallel computing on multiple
compute nodes
– Obtained linear speedup up to 32 nodes
– 45-times faster calculation on 128 nodes
• GPU-accelerated MEGAN or other tools will be crucial
– GHOSTZ-GPU was significantly accelerated on multiple GPUs
– To prepare for further increases in data size in the future

Acknowledgments
Funding
Akiyama Lab. Tokyo Tech, JAPAN
