Crunching Huge Phylogenies. A. Stamatakis
1. Crunching Huge Phylogenies:
A Rapid Bootstrap Algorithm and
Massive Parallelism on the IBM
BlueGene
Alexandros Stamatakis
Swiss Federal Institute of Technology Lausanne (EPFL)
School of Computer & Communication Sciences
Laboratory for Computational Biology and Bioinformatics
Lausanne, Switzerland
&
Swiss Institute of Bioinformatics
Alexandros.Stamatakis@epfl.ch
icwww.epfl.ch/~stamatak
2. The Missing Part
Data Assembly Inference ? Tree Analysis
Alexandros Stamatakis, October 2007
6. The Big Hardware Problem
[Figure: growth between 1980 and 2007: CPU Speed 40% p.a., Memory Speed 9% p.a.]
7. ... and why this concerns
Bioinformatics
[Figure: growth between 1980 and 2007: Sequence Data, CPU Speed 40% p.a., Memory Speed 9% p.a.]
8. ... and why this concerns
Bioinformatics
Application of HPC techniques will become much more important
[Figure: growth between 1980 and 2007: Sequence Data, CPU Speed 40% p.a., Memory Speed 9% p.a.]
10. Outline
● Introduction
● Computation of Phylogenies
● Maximum Likelihood
● Web & Grid Services
● Three Steps Towards the Tree of Life
● Parallelism on IBM BlueGene/L
● Rapid Bootstrapping
● A Bootstopping criterion
● Related Projects
● Outlook
11. Phylogenetics
Input: “good” multiple Alignment
Output: unrooted binary tree
Various methods for phylogenetic
inference
Neighbour Joining (fast & simple)
Maximum Parsimony (relatively fast &
simple)
Maximum Likelihood (complex & slow)
Bayesian Methods (complex & slower)
12. Phylogenetics
Input: “good” multiple Alignment
Output: unrooted binary tree
Various methods for phylogenetic
inference
Neighbour Joining (fast & simple)
Maximum Parsimony (relatively fast &
simple)
Maximum Likelihood (complex & slow)
Bayesian Methods (complex & slower)
ML & Bayesian: explicit model choice
13. Phylogenetics
Input: “good” multiple Alignment
Output: unrooted binary tree
Various methods for phylogenetic
inference
Neighbour Joining (fast & simple)
Maximum Parsimony (relatively fast &
simple)
Maximum Likelihood (complex & slow)
Bayesian Methods (complex & slower)
Complex Methods & Models required to reconstruct large & complicated trees!
Focus of this talk is on Maximum Likelihood!
14. Phylogenetics
Input: “good” multiple Alignment
Output: unrooted binary tree
Various methods for phylogenetic
inference
Neighbour Joining (fast & simple)
Maximum Parsimony (relatively fast &
simple)
Maximum Likelihood (complex & slow)
Bayesian Methods (complex & slower)
The real reason for working on ML: ......
15. Challenges for Phyloinformatics
Holy grail: “Tree of Life”
What is a good alignment in a
phylogenetic context?
Simultaneous alignment and tree building
Improve/extend models ... but thereby size
of computable trees decreases!
More HPC awareness
Exploit multi-core architectures
Amount of available data grows at a
higher rate than algorithms are getting
faster
20. The number of trees
explodes!
BANG !
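How fast the number of trees explodes can be made concrete with the standard count of unrooted binary topologies on n labelled taxa, (2n-5)!! = 3 · 5 · ... · (2n-5). The sketch below is only an illustration of that formula; the function name is hypothetical and the formula itself is not stated on the slides.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Number of distinct unrooted binary tree topologies on n labelled taxa.

    Uses the standard count (2n-5)!! = 3 * 5 * ... * (2n-5), valid for n >= 3.
    """
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


# Print the count for a few taxon numbers to see the growth.
for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_binary_trees(n))
```

Even for a few dozen taxa the count is far beyond what exhaustive search could enumerate, which is why heuristic search algorithms are needed.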
22. Maximum Likelihood
[Figure: alignment of Seq1 to Seq4 with length m]
23. Maximum Likelihood
[Figure: alignment of Seq1 to Seq4 with length m; 4x4 substitution model matrix over A, C, G, T]
24. Maximum Likelihood
[Figure: alignment of Seq1 to Seq4 with length m; 4x4 substitution model matrix over A, C, G, T]
Prior probabilities, empirical base frequencies: πA πC πG πT
25. Maximum Likelihood
[Figure: alignment of Seq1 to Seq4 with length m; 4x4 substitution model matrix over A, C, G, T]
Prior probabilities, empirical base frequencies: πA πC πG πT
[Figure: unrooted tree with leaves Seq 1 to Seq 4 and branches b1 to b5]
26. Maximum Likelihood
[Figure: alignment of Seq1 to Seq4 with length m; 4x4 substitution model matrix over A, C, G, T]
Prior probabilities, empirical base frequencies: πA πC πG πT
[Figure: unrooted tree with leaves Seq 1 to Seq 4 and branches b1 to b5]
virtual root: vr
27. Maximum Likelihood
[Figure: alignment of Seq1 to Seq4 with length m; 4x4 substitution model matrix over A, C, G, T]
Prior probabilities, empirical base frequencies: πA πC πG πT
[Figure: unrooted tree with leaves Seq 1 to Seq 4, branches b1 to b5 and virtual root vr; a vector of m entries, each holding P(A) P(C) P(G) P(T)]
28. Maximum Likelihood
[Figure: alignment of Seq1 to Seq4 with length m; 4x4 substitution model matrix over A, C, G, T]
Prior probabilities, empirical base frequencies: πA πC πG πT
[Figure: unrooted tree with leaves Seq 1 to Seq 4, branches b1 to b5 and virtual root vr; a vector of m entries, each holding P(A) P(C) P(G) P(T)]
Lots of floating point operations!
29. Maximum Likelihood
[Figure: alignment of Seq1 to Seq4 with length m; 4x4 substitution model matrix over A, C, G, T]
Prior probabilities, empirical base frequencies: πA πC πG πT
[Figure: unrooted tree with leaves Seq 1 to Seq 4]
optimize branch lengths
30. Maximum Likelihood
[Figure: alignment of Seq1 to Seq4 with length m; 4x4 substitution model matrix over A, C, G, T]
Prior probabilities, empirical base frequencies: πA πC πG πT
[Figure: unrooted tree with leaves Seq 1 to Seq 4]
optimize model parameters
31. Maximum Likelihood
Goal: Obtain topology with maximum likelihood value
Problem I: Number of possible topologies is exponential in n
Problem II: Computation of likelihood function is expensive
Problem III: Probably high score accuracy required
Problem IV: High memory consumption
Solution:
• New Algorithms
• New Models
• High Performance Computing
32. Maximum Likelihood
Goal: Obtain topology with maximum likelihood value
Problem I: Number of possible topologies is exponential in n
Problem II: Computation of likelihood function is expensive
Problem III: Probably high score accuracy required
Problem IV: High memory consumption
Solution:
• New Algorithms
• New Models
• High Performance Computing
RAxML: Randomized Axelerated Maximum Likelihood
33. Web & Grid Services
RAxML Web-Server at San Diego Supercomputing
Center via www.phylo.org (CIPRES project)
Web-Server at Vital-IT unit of Swiss Institute of
Bioinformatics phylobench.vital-it.ch/raxml-bb/
Includes novel search algorithm with 1 order of
magnitude run-time improvement
Since Sept 3, about 700 jobs from 130 IPs
Extension to SwissGrid planned
Novel algorithm with Bootstopping to be
integrated into CIPRES portal soon
RAxML integration into Distributed European
Infrastructure for Supercomputing Applications
www.deisa.org started 10 days ago
Integration into Debian medical distribution
35. RAxML Black Box
Why are Black Boxes
useful?
37. Levels of Parallelism
Embarrassing Parallelism
MPI, CORBA, Grid Technologies
38. Coarse-Grained Parallelism:
MPI Version of RAxML
[Figure: PC cluster with a master process and worker processes (labels B-0 to B-4) connected by an interconnection network]
39. Levels of Parallelism
Embarrassing Parallelism
MPI, CORBA, Grid Technologies
Inference Parallelism
MPI, algorithm-dependent
40. Levels of Parallelism
Embarrassing Parallelism
MPI, CORBA, Grid Technologies
Inference Parallelism
MPI, algorithm-dependent
Loop-Level Parallelism
OpenMP, GPUs,
IBM CELL (Playstation),
IBM BlueGene,
Clusters with fast Interconnect
41. Loop Level Parallelism
[Figure: tree with virtual root and nodes P, Q, R]
P[i] = f(Q[i], R[i])
42. Loop Level Parallelism
[Figure: tree with virtual root and nodes P, Q, R]
This operation uses ≥ 90% of total execution time!
P[i] = f(Q[i], R[i])
43. Loop Level Parallelism
[Figure: tree with virtual root and nodes P, Q, R]
This operation uses ≥ 90% of total execution time!
→ simple fine-grained parallelization
P[i] = f(Q[i], R[i])
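The operation P[i] = f(Q[i], R[i]) can be sketched as follows. This is a minimal illustration, not RAxML code: it assumes the Jukes-Cantor substitution model and equal base frequencies for simplicity, and all function names are hypothetical. The point is that each site i depends only on Q[i] and R[i], so the loop over the m sites can be split across processors.

```python
import math

STATES = "ACGT"


def jc_matrix(t):
    """4x4 Jukes-Cantor transition probabilities for branch length t (an assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]


def tip_vectors(seq):
    """Likelihood vectors for an observed sequence: 1.0 for the observed base, else 0.0."""
    return [[1.0 if s == base else 0.0 for s in STATES] for base in seq]


def combine(Q, R, t_q, t_r):
    """P[i] = f(Q[i], R[i]): every site i is computed independently of all others."""
    Mq, Mr = jc_matrix(t_q), jc_matrix(t_r)
    P = []
    for q, r in zip(Q, R):  # loop over the m sites; this is the loop that gets parallelized
        P.append([
            sum(Mq[s][x] * q[x] for x in range(4)) *
            sum(Mr[s][y] * r[y] for y in range(4))
            for s in range(4)
        ])
    return P


Q = tip_vectors("ACGT")  # hypothetical 4-site sequences
R = tip_vectors("ACGA")
P = combine(Q, R, 0.1, 0.2)

# Log-likelihood at this node, assuming equal base frequencies of 0.25.
log_lh = sum(math.log(sum(0.25 * p[s] for s in range(4))) for p in P)
print(round(log_lh, 4))
```

Because no site reads another site's entries, the site range can be divided into contiguous chunks, one per thread or process, with only the final sum requiring communication.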
47. Loop Level Parallelism
[Figure: tree with virtual root and nodes P, Q, R]
The real reason for assuming independent evolution among sites: ......
50. HPC for ML (Bayesian)
Proof of Concept & Programming
Techniques:
RAxML on a Graphics Processing Unit
RAxML on the IBM CELL & Playstation
Production Level Implementations:
RAxML with OpenMP
RAxML with MPI
RAxML on BlueGene
Multi-Core Architectures
51. HPC for ML (Bayesian)
Proof of Concept & Programming
Techniques:
RAxML on a Graphics Processing Unit
RAxML on the IBM CELL & Playstation
A good excuse to buy one
Production Level Implementations:
RAxML with OpenMP
RAxML with MPI
RAxML on BlueGene
Multi-Core Architectures
52. RAxML-BlueGene
Many slow processors: 1024 in one rack
512 MB or 1GB of main memory per node
But: high performance network
Challenges:
Distribute tree data structure among CPUs
Exploit fast collective communication network
For optimal efficiency: hybrid of loop-level and
embarrassing parallelism with MPI
Test & Production Run Data
With Olaf Bininda-Emonds, Jena: 2,182
mammalian sequences x 51,000 base pairs
With Dan Janies, Ohio State: 270 Human
Haplotype Map sequences x 500,000 base pairs
53. RAxML-BlueGene
To be presented at IEEE/ACM 2007 Supercomputing Conference.
Many slow processors: 1024 in one rack
512 MB or 1GB of main memory per node
But: high performance network
Challenges:
Distribute tree data structure among CPUs
Exploit fast collective communication network
For optimal efficiency: hybrid of loop-level and
embarrassing parallelism with MPI
Test & Production Run Data
With Olaf Bininda-Emonds, Jena: 2,182
mammalian sequences x 51,000 base pairs
With Dan Janies, Ohio State: 270 Human
Haplotype Map sequences x 500,000 base pairs
54. RAxML-BlueGene
Many slow processors: 1024 in one rack
512 MB or 1GB of main memory per node
But: high performance network
Challenges:
Distribute tree data structure among CPUs
Exploit fast collective communication network
For optimal efficiency: hybrid of loop-level and
embarrassing parallelism with MPI
Test & Production Run Data
With Olaf Bininda-Emonds, Jena: 2,182
mammalian sequences x 51,000 base pairs
With Dan Janies, Ohio State: 270 Human
Haplotype Map sequences x 500,000 base pairs
Largest ML analysis to date in terms of memory footprint
61. Confidence Values
Tree without node confidence
values is mostly useless
Problem:
Confidence value calculation is major
computational obstacle
We can compute large trees but not
analyse them: compute ≠ analyse!
Current Slow Methods
Sampling with Bayesian methods
Non-parametric Bootstrapping
62. A Tree with Confidence Values
Joint work with Marc Gottschling, Charite Hospital, Berlin
63. Bootstrapping
Original Alignment
perturbation
compute tree compute tree compute tree
64. Bootstrapping
Original Alignment
This needs to be done 100-1000 times
Embarrassingly Parallel!
perturbation
compute tree compute tree compute tree
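The perturbation step can be sketched as follows, assuming it is the standard non-parametric bootstrap: alignment columns are drawn with replacement, and each replicate is stored as a vector of per-column weights. The function name, the seed and the alignment length are hypothetical and chosen only for illustration.

```python
import random


def bootstrap_weights(m, rng):
    """Draw m alignment columns with replacement; weights[j] counts how often column j was drawn."""
    weights = [0] * m
    for _ in range(m):
        weights[rng.randrange(m)] += 1
    return weights


rng = random.Random(42)  # fixed seed, chosen only to make the sketch repeatable
m = 14                   # hypothetical alignment length
replicates = [bootstrap_weights(m, rng) for _ in range(3)]
for w in replicates:
    # One digit per column (assumes every count stays below 10).
    print("".join(str(x) for x in w))
```

Each weight vector sums to m, and a tree is then computed per replicate. Since replicates do not depend on each other, they can be distributed over workers without communication.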
65. Two Questions
How to compute Bootstraps faster?
How many Bootstrap replicates do we
need?
66. Current Work:
Rapid Bootstrapping Algorithm
Tested on 22 diverse (mammals, bacteria, archaea,
grasses, fishes, plants, viral) real-world DNA/AA
single-/multi-gene datasets containing 125-7,764
sequences
Pearson correlation on best-scoring ML trees between
RBS (Rapid BS) & SBS (Standard BS) support values
0.95-0.99 (except one dataset at 0.91), average 0.97
Weighted topological distance < 6%, average 4%
Program Acceleration: 8-20, average ≈ 15
Acceleration by one order of magnitude
Full ML analysis (100 BS + ML search) of datasets of
up to 5,000 sequences within less than 5 days on
your desktop!
Allows for a sufficiently large number of Bootstrap
replicates
70. Rapid Bootstrap
Compute Starting Tree
11111111111111
01102211111111
10111102220111
11111110112021
71. Rapid Bootstrap
Optimize Model Params & Branch Lengths
11111111111111
01102211111111
10111102220111
11111110112021
72. Rapid Bootstrap
Use Starting Tree & Model Params to compute RELL scores
11111111111111
01102211111111 -110
10111102220111 -105
11111110112021 -100
73. Rapid Bootstrap
Use Starting Tree & Model Params to compute RELL scores
Sort by RELL
11111111111111
01102211111111 -110
10111102220111 -105
11111110112021 -100
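A RELL (resampling estimated log-likelihood) score re-weights the per-site log-likelihoods of a fixed tree by a replicate's column weights, so no re-optimization is needed. The sketch below uses the three weight vectors shown above, but the per-site log-likelihood values are hypothetical, so the scores will not match the -110, -105, -100 on the slide. The ascending sort order is also an assumption, since the slide does not state a direction.

```python
def rell_score(weights, site_log_lh):
    """Weighted sum of per-site log-likelihoods; the tree and model parameters stay fixed."""
    return sum(w * l for w, l in zip(weights, site_log_lh))


# Hypothetical per-site log-likelihoods of the starting tree (not taken from the talk).
site_log_lh = [-7.1, -6.4, -8.0, -7.7, -6.9, -7.3, -8.2,
               -6.6, -7.9, -7.0, -6.8, -7.5, -8.1, -7.2]

# Weight vectors of the three replicates as shown on the slide.
replicates = ["01102211111111", "10111102220111", "11111110112021"]
weights = [[int(c) for c in r] for r in replicates]

# Score every replicate and sort by RELL score (ascending, an arbitrary choice here).
scored = sorted((rell_score(w, site_log_lh), r) for w, r in zip(weights, replicates))
for score, r in scored:
    print(r, round(score, 1))
```

With all weights equal to 1 the score reduces to the log-likelihood of the original alignment on the starting tree.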
88. 7,764 DNA All Bipartitions
89. 775 x 3,838 AA
90. New Opportunities
Assess Impact of Alignment Method
on tree and support values
Test Bootstrap of the Bootstrap
(double Bootstrap) procedures
Devise and empirically verify
Bootstopping criteria
91. Bootstrap of the Bootstrap
140 AA (Efron et al PNAS 1996)
92. Bootstrap of the Bootstrap
3,491 rbcL
93. Bootstopping
Rapid Bootstrapping allows us to assess
Bootstopping criteria as follows
1. Compute a high number of BS replicates (10,000)
2. Devise topology-based bootstopping criterion and
apply it to these 10,000 replicates
3. Compare support values induced by bootstopped
trees (say 300 replicates) with 10,000 replicates
We have 10,000 replicates for 18
datasets containing 125 to 2,554
sequences
94. Bootstopping Criterion
Every 50, 100, 150, ... replicates do a test:
Say we have N BS trees
Do the following 100 times:
Randomly split up this set of N trees into 2
equal sets S1, S2, of size N/2
Compute the bipartition support vectors for
S1 and S2
Compute Pearson correlation of the support
vectors
return average of the 100 Pearson correlations
if average > 0.99 stop
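One round of the test above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each tree is represented as the set of its bipartitions written as string labels, an odd N drops one tree, constant support vectors are treated as perfectly correlated only if identical, and all function names are hypothetical.

```python
import random
from statistics import mean


def support_vector(trees, all_bipartitions):
    """Fraction of trees containing each bipartition, in a fixed order."""
    return [sum(b in t for t in trees) / len(trees) for b in all_bipartitions]


def pearson(x, y):
    """Pearson correlation of two equally long vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # Assumption: constant vectors count as correlated only if identical.
        return 1.0 if x == y else 0.0
    return cov / (vx * vy) ** 0.5


def bootstop(trees, rng, permutations=100, threshold=0.99):
    """Return (stop?, average correlation) for the current set of BS trees."""
    all_bips = sorted(set().union(*trees))
    correlations = []
    for _ in range(permutations):
        shuffled = trees[:]
        rng.shuffle(shuffled)              # random split into two halves S1, S2
        half = len(shuffled) // 2
        s1, s2 = shuffled[:half], shuffled[half:2 * half]
        correlations.append(pearson(support_vector(s1, all_bips),
                                    support_vector(s2, all_bips)))
    avg = mean(correlations)
    return avg > threshold, avg


# Toy input: 50 identical "trees", so both halves always agree.
stable = [{"AB|CDE", "ABC|DE"}] * 50
print(bootstop(stable, random.Random(0)))
```

In a full run this function would be called every 50 replicates on the trees collected so far, and bootstrapping would end the first time it reports a stop.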
95. Result Overview
Bootstopped between 100-400 (avg
213)
Correlation on best tree: Bootstopped
versus 10,000 replicates > 0.99 (avg
0.995)
Correlation of all bipartitions > 0.995
(avg 0.997)
102. Putting the Pieces together
BlueGene: Can handle huge datasets
Use CAT approximation on BlueGene
Further speedup of factor 3.5
Memory footprint reduction factor 4
103. 8,864 Bacteria under GTR+Γ and GTR+CAT
[Figure: log likelihood score under Γ versus execution time; time axis marked at 7 days and 14 days]
104. Putting the Pieces together
BlueGene: Can handle huge datasets
Use CAT approximation on BlueGene
Further speedup of factor 3.5
Memory footprint reduction factor 4
Integrate rapid Bootstrap into BlueGene
version
Additional speedup ≈ 15
Mechanisms available to accelerate
BlueGene version by factor 50-60
Integrate Bootstopping into BlueGene
Conclusion: We will soon be able to
compute a small tree of life with 10,000
organisms and data from multiple genes!
108. Host-Parasite Co-Evolution
[Figure: host tree and parasite tree; co-evolution hypothesis given as a 0/1 adjacency matrix of 6 hosts x 8 parasites; statistical test]
109. What can HPC do for Bioinformatics?
Axelerated Parafit
“Parafit: statistical test of co-evolution”, Pierre
Legendre, Syst. Biol. 2003
AxParafit (Axelerated Parafit)
Statistical test of hypotheses of host-parasite co-
evolution
C porting, optimization, BLAS integration
Speedup up to factor 67
Master-Worker MPI-parallelization
Largest co-phylogenetic study to date conducted
within 8 minutes instead of 4 weeks
Open-Source Code:
http://icwww.epfl.ch/~stamatak/AxParafit.html
SwissGrid-based Web-Server planned
112. The ML Benchmark:
A Current Community Project
Standardized way required to test ML search programs
Web-Server with real-world alignments and performance data
at Swiss Institute of Bioinformatics
Many developers of popular ML programs involved
Stephane Guindon (PHYML) Montpellier
Simon Whelan (Leaphy) Manchester
Bui Quang Minh (IQPNNI) Vienna
Derrick Zwickl (GARLI) Virginia
Thomas Keane (dprML) Cambridge
Byproduct: SPEC-like CPU benchmark for phylogenetics
Follow-up: (planned) ML competition at major conference with
industrial sponsor
113. A Current Problem:
Handling Multi-Gene Alignments
Gene 1 Gene 2
Sequence 1
Sequence 5
Missing Data ≠ Gap Data
118. A Multi-Gene Model
LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)
119. A Multi-Gene Model
Challenge: devise efficient data
structures for this
LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)
120. Why are Individual Branches
per Gene a Challenge?
123. Outlook
Tree of Life
What is a good alignment in a
phylogenetic context?
Simultaneous alignment and tree building
More HPC & memory-aware programming
Multi-core architectures
Models for “gappy” multi-gene alignments
124. Acknowledgements
BlueGene Project
Michael Ott, TUM
Srinivas Aluru, Jaroslaw Zola, Iowa State
Dan Janies, Andrew Johnson, Ohio State
IBM CELL & Playstation
Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech
Christos Antonopoulos, Univ. of Thessaly
Bootstopping
Bernard Moret, Masoud Alipour, EPFL
Olaf Bininda-Emonds, Univ. Jena
RAxML Web-Server
Jacques Rougemont, SIB
Terri Liebowitz, SDSC
AxParafit/AxPcoords
Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen
Datasets for Studies
Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm
(Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)
125. Thank you for your
Attention !
Lake Geneva, Switzerland