Crunching Huge Phylogenies. A. Stamatakis

Crunching Huge Phylogenies:
A Rapid Bootstrap Algorithm and
Massive Parallelism on the IBM
BlueGene
Alexandros Stamatakis

Swiss Federal Institute of Technology Lausanne (EPFL)
School of Computer & Communication Sciences
Laboratory for Computational Biology and Bioinformatics
Lausanne, Switzerland
&
Swiss Institute of Bioinformatics

Alexandros.Stamatakis@epfl.ch
icwww.epfl.ch/~stamatak

The Missing Part

Data Assembly Inference ? Tree Analysis

Alexandros Stamatakis, October 2007

The Missing Part

Data Assembly Tree Analysis


IBM BlueGene/L
supercomputer


Rapid Bootstrapping
Bootstopping Criterion


The Big Hardware Problem

CPU Speed 40% p.a.

Memory Speed 9% p.a.

2007
1980

... and why this concerns
Bioinformatics

CPU Speed 40% p.a.
Sequence Data

2007
1980

... and why this concerns
Bioinformatics

Application of HPC techniques will become much more important

CPU Speed 40% p.a.
Sequence Data

2007
1980

Cache Hierarchy


Outline
● Introduction
● Computation of Phylogenies
● Maximum Likelihood
● Web & Grid Services
● Three Steps Towards the Tree of Life
● Parallelism on IBM BlueGene/L
● Rapid Bootstrapping
● A Bootstopping criterion
● Related Projects
● Outlook


Phylogenetics
Input: “good” multiple Alignment
Output: unrooted binary tree
Various methods for phylogenetic inference
Neighbour Joining (fast & simple)
Maximum Parsimony (relatively fast & simple)
Maximum Likelihood (complex & slow)
Bayesian Methods (complex & slower)



Phylogenetics

ML & Bayesian: explicit model choice





Phylogenetics

Complex Methods & Models required to reconstruct large & complicated trees!

Focus of this talk is on Maximum Likelihood!





Phylogenetics

The real reason for working on ML: ......





Challenges for Phyloinformatics
Holy grail: “Tree of Life”
What is a good alignment in a phylogenetic context?
Simultaneous alignment and tree building
Improve/extend models ... but thereby size of computable trees decreases!
More HPC awareness
Exploit multi-core architectures
Amount of available data grows at a higher rate than algorithms are getting faster

The algorithmic problem


The number of trees


The number of trees
explodes!

BANG !
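
To make the explosion concrete: the number of distinct unrooted binary tree topologies on n labelled taxa is (2n-5)!! = 3 x 5 x ... x (2n-5). This is a standard combinatorial result that the slide does not spell out. A minimal sketch (the function name is illustrative):

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary tree topologies on n >= 3 labelled taxa: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 10, 20, 50):
    print(n, num_unrooted_binary_trees(n))
```

Ten taxa already allow 2,027,025 topologies, and 50 taxa about 2.8 x 10^74, which is why exhaustive search is out of the question and heuristics are needed.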


Maximum Likelihood
Length: m

Seq1
Seq2
Alignment
Seq3
Seq4


Maximum Likelihood
Length: m
Alignment: Seq1, Seq2, Seq3, Seq4
Substitution model: 4 x 4 matrix over A, C, G, T


Maximum Likelihood
Length: m
Alignment: Seq1, Seq2, Seq3, Seq4
Substitution model: 4 x 4 matrix over A, C, G, T
Prior probabilities, empirical base frequencies: πA πC πG πT


Maximum Likelihood
Length: m
Alignment: Seq1, Seq2, Seq3, Seq4
Substitution model: 4 x 4 matrix over A, C, G, T

Seq 3
Seq 1 b3
b1
b5
b2 b4
Seq 2 Seq 4


Maximum Likelihood
Length: m
Alignment: Seq1, Seq2, Seq3, Seq4
Substitution model: 4 x 4 matrix over A, C, G, T

Seq 3
Seq 1 b3
b1
b5
b2 b4
Seq 2 Seq 4

virtual root: vr


Maximum Likelihood
Length: m
Alignment: Seq1, Seq2, Seq3, Seq4
Substitution model: 4 x 4 matrix over A, C, G, T

Seq 3
Seq 1 b3
b1
vr
b5
b2 b4
Seq 2 Seq 4
P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)

m


Maximum Likelihood
Length: m
Alignment: Seq1, Seq2, Seq3, Seq4
Substitution model: 4 x 4 matrix over A, C, G, T

Lots of floating point operations!
Seq 3
Seq 1 b3
b1
vr
b5
b2 b4
Seq 2 Seq 4
P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)

m
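
The per-site computation at the virtual root can be sketched as follows. This is a minimal illustration, not RAxML's implementation. It assumes the Jukes-Cantor model (the slides do not fix a particular substitution model), uniform base frequencies by default, and the topology ((Seq1,Seq2),(Seq3,Seq4)) with the virtual root placed on the inner branch b5. All function names are hypothetical.

```python
import math

STATES = "ACGT"


def jc69_prob(i, j, t):
    """P(state i -> state j) along a branch of length t under Jukes-Cantor (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def tip_vector(base):
    """Conditional likelihood vector of a tip: 1 for the observed base, 0 otherwise."""
    return [1.0 if s == base else 0.0 for s in STATES]


def combine(left, t_left, right, t_right):
    """Conditional likelihood vector of an inner node from its two children."""
    out = []
    for i in range(4):
        l = sum(jc69_prob(i, j, t_left) * left[j] for j in range(4))
        r = sum(jc69_prob(i, j, t_right) * right[j] for j in range(4))
        out.append(l * r)
    return out


def log_likelihood(seqs, b, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Log likelihood of ((Seq1,Seq2),(Seq3,Seq4)); b = (b1, b2, b3, b4, b5)."""
    total = 0.0
    for col in zip(*seqs):  # one iteration per alignment column (m in total)
        p = combine(tip_vector(col[0]), b[0], tip_vector(col[1]), b[1])
        q = combine(tip_vector(col[2]), b[2], tip_vector(col[3]), b[3])
        # Evaluate at the virtual root on b5: sum over the states at both ends
        site = sum(
            freqs[i] * p[i] * sum(jc69_prob(i, j, b[4]) * q[j] for j in range(4))
            for i in range(4)
        )
        total += math.log(site)
    return total


print(log_likelihood(["ACGT", "ACGA", "AGGT", "AGGA"], (0.1, 0.1, 0.1, 0.1, 0.05)))
```

Every column needs a handful of 4 x 4 matrix-vector products per inner node, which is where the floating point work accumulates as the alignment length m and the number of sequences grow.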


Maximum Likelihood
Length: m
Alignment: Seq1, Seq2, Seq3, Seq4
Substitution model: 4 x 4 matrix over A, C, G, T

Seq 3
Seq 1

Seq 2 Seq 4

optimize branch lengths


Maximum Likelihood
Length: m
Alignment: Seq1, Seq2, Seq3, Seq4
Substitution model: 4 x 4 matrix over A, C, G, T

optimize model parameters
Seq 3
Seq 1

Seq 2 Seq 4


Maximum Likelihood
Goal: Obtain topology with maximum likelihood value
Problem I: Number of possible topologies is exponential in n
Problem II: Computation of likelihood function is expensive
Problem III: Probably high score accuracy required
Problem IV: High memory consumption
Solution:
• New Algorithms
• New Models
• High Performance Computing


Maximum Likelihood
Goal: Obtain topology with maximum likelihood value
Problem I: Number of possible topologies is exponential in n
Problem II: Computation of likelihood function is expensive
Problem III: Probably high score accuracy required
Problem IV: High memory consumption
Solution:
• New Algorithms
• New Models
• High Performance Computing

RAxML: Randomized Axelerated Maximum Likelihood


Web & Grid Services
RAxML Web-Server at San Diego Supercomputer Center via www.phylo.org (CIPRES project)
Web-Server at Vital-IT unit of Swiss Institute of Bioinformatics phylobench.vital-it.ch/raxml-bb/
• Includes novel search algorithm with 1 order of magnitude run-time improvement
• Since Sept 3, about 700 jobs from 130 IPs
• Extension to SwissGrid planned
• Novel algorithm with Bootstopping to be integrated into CIPRES portal soon
RAxML integration into Distributed European Infrastructure for Supercomputing Applications www.deisa.org started 10 days ago
Integration into Debian medical distribution



RAxML Black Box


RAxML Black Box

Why are Black Boxes
useful?


Levels of Parallelism
Embarrassing Parallelism
MPI, CORBA, Grid Technologies


Coarse-Grained Parallelism:
MPI Version of RAxML
PC-CLUSTER
Worker Processes

B-2
B-3
B-1
B-4
Interconnection Network
B-0

Master Process

Inference Parallelism
MPI, algorithm-dependent


Inference Parallelism
MPI, algorithm-dependent
Loop-Level Parallelism
OpenMP, GPUs,
IBM CELL (Playstation),
IBM BlueGene,
Clusters with fast Interconnect


Loop Level Parallelism
virtual root

P

Q
R

P[i] = f(Q[i], R[i])


virtual root

This operation uses ≥ 90% of total execution time!
P

Q
R

P[i] = f(Q[i], R[i])


virtual root

This operation uses ≥ 90% of total execution time!
→ simple fine-grained parallelization
P

Q
R

P[i] = f(Q[i], R[i])
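
Because each P[i] depends only on Q[i] and R[i], the index range can be cut into chunks that are processed independently. The sketch below only illustrates that decomposition. The kernel `f` is a placeholder (an element-wise product), not the real likelihood kernel, and Python threads are used for brevity even though they give no real speedup for pure-Python arithmetic. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def f(q, r):
    """Placeholder per-site kernel (assumption): element-wise product of two vectors."""
    return [a * b for a, b in zip(q, r)]


def compute_range(Q, R, start, stop):
    """Sequential loop over the sites start .. stop-1: P[i] = f(Q[i], R[i])."""
    return [f(Q[i], R[i]) for i in range(start, stop)]


def compute_parallel(Q, R, workers=4):
    """Split the site range into contiguous chunks, one per worker."""
    m = len(Q)
    bounds = [(w * m // workers, (w + 1) * m // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: compute_range(Q, R, b[0], b[1]), bounds))
    # Concatenate the chunks in order to recover the full vector P
    return [p for part in parts for p in part]


Q = [[float(i), 1.0, 2.0, 3.0] for i in range(10)]
R = [[2.0, 2.0, 2.0, 2.0] for _ in range(10)]
print(compute_parallel(Q, R, workers=3) == compute_range(Q, R, 0, len(Q)))
```

No communication is needed inside the loop; only reductions over all sites, such as summing per-site log likelihoods, require the workers to exchange data.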


virtual root

P

Q
R


virtual root
The real reason for assuming independent evolution among sites: ......
P

Q
R


Fine-Grained Parallelism:
OpenMP version of RAxML


HPC for ML (Bayesian)
Proof of Concept & Programming Techniques:
• RAxML on a Graphics Processing Unit
• RAxML on the IBM CELL & Playstation
Production Level Implementations:
• RAxML with OpenMP
• RAxML with MPI
• RAxML on BlueGene
• Multi-Core Architectures


HPC for ML (Bayesian)
Proof of Concept & Programming Techniques:
• RAxML on a Graphics Processing Unit
• RAxML on the IBM CELL & Playstation (a good excuse to buy one)
Production Level Implementations:
• RAxML with OpenMP
• RAxML with MPI
• RAxML on BlueGene
• Multi-Core Architectures


RAxML-BlueGene
Many slow processors: 1024 in one rack
512 MB or 1GB of main memory per node
But: high performance network
Challenges:
Distribute tree data structure among CPUs
Exploit fast collective communication network
For optimal efficiency: loop-level + embarrassing parallelism → hybrid parallelism with MPI
Test & Production Run Data
With Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs
With Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs

RAxML-BlueGene

To be presented at IEEE/ACM 2007 Supercomputing Conference.












RAxML-BlueGene

Largest ML analysis to date in terms of memory footprint










Loop-Level Parallelism on
BlueGene


50 Seqs x 23,385 bp


50 Seqs x 23,385 bp

Superlinear Speedup


250 Seqs x 403,581 bp



W W W W

M W
W M

M M
W W

W W
W W


Confidence Values
Tree without node confidence values is mostly useless
Problem:
Confidence value calculation is major computational obstacle
• We can compute large trees but not analyse them: compute ≠ analyse!
Current Slow Methods
Sampling with Bayesian methods
Non-parametric Bootstrapping



A Tree with Confidence Values

Joint work with Marc Gottschling, Charite Hospital, Berlin

Bootstrapping
Original Alignment

perturbation

compute tree compute tree compute tree


Bootstrapping
Original Alignment
This needs to be done
100-1000 times
Embarrassingly
Parallel !
perturbation

compute tree compute tree compute tree
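
The perturbation step of the non-parametric bootstrap draws m alignment columns with replacement from the original m columns. An equivalent and cheaper representation is a vector of per-column weights that sums to m. A minimal sketch, with hypothetical names and a fixed seed chosen only for reproducibility:

```python
import random


def bootstrap_weights(m, rng):
    """Draw m columns with replacement; return how often each column was drawn."""
    weights = [0] * m
    for _ in range(m):
        weights[rng.randrange(m)] += 1
    return weights


m = 14                      # alignment length (illustrative value)
rng = random.Random(12345)  # seeded generator, an assumption for reproducibility
original = [1] * m          # the unperturbed alignment: every column counted once
replicates = [bootstrap_weights(m, rng) for _ in range(3)]

print("".join(map(str, original)))
for w in replicates:
    print("".join(map(str, w)), "sum =", sum(w))
```

Each replicate is independent of the others, so the tree searches on the replicates can run on separate processors without any communication.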


Two Questions
How to compute Bootstraps faster?
How many Bootstrap replicates do we need?


Current Work:
Rapid Bootstrapping Algorithm
Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences
Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values 0.95-0.99 (except one dataset at 0.91), average 0.97
Weighted topological distance < 6%, average 4%
Program Acceleration: 8-20, average ≈ 15
Acceleration by one order of magnitude
Full ML analysis (100BS + ML search) of datasets of up to 5,000 sequences within less than 5 days on your desktop!
Allows for a sufficiently large number of Bootstrap replicates


Quick & Dirty Bootstrap

Modify Algorithm

Computational Experiments


Quick & Dirty Bootstrap

Modify Algorithm

iterate

Computational Experiments


Rapid Bootstrap

11111111111111

01102211111111
10111102220111
11111110112021


Rapid Bootstrap

11111111111111 Compute Starting Tree

01102211111111
10111102220111
11111110112021


Rapid Bootstrap

Optimize Model Params &
11111111111111 Branch Lengths

01102211111111
10111102220111
11111110112021


Rapid Bootstrap
Use Starting Tree &
Model Params to compute
RELL scores
11111111111111

01102211111111 -110
10111102220111 -105
11111110112021 -100


Rapid Bootstrap
Use Starting Tree &
Model Params to compute
RELL scores
11111111111111

01102211111111 -110
10111102220111 -105 Sort by RELL
11111110112021 -100
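
The RELL (resampling estimated log-likelihood) score of a replicate reuses the per-site log likelihoods computed once on the starting tree: it is the weighted sum of those values under the replicate's column weights, so no likelihood recomputation is needed. A minimal sketch with hypothetical names; the per-site values below are made up for illustration and are not taken from the talk:

```python
def rell_score(weights, site_lnl):
    """Weighted sum of per-site log likelihoods for one bootstrap replicate."""
    return sum(w * l for w, l in zip(weights, site_lnl))


def sort_by_rell(replicates, site_lnl):
    """Order replicates from best (highest) to worst RELL score."""
    return sorted(replicates, key=lambda w: rell_score(w, site_lnl), reverse=True)


# Illustrative per-site log likelihoods for a 3-column alignment (invented values)
site_lnl = [-1.0, -2.0, -3.0]
replicates = [[0, 0, 3], [3, 0, 0], [1, 1, 1]]

for w in sort_by_rell(replicates, site_lnl):
    print(w, rell_score(w, site_lnl))
```

Sorting by RELL score determines the order in which the replicates are searched: the best-scoring replicate receives the thorough search (T0) and the following ones start from trees found earlier.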


Rapid Bootstrap

11111111111111

11111110112021 -100 T0: Thorough Search

10111102220111 -105
01102211111111 -110


Rapid Bootstrap

11111111111111


10111102220111 -105 T1: Quick Search on T0
01102211111111 -110


Rapid Bootstrap

11111111111111




Rapid Bootstrap

11111111111111

11111110112021 -100 T0: Thorough Search

sequential dependency is bad for parallelism



Scalability of Rapid
Bootstrap


Scalability of Rapid
Bootstrap

Some datasets
are harder than
others


ML-Scores: Garli, RAxML,
PHYML 715 Sequences


Correlation 125 Taxa: 0.91


Support Value Distribution


Bootstrap Likelihood Values
125 x 19,436

10,000 replicates
only 195 non-trivial
bipartitions


Bootstrap Likelihood Values
125 x 19,436


3,491 rbcL sequences
Rapid versus Standard BS

Correlation:
0.98


7,764 DNA Best Tree


7,764 DNA All Bipartitions


775 x 3,838 AA


New Opportunities
Assess Impact of Alignment Method on tree and support values
Test Bootstrap of the Bootstrap (double Bootstrap) procedures
Devise and empirically verify Bootstopping criteria


Bootstrap of the Bootstrap
140 AA (Efron et al PNAS 1996)


Bootstrap of the Bootstrap
3,491 rbcL


Bootstopping
Rapid Bootstrapping allows us to assess Bootstopping criteria as follows:
1. Compute a high number of BS replicates (10,000)
2. Devise topology-based bootstopping criterion and apply it to these 10,000 replicates
3. Compare support values induced by bootstopped trees (say 300 replicates) with 10,000 replicates
We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences


Bootstopping Criterion
Every 50, 100, 150, ... replicates do a test:
• Say we have N BS trees
• Do the following 100 times:
• Randomly split up this set of N trees into 2 equal sets S1, S2, of size N/2
• Compute the bipartition support vectors for S1 and S2
• Compute Pearson correlation of the support vectors
• return average of the 100 Pearson correlations

if average > 0.99 stop
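
The test above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the results in this talk: each tree is represented simply as a set of bipartition labels, all names are hypothetical, and the handling of constant support vectors (where Pearson correlation is undefined) is a choice made for this sketch.

```python
import random


def support_vector(trees, bipartitions):
    """Fraction of trees that contain each bipartition."""
    return [sum(b in t for t in trees) / len(trees) for b in bipartitions]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        # Undefined for a constant vector; treating identical vectors as
        # perfectly correlated is an assumption of this sketch.
        return 1.0 if x == y else 0.0
    return sxy / (sxx * syy) ** 0.5


def bootstop_test(trees, rng, permutations=100, threshold=0.99):
    """Average correlation over random half splits, and whether to stop."""
    bipartitions = sorted(set().union(*trees))
    half = len(trees) // 2
    total = 0.0
    for _ in range(permutations):
        shuffled = trees[:]
        rng.shuffle(shuffled)
        s1 = support_vector(shuffled[:half], bipartitions)
        s2 = support_vector(shuffled[half:2 * half], bipartitions)
        total += pearson(s1, s2)
    average = total / permutations
    return average, average > threshold


# Toy input: 100 "trees", each a set of bipartition labels (invented data)
trees = [{"a", "b"}] * 50 + [{"a"}] * 50
print(bootstop_test(trees, random.Random(1)))
```

In a real run the function would be called after every 50 additional replicates, and replicate generation would end as soon as the second return value is true.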



Result Overview
Bootstopped between 100-400 (avg 213)
Correlation on best tree: Bootstopped versus 10,000 replicates > 0.99 (avg 0.995)
Correlation of all bipartitions > 0.995 (avg 0.997)


Bootstopping Best 140 AA


Bootstopping Best 404 DNA
(Multi-Gene)


Bootstopping Best 994 DNA


Bootstopping All 994 DNA


Bootstopping Best 1,908
DNA


Bootstopping Best 2,554
DNA


Putting the Pieces together
Blue-Gene: Can handle huge datasets


Use Cat approximation on BlueGene


Further speedup of factor 3.5


Memory footprint reduction factor 4



8,864 Bacteria under GTR+Γ
and GTR+CAT
Log Likelihood
Score under Γ

7 days 14 days

Execution
Time

Putting the Pieces together
Blue-Gene: Can handle huge datasets
Use Cat approximation on BlueGene
Further speedup of factor 3.5
Memory footprint reduction factor 4
Integrate rapid Bootstrap into BlueGene version
Additional speedup ≈ 15
Mechanisms available to accelerate BlueGene version by factor 50-60
Integrate Bootstopping into BlueGene
→ Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!

Host-Parasite Co-Evolution
Parasites (eg Lice)
Hosts (eg Mammals)


Hosts Parasites

Co-Evolution Hypothesis

8 Parasites

Adjacency
6 hosts Matrix 0/1


Hosts Parasites

Co-Evolution Hypothesis

8 Parasites

Adjacency
6 hosts Matrix 0/1

Statistical Test

What can HPC do for Bioinformatics?
Axelerated Parafit
“Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003
AxParafit (Axelerated Parafit)
• Statistical test of hypotheses of host-parasite co-evolution
• C porting, optimization, BLAS integration
• Speedup up to factor 67
• Master-Worker MPI-parallelization
Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks
Open-Source Code: http://icwww.epfl.ch/~stamatak/AxParafit.html
SwissGrid-based Web-Server planned



AxParafit: Sequential
Performance


AxParafit: Parallel
Performance


The ML Benchmark:
A Current Community Project
Standardized way required to test ML search programs
Web-Server with real-world alignments and performance data at Swiss Institute of Bioinformatics
Many developers of popular ML programs involved
• Stephane Guindon (PHYML) Montpellier
• Simon Whelan (LeaPhy) Manchester
• Bui Quang Minh (IQPNNI) Vienna
• Derrick Zwickl (GARLI) Virginia
• Thomas Keane (dprML) Cambridge
Byproduct: SPEC-like CPU benchmark for phylogenetics
Follow-up: (planned) ML competition at major conference with industrial sponsor


A Current Problem:
Handling Multi-Gene Alignments

Gene 1 Gene 2
Sequence 1

Sequence 5

Missing Data ≠ Gap Data


A Multi-Gene Model


A Multi-Gene Model

LogLH(T) = LogLH(T|Red)


A Multi-Gene Model

LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)


A Multi-Gene Model
Challenge: devise efficient data
structures for this

LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)


Why are Individual Branches
per Gene a Challenge?


Outlook


Outlook
Tree of Life
What is a good alignment in a phylogenetic context?
Simultaneous alignment and tree building
More HPC & memory-aware programming
Multi-core architectures
Models for “gappy” multi-gene alignments



Acknowledgements
BlueGene Project
• Michael Ott, TUM
• Srinivas Aluru, Jaroslaw Zola, Iowa State
• Dan Janies, Andrew Johnson, Ohio State
IBM CELL & Playstation
• Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech
• Christos Antonopoulos, Univ. of Thessaly
Bootstopping
• Bernard Moret, Masoud Alipour, EPFL
• Olaf Bininda-Emonds, Univ. Jena
RAxML Web-Server
• Jacques Rougemont, SIB
• Terri Liebowitz, SDSC
AxParafit/AxPcoords
• Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen
Datasets for Studies
• Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)


Thank you for your
Attention !

Lake Geneva, Switzerland
