Multi-core programming talk for weekly biostat seminar
1. Using multi-core algorithms to speed up optimization
Gary K. Chen
Biostat Noon Seminar
March 23, 2011

2. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

3. CPUs are not getting any faster
Heat and power are the sole obstacles.
According to Intel: underclock a single core by 20 percent and you save half the power while sacrificing only 13 percent of the performance.
Implication? Two cores at the same power have 73% more performance: (100 - 13) * 2 / 100.

4. High performance computing clusters (1)
Coarse-grained, aka "embarrassingly parallel", problems:
1. Launch multiple instances of the program
2. Compute summary statistics across log files
Examples: Monte Carlo simulations (power/specificity), GWAS scans, imputation, etc.
Remarks
Pros: maximizes throughput (CPUs are kept busy), gentle learning curve
Cons: doesn't address some interesting computational problems

5. Cluster Resource Example
HPCC at USC:
94-teraflop cluster
1,980 simultaneous processes running on the main queue
Jobs are asynchronous; they can start and end in any order
Portable Batch System:
Simply prepend some headers to your shell script, describing how much memory you want, how long your job will run, etc. (a minimal sketch follows below)
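
A minimal sketch of such a script, assuming a Torque/PBS-style scheduler; the job name, resource requests, and program name are hypothetical:

    #!/bin/bash
    #PBS -N my_analysis          # job name (hypothetical)
    #PBS -l walltime=02:00:00    # how long the job will run
    #PBS -l mem=4gb              # how much memory you want
    #PBS -l nodes=1:ppn=1        # one core on one node
    cd $PBS_O_WORKDIR            # run from the submission directory
    ./my_analysis > run.log      # the program itself (hypothetical)

Submitting is then just a matter of qsub script.sh.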

6. High performance computing clusters (2)
Tightly-coupled parallel programs
Message Passing Interface:
1. Programs are distributed across multiple physical hosts
2. Each program executes the exact same code
3. All processes can be synchronized at strategic points
Remarks
Pro: can run interesting algorithms like parallel tempered MCMC
Con: the developer is responsible for establishing a communication protocol (see the sketch below)
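
A minimal MPI sketch in C of the synchronize-at-strategic-points idea: every rank runs the exact same code, does its share of work, and meets at a collective reduction. The "work" here is a stand-in:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        double partial = rank + 1.0;   /* stand-in for real work */
        double total = 0.0;
        /* synchronization point: all ranks contribute, root receives */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %f (from %d ranks)\n", total, size);
        MPI_Finalize();
        return 0;
    }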

7. Exploiting multiple-core processors
Fine-grained parallelism:
Suggests a much higher degree of inter-dependence between each process
A "master" process executes the majority of the code base; "slave" processes are invoked to ease bottlenecks
We hope to minimize the time spent in the master process
Some Bayesian algorithms stand to benefit

8. Amdahl's Law

\[ \text{speedup} = \frac{1}{(1 - P) + P/N} \]

where P is the fraction of the program that can be parallelized and N is the number of processors.
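
A worked example (not from the slide): if P = 0.95 of the runtime is parallelizable and N = 8 cores are available,

\[ \text{speedup} = \frac{1}{(1 - 0.95) + 0.95/8} \approx 5.9, \]

and no matter how large N grows, the speedup is capped at 1/(1 - P) = 20.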

9. Heterogeneous Computing
(figure)

10. Multi-core programming
aka data-parallel programming
Built in to common compilers (e.g. gcc); very easy to get started!
SSE, or Streaming SIMD Extensions: each core can do vector operations
OpenMP: parallel processing across multiple cores
e.g. simply insert a "#pragma omp for" directive and compile with gcc! (a sketch follows below)
CUDA/OpenCL:
CUDA is a proprietary C-based language endorsed by nVidia
OpenCL: a standards-based implementation backed by the Khronos Group
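
A minimal OpenMP sketch in C (illustrative, not from the talk), using the combined parallel-for form of the directive plus a reduction clause to keep the shared accumulator race-free; compile with gcc -fopenmp:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        static double x[1000000];
        double sum = 0.0;
        /* one directive parallelizes the loop across all cores */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            x[i] = i * 0.5;
            sum += x[i];
        }
        printf("sum = %f\n", sum);
        return 0;
    }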

11. OpenCL and CUDA
CUDA:
Powerful libraries available to enrich productivity
Thrust: C++ generics; cuBLAS: Level 1 and 2 parallel BLAS
Supported only on nVidia GPU devices
OpenCL:
Compatible with nVidia and ATI GPU devices, as well as AMD/Intel CPUs
Lags behind CUDA in libraries and tools
Good to work with, given that ATI hardware currently leads in value

12. A $60 HPC under your desk
(figure)

13. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

14. Threads and threadblocks
Threads:
Perform a very limited function, but do all the heavy lifting
Are extremely lightweight, so you'll want to launch thousands
Threadblocks:
The developer assigns threads that can cooperate on a common task into threadblocks
Threadblocks cannot communicate with one another and run in any order (asynchronously)
(a launch-configuration sketch follows below)
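
A minimal CUDA sketch of the launch pattern this implies (the kernel and sizes are illustrative assumptions): thousands of lightweight threads, grouped into threadblocks, each doing a very limited piece of work:

    #include <cstdio>

    __global__ void scale(float *x, float a, int n) {
        /* each thread handles exactly one element */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;                 /* about a million elements */
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        int threads = 256;                     /* threads per block */
        int blocks = (n + threads - 1) / threads;  /* enough blocks to cover n */
        scale<<<blocks, threads>>>(d_x, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d_x);
        return 0;
    }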

15. Thread organization
(figure)

16. Memory hierarchy
(figure)

17. Kernels
Warps/Wavefronts:
Describe an atomic set of threads (32 for nVidia, 64 for ATI)
Instructions are executed in lock step across the set, each thread processing a distinct data element
The developer is responsible for synchronizing across warps
Kernels:
Code that the developer writes, which can execute on a SIMD device
Essentially C functions (see the sketch below)
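
A sketch of a kernel, essentially a C function, in which threads must synchronize across warps before reading each other's work; the names and the fixed block size of 256 are assumptions:

    __global__ void shift_left(const float *in, float *out, int n) {
        __shared__ float tile[256];            /* one block's staging area */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();                       /* barrier across all warps in the block */
        /* read a neighbor's value, possibly written by a different warp;
           block-boundary elements are omitted for brevity */
        if (i < n - 1 && threadIdx.x < blockDim.x - 1)
            out[i] = tile[threadIdx.x + 1];
    }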

18. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

19. Hidden Markov Models
A staple in machine learning.
Many applications in statistical genetics, including imputation of untyped genotypes, local ancestry, and sequence alignment (e.g. protein family scoring)

20. Application to cancer tumor data
Extending PennCNV:
Tissues are assumed to be a mixture of tumor/normal cells
Tumors are assumed to be heterogeneous in copy number (CN) across cells, implying fractional copy number states
PennCNV defines 6 hidden integer states for normal cells and does not infer allelic state
We can make more precise estimates of both copy numbers and allelic state in tumors with little sacrifice in performance
Copy number: \( z = (1 - \alpha)\, z_{\text{normal}} + \alpha\, z_{\text{tumor}} \)
z is fractional, whereas \( z_{\text{tumor}} = I(z \le 2)\lfloor z \rfloor + I(z > 2)\lceil z \rceil \)
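
A worked example with hypothetical values: for a tumor fraction \( \alpha = 0.3 \), \( z_{\text{normal}} = 2 \), and an integer tumor state \( z_{\text{tumor}} = 3 \), the mixture gives \( z = 0.7 \cdot 2 + 0.3 \cdot 3 = 2.3 \), a fractional state; and since \( z > 2 \), rounding up recovers \( z_{\text{tumor}} = \lceil 2.3 \rceil = 3 \).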

22. Training a Hidden Markov Model
Objective: infer the probabilities of transitioning between any pair of states
Apply the forward-backward and Baum-Welch algorithms
A special case of the Expectation-Maximization (or, more generally, MM) family of algorithms
Expectation step: forward-backward computes posterior probabilities based on the estimated parameters
Maximization step: Baum-Welch empirically estimates parameters by averaging across observations

23. Forward algorithm
We compute the probability vector at observation t: \( f_{0:t} = f_{0:t-1}\, T\, O_t \)
Each state (element of the m-state vector) can independently compute a sum-product
Threadblocks map to states
Threads calculate products in parallel, followed by a log2(m) addition reduction (see the sketch below)
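
A sketch of one forward step under this mapping: one threadblock per destination state, one thread per source state, and a shared-memory tree reduction for the sum-product. The names and the fixed state count are assumptions, not the talk's actual code:

    #include <cuda_runtime.h>

    #define M 512   /* number of hidden states; assumed a power of two */

    __global__ void forward_step(const float *f_prev,  /* f[0:t-1], length M */
                                 const float *T,       /* M x M transition, row-major */
                                 const float *Ot,      /* emission probs at t, length M */
                                 float *f_curr) {      /* output f[0:t], length M */
        __shared__ float prod[M];
        int j = blockIdx.x;                  /* destination state (threadblock) */
        int i = threadIdx.x;                 /* source state (thread) */
        prod[i] = f_prev[i] * T[i * M + j];  /* one product per thread */
        __syncthreads();
        /* log2(M) tree reduction to sum the products */
        for (int s = M / 2; s > 0; s >>= 1) {
            if (i < s) prod[i] += prod[i + s];
            __syncthreads();
        }
        if (i == 0) f_curr[j] = prod[0] * Ot[j];
    }
    /* launched as forward_step<<<M, M>>>(...), once per observation t */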

24. Gridblock of threadblocks
(figure)

25. Speedups
We implement 8 kernels. Examples:
Re-scaling the transition matrix (for SNP spacing): Serial O(2nm^2); Parallel O(n)
Forward-backward: Serial O(2nm^2); Parallel O(n log2(m))
Normalizing constant (Baum-Welch): Serial O(nm); Parallel O(log2(n))
MLE of transition matrix (Baum-Welch): Serial O(nm^2); Parallel O(n)

26. Run time comparison

Table: 1 iteration of HMM training on Chr 1 (41,263 SNPs)

states   CPU      GPU      fold-speedup
128      9.5m     37s      15x
512      2h 35m   1m 44s   108x

27. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

28. Regularized Regression
Variable Selection
For tractability, most GWAS analyses entail separate univariate tests of each variable (e.g. SNP, GxG, GxE).
However, it is preferable to model all variables simultaneously, to tease out correlated variables
This is problematic when p > n: parameters are unestimable, and matrix inversion becomes computationally intractable

29. Regularized Regression
The LASSO method (Tibshirani, 1996)
Seeded a cottage industry of related methods, e.g. Group LASSO, Elastic Net, MCP, NEG, Overlap LASSO, Graph LASSO
Fundamentally solves the variable selection problem by introducing an L1 norm to invoke sparsity
Limitation: does not provide a mechanism for hypothesis testing (e.g. p-values)

30. Regularized Regression
Bayesian methods
Posterior inferences on β, e.g. Bayesian LASSO, Bayesian Elastic Net
Highly computational; scaling up to genome-wide scale is not obvious
MCMC is inherently serial, so the best option is to speed up the sampling chain
Proposal: implement the key bottleneck on the GPU: fitting the LASSO-penalized β to the data

31. Optimization
For binomial logistic regression:

\[ L(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right], \qquad p_i = \frac{e^{\mu + x_i^t \beta}}{1 + e^{\mu + x_i^t \beta}} \]

\[ dL(\beta) = \sum_{i=1}^{n} [y_i - p_i(\beta)]\, x_i, \qquad -d^2 L(\beta) = \sum_{i=1}^{n} p_i(\beta)[1 - p_i(\beta)]\, x_i x_i^t \]

For *penalized* regression:

\[ f(\beta) = L(\beta) - \lambda \sum_{j=1}^{p} |\beta_j| \]

Find the global maximum by applying Newton-Raphson one variable at a time:

\[ \beta_j^{m+1} = \beta_j^m + \frac{\sum_{i=1}^{n} [y_i - p_i(\beta^m)]\, x_{ij} - \lambda\, \mathrm{sgn}(\beta_j^m)}{\sum_{i=1}^{n} p_i(\beta^m)[1 - p_i(\beta^m)]\, x_{ij}^2} \]

32. Overview of algorithm
Newton-Raphson kernel (a sketch follows below):
Each threadblock maps to a block of 512 subjects (threads) for 1 variable
Each thread calculates its subject's contribution to the gradient and Hessian
Sum (reduction) across the 512 subjects
Sum (reduction) across subject blocks in a new kernel
Compute the log-likelihood change for each variable (as above)
Apply a max operator (log2 reduction) to select the variable with the greatest contribution to the likelihood
Iterate until the likelihood increase is less than epsilon
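
A sketch of the per-variable contribution kernel under the mapping the slide describes: one threadblock of 512 threads per (variable, subject-block) pair, each thread adding one subject's gradient and Hessian terms, then a tree reduction collapsing the block. Names and layout are assumptions:

    #define BS 512

    __global__ void nr_contrib(const float *x,      /* n x p covariates, column-major */
                               const float *y,      /* outcomes, length n */
                               const float *p,      /* current fitted probs, length n */
                               int n,
                               float *grad_part,    /* per-block partial gradients */
                               float *hess_part) {  /* per-block partial Hessians */
        __shared__ float g[BS], h[BS];
        int j = blockIdx.y;                          /* variable index */
        int i = blockIdx.x * BS + threadIdx.x;       /* subject index */
        float xij = (i < n) ? x[(size_t)j * n + i] : 0.0f;
        float r   = (i < n) ? (y[i] - p[i]) : 0.0f;
        float w   = (i < n) ? (p[i] * (1.0f - p[i])) : 0.0f;
        g[threadIdx.x] = r * xij;                    /* gradient term */
        h[threadIdx.x] = w * xij * xij;              /* Hessian term */
        __syncthreads();
        for (int s = BS / 2; s > 0; s >>= 1) {       /* log2(BS) reduction */
            if (threadIdx.x < s) {
                g[threadIdx.x] += g[threadIdx.x + s];
                h[threadIdx.x] += h[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            grad_part[j * gridDim.x + blockIdx.x] = g[0];
            hess_part[j * gridDim.x + blockIdx.x] = h[0];
        }
    }
    /* A second kernel then reduces across subject blocks, applies the
       lambda * sgn(beta_j) penalty term, and forms the coordinate update. */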

33. Gridblock of threadblocks
(figure)

34. Consideration of datatypes
Need to compress genotypes
Why? Global memory is scarce, and bandwidth is expensive
A warp of 32 threads loads 32 words (containing 512 genotypes) into local memory (a sketch follows below)
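
The slide's arithmetic implies 2-bit genotype codes: 16 genotypes per 32-bit word, so 32 words hold 512 genotypes. A sketch of such a packing, where each thread loads one word and unpacks it; the packing scheme and names are assumptions:

    __global__ void unpack_block(const unsigned int *words, int *out, int nwords) {
        int w = blockIdx.x * blockDim.x + threadIdx.x;   /* one word per thread */
        if (w < nwords) {
            unsigned int bits = words[w];                /* coalesced: a warp of 32
                                                            threads loads 32 words,
                                                            i.e. 512 genotypes */
            for (int k = 0; k < 16; k++)
                out[w * 16 + k] = (bits >> (2 * k)) & 3; /* 2-bit genotype code 0..3 */
        }
    }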

35. Distributed GPU implementation
For really large dimensions, we can link up an arbitrary number of GPUs
MPI allows us to spread work across a cluster
Developed on Epigraph: 2 Tesla C2050s
Approach (a sketch follows below):
The MPI master node delegates the heavy lifting to slaves across the network
The master node performs fast serial code, such as sampling from the full conditional likelihood of any penalty parameter (e.g. λ)
Network traffic is minimized, so slaves must maintain up-to-date copies of the data structures
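
A sketch in C of the kind of exchange this implies: the master broadcasts only the small state that changed each iteration, and every rank applies it to its own copies rather than reshipping data. The message contents are assumptions, not the talk's actual protocol:

    #include <mpi.h>

    typedef struct { int j; double beta_j; double lambda; } update_t;

    void sync_update(update_t *u) {
        /* a few bytes of traffic per iteration instead of the data matrix;
           rank 0 (the master) is the broadcast root */
        MPI_Bcast(u, sizeof(update_t), MPI_BYTE, 0, MPI_COMM_WORLD);
        /* every rank then applies the update to its local copy of beta
           and the fitted probabilities */
    }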

37. Evaluation on a large dataset
GWAS data:
6,806 African American subjects in a case-control study of prostate cancer
1,047,986 SNPs typed
Elapsed walltime for 1 LASSO iteration (a sweep across all variables):
15 minutes for the optimized serial implementation across 2 slave CPUs
5.8 seconds for the parallel implementation across 2 nVidia Tesla C2050 GPU devices
A 155x speedup

39. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

40. Conclusion
Multicore programming is not a panacea
Insufficient parallelism leads to an inferior implementation
Graph algorithms *generally* do not map well to SIMD architectures
Programming effort:
Expect to spend at least 90% of your time debugging a black box
Is it worth it? Human time > computer time?
For generic problems (matrix multiplication, sorting), absolutely
OpenCL is a bit more verbose than CUDA, but is more portable

41. Potential Future Work
Reconstructing Bayesian networks:
Compute the joint probability for each possible topology
Code the graph as a sparse adjacency matrix
Approximate Bayesian Computation:
Sample θ from some assumed prior distribution
Generate a dataset conditional on θ
Examine how close the fake data is to the real data

42. Tomorrow's clusters will require heterogeneous programming
(figure)

43. Tianhe-1A
World's fastest supercomputer
4.7 petaflops (quadrillion floating point operations/sec)
14,336 Xeon CPUs, 7,168 Tesla M2050s
According to nVidia:
A CPU-only equivalent would need 50k CPUs and twice the floor space
A CPU-only equivalent would draw 12 megawatts, compared to 4.04 megawatts
$88 million to build, $20 million for annual energy costs

44. Thanks to
Kai: ideas for CNV analysis
Duncan, Wei: discussions on LASSO
Tim, Zack: access to Epigraph
Alex, James: lively HPC discussions/debates