Multi-core programming talk for weekly biostat seminar
1. Using multi-core algorithms to speed up optimization
Gary K. Chen
Biostat Noon Seminar
March 23, 2011

2. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

3. CPUs are not getting any faster
Heat and power are the sole obstacles.
According to Intel: underclock a single core by 20 percent and you save half the power while sacrificing only 13 percent of the performance.
Implication? Two cores at the same power have 73% more performance: (100 - 13) * 2 / 100.

4. High performance computing clusters (1)
Coarse-grained, aka "embarrassingly parallel", problems:
1. Launch multiple instances of the program
2. Compute summary statistics across log files
Examples: Monte Carlo simulations (power/specificity), GWAS scans, imputation, etc.
Remarks
Pros: maximizes throughput (CPUs are kept busy), gentle learning curve
Cons: doesn't address some interesting computational problems

5. Cluster Resource Example
HPCC at USC:
94-teraflop cluster
1,980 simultaneous processes running on the main queue
Jobs are asynchronous; they can start and end in any order
Portable Batch System:
Simply prepend some headers to your shell script, describing how much memory you want, how long your job will run, etc. (a minimal sketch follows below)
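
A minimal sketch of such a script, assuming a Torque/PBS-style scheduler; the job name, resource requests, and program name are hypothetical:

    #!/bin/bash
    #PBS -N my_analysis          # job name (hypothetical)
    #PBS -l walltime=02:00:00    # how long the job will run
    #PBS -l mem=4gb              # how much memory you want
    #PBS -l nodes=1:ppn=1        # one core on one node
    cd $PBS_O_WORKDIR            # run from the submission directory
    ./my_analysis > run.log      # the program itself (hypothetical)

Submitting is then just a matter of qsub script.sh.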

6. High performance computing clusters (2)
Tightly-coupled parallel programs
Message Passing Interface:
1. Programs are distributed across multiple physical hosts
2. Each program executes the exact same code
3. All processes can be synchronized at strategic points
Remarks
Pro: can run interesting algorithms like parallel tempered MCMC
Con: the developer is responsible for establishing a communication protocol (see the sketch below)
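
A minimal MPI sketch in C of the synchronize-at-strategic-points idea: every rank runs the exact same code, does its share of work, and meets at a collective reduction. The "work" here is a stand-in:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        double partial = rank + 1.0;   /* stand-in for real work */
        double total = 0.0;
        /* synchronization point: all ranks contribute, root receives */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %f (from %d ranks)\n", total, size);
        MPI_Finalize();
        return 0;
    }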

7. Exploiting multiple-core processors
Fine-grained parallelism:
Suggests a much higher degree of inter-dependence between each process
A "master" process executes the majority of the code base; "slave" processes are invoked to ease bottlenecks
We hope to minimize the time spent in the master process
Some Bayesian algorithms stand to benefit

8. Amdahl's Law

\[ \text{speedup} = \frac{1}{(1 - P) + P/N} \]

where P is the fraction of the program that can be parallelized and N is the number of processors.
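
A worked example (not from the slide): if P = 0.95 of the runtime is parallelizable and N = 8 cores are available,

\[ \text{speedup} = \frac{1}{(1 - 0.95) + 0.95/8} \approx 5.9, \]

and no matter how large N grows, the speedup is capped at 1/(1 - P) = 20.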

9. Heterogeneous Computing
(figure)

10. Multi-core programming
aka data-parallel programming
Built in to common compilers (e.g. gcc); very easy to get started!
SSE, or Streaming SIMD Extensions: each core can do vector operations
OpenMP: parallel processing across multiple cores
e.g. simply insert a "#pragma omp for" directive and compile with gcc! (a sketch follows below)
CUDA/OpenCL:
CUDA is a proprietary C-based language endorsed by nVidia
OpenCL: a standards-based implementation backed by the Khronos Group
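
A minimal OpenMP sketch in C (illustrative, not from the talk), using the combined parallel-for form of the directive plus a reduction clause to keep the shared accumulator race-free; compile with gcc -fopenmp:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        static double x[1000000];
        double sum = 0.0;
        /* one directive parallelizes the loop across all cores */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            x[i] = i * 0.5;
            sum += x[i];
        }
        printf("sum = %f\n", sum);
        return 0;
    }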

11. OpenCL and CUDA
CUDA:
Powerful libraries available to enrich productivity
Thrust: C++ generics; cuBLAS: Level 1 and 2 parallel BLAS
Supported only on nVidia GPU devices
OpenCL:
Compatible with nVidia and ATI GPU devices, as well as AMD/Intel CPUs
Lags behind CUDA in libraries and tools
Good to work with, given that ATI hardware currently leads in value

12. A $60 HPC under your desk
(figure)

13. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

14. Threads and threadblocks
Threads:
Perform a very limited function, but do all the heavy lifting
Are extremely lightweight, so you'll want to launch thousands
Threadblocks:
The developer assigns threads that can cooperate on a common task into threadblocks
Threadblocks cannot communicate with one another and run in any order (asynchronously)
(a launch-configuration sketch follows below)
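
A minimal CUDA sketch of the launch pattern this implies (the kernel and sizes are illustrative assumptions): thousands of lightweight threads, grouped into threadblocks, each doing a very limited piece of work:

    #include <cstdio>

    __global__ void scale(float *x, float a, int n) {
        /* each thread handles exactly one element */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;                 /* about a million elements */
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        int threads = 256;                     /* threads per block */
        int blocks = (n + threads - 1) / threads;  /* enough blocks to cover n */
        scale<<<blocks, threads>>>(d_x, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d_x);
        return 0;
    }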

15. Thread organization
(figure)

16. Memory hierarchy
(figure)

17. Kernels
Warps/Wavefronts:
Describe an atomic set of threads (32 for nVidia, 64 for ATI)
Instructions are executed in lock step across the set, each thread processing a distinct data element
The developer is responsible for synchronizing across warps
Kernels:
Code that the developer writes, which can execute on a SIMD device
Essentially C functions (see the sketch below)
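
A sketch of a kernel, essentially a C function, in which threads must synchronize across warps before reading each other's work; the names and the fixed block size of 256 are assumptions:

    __global__ void shift_left(const float *in, float *out, int n) {
        __shared__ float tile[256];            /* one block's staging area */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();                       /* barrier across all warps in the block */
        /* read a neighbor's value, possibly written by a different warp;
           block-boundary elements are omitted for brevity */
        if (i < n - 1 && threadIdx.x < blockDim.x - 1)
            out[i] = tile[threadIdx.x + 1];
    }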

18. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

19. Hidden Markov Models
A staple in machine learning.
Many applications in statistical genetics, including imputation of untyped genotypes, local ancestry, and sequence alignment (e.g. protein family scoring)

20. Application to cancer tumor data
Extending PennCNV:
Tissues are assumed to be a mixture of tumor/normal cells
Tumors are assumed to be heterogeneous in copy number (CN) across cells, implying fractional copy number states
PennCNV defines 6 hidden integer states for normal cells and does not infer allelic state
We can make more precise estimates of both copy numbers and allelic state in tumors with little sacrifice in performance
Copy number: \( z = (1 - \alpha)\, z_{\text{normal}} + \alpha\, z_{\text{tumor}} \)
z is fractional, whereas \( z_{\text{tumor}} = I(z \le 2)\lfloor z \rfloor + I(z > 2)\lceil z \rceil \)
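
A worked example with hypothetical values: for a tumor fraction \( \alpha = 0.3 \), \( z_{\text{normal}} = 2 \), and an integer tumor state \( z_{\text{tumor}} = 3 \), the mixture gives \( z = 0.7 \cdot 2 + 0.3 \cdot 3 = 2.3 \), a fractional state; and since \( z > 2 \), rounding up recovers \( z_{\text{tumor}} = \lceil 2.3 \rceil = 3 \).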

22. Training a Hidden Markov Model
Objective: infer the probabilities of transitioning between any pair of states
Apply the forward-backward and Baum-Welch algorithms
A special case of the Expectation-Maximization (or, more generally, MM) family of algorithms
Expectation step: forward-backward computes posterior probabilities based on the estimated parameters
Maximization step: Baum-Welch empirically estimates parameters by averaging across observations

23. Forward algorithm
We compute the probability vector at observation t: \( f_{0:t} = f_{0:t-1}\, T\, O_t \)
Each state (element of the m-state vector) can independently compute a sum-product
Threadblocks map to states
Threads calculate products in parallel, followed by a log2(m) addition reduction (see the sketch below)
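
A sketch of one forward step under this mapping: one threadblock per destination state, one thread per source state, and a shared-memory tree reduction for the sum-product. The names and the fixed state count are assumptions, not the talk's actual code:

    #include <cuda_runtime.h>

    #define M 512   /* number of hidden states; assumed a power of two */

    __global__ void forward_step(const float *f_prev,  /* f[0:t-1], length M */
                                 const float *T,       /* M x M transition, row-major */
                                 const float *Ot,      /* emission probs at t, length M */
                                 float *f_curr) {      /* output f[0:t], length M */
        __shared__ float prod[M];
        int j = blockIdx.x;                  /* destination state (threadblock) */
        int i = threadIdx.x;                 /* source state (thread) */
        prod[i] = f_prev[i] * T[i * M + j];  /* one product per thread */
        __syncthreads();
        /* log2(M) tree reduction to sum the products */
        for (int s = M / 2; s > 0; s >>= 1) {
            if (i < s) prod[i] += prod[i + s];
            __syncthreads();
        }
        if (i == 0) f_curr[j] = prod[0] * Ot[j];
    }
    /* launched as forward_step<<<M, M>>>(...), once per observation t */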

24. Gridblock of threadblocks
(figure)

25. Speedups
We implement 8 kernels. Examples:
Re-scaling the transition matrix (for SNP spacing): Serial O(2nm^2); Parallel O(n)
Forward-backward: Serial O(2nm^2); Parallel O(n log2(m))
Normalizing constant (Baum-Welch): Serial O(nm); Parallel O(log2(n))
MLE of transition matrix (Baum-Welch): Serial O(nm^2); Parallel O(n)

26. Run time comparison

Table: 1 iteration of HMM training on Chr 1 (41,263 SNPs)

states   CPU      GPU      fold-speedup
128      9.5m     37s      15x
512      2h 35m   1m 44s   108x

27. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

28. Regularized Regression
Variable Selection
For tractability, most GWAS analyses entail separate univariate tests of each variable (e.g. SNP, GxG, GxE).
However, it is preferable to model all variables simultaneously, to tease out correlated variables
This is problematic when p > n: parameters are unestimable, and matrix inversion becomes computationally intractable

29. Regularized Regression
The LASSO method (Tibshirani, 1996)
Seeded a cottage industry of related methods, e.g. Group LASSO, Elastic Net, MCP, NEG, Overlap LASSO, Graph LASSO
Fundamentally solves the variable selection problem by introducing an L1 norm to invoke sparsity
Limitation: does not provide a mechanism for hypothesis testing (e.g. p-values)

30. Regularized Regression
Bayesian methods
Posterior inferences on β, e.g. Bayesian LASSO, Bayesian Elastic Net
Highly computational; scaling up to genome-wide scale is not obvious
MCMC is inherently serial, so the best option is to speed up the sampling chain
Proposal: implement the key bottleneck on the GPU: fitting the LASSO-penalized β to the data

31. Optimization
For binomial logistic regression:

\[ L(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right], \qquad p_i = \frac{e^{\mu + x_i^t \beta}}{1 + e^{\mu + x_i^t \beta}} \]

\[ dL(\beta) = \sum_{i=1}^{n} [y_i - p_i(\beta)]\, x_i, \qquad -d^2 L(\beta) = \sum_{i=1}^{n} p_i(\beta)[1 - p_i(\beta)]\, x_i x_i^t \]

For *penalized* regression:

\[ f(\beta) = L(\beta) - \lambda \sum_{j=1}^{p} |\beta_j| \]

Find the global maximum by applying Newton-Raphson one variable at a time:

\[ \beta_j^{m+1} = \beta_j^m + \frac{\sum_{i=1}^{n} [y_i - p_i(\beta^m)]\, x_{ij} - \lambda\, \mathrm{sgn}(\beta_j^m)}{\sum_{i=1}^{n} p_i(\beta^m)[1 - p_i(\beta^m)]\, x_{ij}^2} \]

32. Overview of algorithm
Newton-Raphson kernel (a sketch follows below):
Each threadblock maps to a block of 512 subjects (threads) for 1 variable
Each thread calculates its subject's contribution to the gradient and Hessian
Sum (reduction) across the 512 subjects
Sum (reduction) across subject blocks in a new kernel
Compute the log-likelihood change for each variable (as above)
Apply a max operator (log2 reduction) to select the variable with the greatest contribution to the likelihood
Iterate until the likelihood increase is less than epsilon
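
A sketch of the per-variable contribution kernel under the mapping the slide describes: one threadblock of 512 threads per (variable, subject-block) pair, each thread adding one subject's gradient and Hessian terms, then a tree reduction collapsing the block. Names and layout are assumptions:

    #define BS 512

    __global__ void nr_contrib(const float *x,      /* n x p covariates, column-major */
                               const float *y,      /* outcomes, length n */
                               const float *p,      /* current fitted probs, length n */
                               int n,
                               float *grad_part,    /* per-block partial gradients */
                               float *hess_part) {  /* per-block partial Hessians */
        __shared__ float g[BS], h[BS];
        int j = blockIdx.y;                          /* variable index */
        int i = blockIdx.x * BS + threadIdx.x;       /* subject index */
        float xij = (i < n) ? x[(size_t)j * n + i] : 0.0f;
        float r   = (i < n) ? (y[i] - p[i]) : 0.0f;
        float w   = (i < n) ? (p[i] * (1.0f - p[i])) : 0.0f;
        g[threadIdx.x] = r * xij;                    /* gradient term */
        h[threadIdx.x] = w * xij * xij;              /* Hessian term */
        __syncthreads();
        for (int s = BS / 2; s > 0; s >>= 1) {       /* log2(BS) reduction */
            if (threadIdx.x < s) {
                g[threadIdx.x] += g[threadIdx.x + s];
                h[threadIdx.x] += h[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            grad_part[j * gridDim.x + blockIdx.x] = g[0];
            hess_part[j * gridDim.x + blockIdx.x] = h[0];
        }
    }
    /* A second kernel then reduces across subject blocks, applies the
       lambda * sgn(beta_j) penalty term, and forms the coordinate update. */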

33. Gridblock of threadblocks
(figure)

34. Consideration of datatypes
Need to compress genotypes
Why? Global memory is scarce, and bandwidth is expensive
A warp of 32 threads loads 32 words (containing 512 genotypes) into local memory (a sketch follows below)
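
The slide's arithmetic implies 2-bit genotype codes: 16 genotypes per 32-bit word, so 32 words hold 512 genotypes. A sketch of such a packing, where each thread loads one word and unpacks it; the packing scheme and names are assumptions:

    __global__ void unpack_block(const unsigned int *words, int *out, int nwords) {
        int w = blockIdx.x * blockDim.x + threadIdx.x;   /* one word per thread */
        if (w < nwords) {
            unsigned int bits = words[w];                /* coalesced: a warp of 32
                                                            threads loads 32 words,
                                                            i.e. 512 genotypes */
            for (int k = 0; k < 16; k++)
                out[w * 16 + k] = (bits >> (2 * k)) & 3; /* 2-bit genotype code 0..3 */
        }
    }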

35. Distributed GPU implementation
For really large dimensions, we can link up an arbitrary number of GPUs
MPI allows us to spread work across a cluster
Developed on Epigraph: 2 Tesla C2050s
Approach (a sketch follows below):
The MPI master node delegates the heavy lifting to slaves across the network
The master node performs fast serial code, such as sampling from the full conditional likelihood of any penalty parameter (e.g. λ)
Network traffic is minimized, so slaves must maintain up-to-date copies of the data structures
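
A sketch in C of the kind of exchange this implies: the master broadcasts only the small state that changed each iteration, and every rank applies it to its own copies rather than reshipping data. The message contents are assumptions, not the talk's actual protocol:

    #include <mpi.h>

    typedef struct { int j; double beta_j; double lambda; } update_t;

    void sync_update(update_t *u) {
        /* a few bytes of traffic per iteration instead of the data matrix;
           rank 0 (the master) is the broadcast root */
        MPI_Bcast(u, sizeof(update_t), MPI_BYTE, 0, MPI_COMM_WORLD);
        /* every rank then applies the update to its local copy of beta
           and the fitted probabilities */
    }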

37. Evaluation on a large dataset
GWAS data:
6,806 African American subjects in a case-control study of prostate cancer
1,047,986 SNPs typed
Elapsed walltime for 1 LASSO iteration (a sweep across all variables):
15 minutes for the optimized serial implementation across 2 slave CPUs
5.8 seconds for the parallel implementation across 2 nVidia Tesla C2050 GPU devices
A 155x speedup

39. An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks

40. Conclusion
Multicore programming is not a panacea
Insufficient parallelism leads to an inferior implementation
Graph algorithms *generally* do not map well to SIMD architectures
Programming effort:
Expect to spend at least 90% of your time debugging a black box
Is it worth it? Human time > computer time?
For generic problems (matrix multiplication, sorting), absolutely
OpenCL is a bit more verbose than CUDA, but is more portable

41. Potential Future Work
Reconstructing Bayesian networks:
Compute the joint probability for each possible topology
Code the graph as a sparse adjacency matrix
Approximate Bayesian Computation:
Sample θ from some assumed prior distribution
Generate a dataset conditional on θ
Examine how close the fake data is to the real data

42. Tomorrow's clusters will require heterogeneous programming
(figure)

43. Tianhe-1A
World's fastest supercomputer
4.7 petaflops (quadrillion floating point operations/sec)
14,336 Xeon CPUs, 7,168 Tesla M2050s
According to nVidia:
A CPU-only equivalent would need 50k CPUs and twice the floor space
A CPU-only equivalent would draw 12 megawatts, compared to 4.04 megawatts
$88 million to build, $20 million for annual energy costs

44. Thanks to
Kai: ideas for CNV analysis
Duncan, Wei: discussions on LASSO
Tim, Zack: access to Epigraph
Alex, James: lively HPC discussions/debates