OpenCL applications in genomics
1. Using OpenCL to accelerate genomic analysis
Gary K. Chen
June 16, 2011
2. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
3. Scientific Programming on GPGPU devices
nVidia and ATI are currently market leaders
Very competitive in performance and price
Impressive double-precision performance, though still about 4 times slower than 32-bit FP
ATI FireStream 9370: 528 GFLOPS (FP64), 4GB GDDR5, $2,399
nVidia Tesla C2050: 520 GFLOPS (FP64), 3GB GDDR5, $2,199
Source: www.sabrepc.com
5. Future multi-core CPUs
Intel’s 48 core SCC chip
Potentially a more powerful solution for data-intensive computing, since it is not constrained by the PCI bus
10. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
11. Biology background
DNA
A string with a four-letter alphabet: A, C, G, T
Humans have two copies: one from mom, one from
dad
Most of the sequence is the same between any two copies, except for a small proportion
Example sequence: ATATTGC. We could have:
A single nucleotide polymorphism (common point
mutation): ATATAGC
Copy number variants/aberrations (deletions, amplifications, translocations):
a deletion: AT–GC
an amplification: ATATTATTATTGC
13. What is observed
Microarray output
Probes are dyed, and microarrays scanned with
CCD cameras
X, Y: intensities of the A and B alleles (the two possible variants)
R = X+Y: overall intensity
LRR (log2 R ratio): intensity relative to a standard reference intensity
BAF (B allele frequency): ratio of allelic intensity between A and B (see the sketch below)
15. Hidden Markov Model
A formalized statistical model
We want to use information from observables
(LRR,BAF) to infer true state of nature (copy
number, genotype)
Table: Example hidden states from PennCNV software
State  CN  Possible genotypes
1      0   Null
2      1   A, B
3      2   AA, AB, BB
4      2   AA, BB
5      3   AAA, AAB, ABB, BBB
6      4   AAAA, AAAB, AABB, ABBB, BBBB
16. Copy number inference in tumors
Inference is harder!
1. When dissecting breast tissue, for example, stromal (normal cell) contamination is almost inevitable, so you are modeling a mixture of two or more cell populations
Suppose you have a state assuming normal CN=2, tumor CN=4, $\alpha = 0.2$
e.g. $r_i = \alpha r_{i,n} + (1 - \alpha) r_{i,t}$
expected mean intensity: $0.2(1) + 0.8(1.68) = 1.544$
2. Amplification events can be wilder than germline (e.g. blood) events, leading to more copy number/genotype possibilities
Combine issues 1) and 2) and you get a huge search space
19. Algorithm
Initialize
Empirically estimate σ of BAF and LRR
Compute emission matrix O for each state/obs
from a Gaussian pdf
Train: Expectation-Maximization
Forward-backward: computes posterior probabilities and the overall likelihood
Baum-Welch: computes MLEs of the transition probabilities in matrix T
Traverse state path
Viterbi (dynamic programming): walk the state
path based on max-product
20. Parallel Forward Algorithm
We compute the probability vector at observation t: $f_{0:t} = f_{0:t-1}\, T\, O_t$
Each state (element of the m-state vector) can
independently compute a sum-product
Threadblocks map to states
Threads calculate products in parallel, followed by a log2(m)-step addition reduction (sketched below)
21. Technical issue: Underflow
Tiny probabilities often have to be represented in log space (even with FP64)
How do we deal with adding log probabilities?
We usually exponentiate, add, then take the log
Remedy
Add an offset to the logs before exponentiating
Subtract the offset from the log-space answer (see the helper below)
26. Performance
Table: One EM iteration on Chr 1 (41,263 SNPs)
States  CPU time   GPU time  Fold speedup
128     9.5 m      37 s      15×
512     2 h 35 m   1 m 44 s  108×
27. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
28. Storing data
Global memory
Relatively abundant, but slow
However, even 4GB may be insufficient for modern
datasets
Genotype data
Highly compressible
We only care if a position differs from the
canonical sequence
Thus AA, AB, BB, and NULL are the 4 possible genotypes
We should be able to encode these in two bits, so 4 genotypes per byte (as in the sketch below)
29. Possible approaches
Store as a float array
+: Easy to implement
-: Uses 16 times as much memory as needed!
Store as an int array
Allocate a local memory array of 256 rows, 4 cols
for mapping all possible genotype 4-tuples
+: Uses global memory efficiently, maximizes
bandwidth
-: You might not even have enough local memory, much less any left over for real work
Store as a char array
Right-shift pairs of bits, then AND-mask with 3
+: Uses global memory efficiently, saves on local
memory
-: Threads load a minimum of 4 bytes per transaction, so you use only 25% of available bandwidth
30. One solution: custom container
Idea:
Designate each threadblock to handle 512
genotypes
First 32 threads: each loads one packedgeno_t element
For each of the 32 threads:
Loop four times, extracting each char
Subloop four times, extracting each genotype via bitshift/mask (see the sketch below)
32. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
33. Inferring Relatedness
Inferring relatedness
The human race is one large pedigree
Individuals of the same ethnicity are expected
to share more SNP alleles
We can summarize this relationship through a
correlation matrix called ’K’
34. Uses for the ’K’ matrix
Principal Components Analysis
A singular value decomposition of K: $K = VDV^T$
V contains orthogonal axes, facilitating population
structure inference
Estimating heritability
In random effects models: $Y = \mu + \beta X + g + e$, with $\mathrm{Var}(g) = \gamma^2 K$ and $\mathrm{Var}(e) = \sigma^2 I$
$h^2 = \gamma^2 / (\gamma^2 + \sigma^2)$
36. Computing K
Essentially a matrix multiplication
$\hat{K}_{jk} = \frac{1}{m} \sum_{i=1}^{m} \frac{(x_{ij} - 2f_i)(x_{ik} - 2f_i)}{4 f_i (1 - f_i)}$
Or in other words: K = ZZ', where Z is the standardized genotype matrix
Including more SNPs adds more precise, subtle
information
Parallel code
Carrying out matrix multiplication is straightforward on a GPU
Matrix multiplication is ideal for GPUs: approx. 240× speedup.
Because K is summed over SNPs, we can split the genotype matrix into subsets of SNPs and compute each K slice in parallel (see the sketch below)
37. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
38. Variable Selection
One goal in biomedical research is correlating
DNA variation to disease phenotypes
Genomics technology
The number of subjects n remains about the same (cost of recruiting, sample prep, etc.), while the number of features p is exploding
The rate at which data is generated per dollar surpasses Moore's Law
40. Regression
Standard logistic regression
The usual method for hypothesis testing of
candidate predictors
$\log\left(\frac{p}{1-p}\right) = \beta X$, p being the probability of affection
We apply Newton-Raphson scoring until the log-likelihood $f(\beta)$ is maximized.
Logistic regression simply fails when p > n
L1 penalized regression, aka LASSO
Idea: Fit the logistic regression model, but subject
to a penalty parameter λ
$g(\beta) = f(\beta) - \lambda \sum_{j=1}^{p} |\beta_j|$
41. Algorithms for fitting the LASSO
Cyclic Coordinate Descent: one-dimensional Newton-Raphson at variable j:
$\Delta\beta_j = \beta_j^{(\mathrm{new})} - \beta_j = -\frac{g'(\beta_j)}{g''(\beta_j)}$
$g'(\beta_j) = \sum_{i=1}^{n} \frac{x_{ij}\, y_i}{1 + \exp(x_{ij}\,\beta_j\, y_i)} - \mathrm{sgn}(\beta_j)\,\lambda$
$g''(\beta_j) = \sum_{i=1}^{n} \frac{x_{ij}^2 \exp(x_{ij}\,\beta_j\, y_i)}{(1 + \exp(x_{ij}\,\beta_j\, y_i))^2}$
We cycle through each j until likelihood stops
increasing within some tolerance
Performs great, but only allows parallelization
across samples
ref: Genkin, Lewis, Madigan: Technometrics 2007, Vol. 49, No. 3
42. Distributed GPU implementation
If it is possible to parallelize across variables, it is worth splitting up the design matrix
For really large dimensions, we can link up an
arbitrary number of GPUs
The Message Passing Interface (MPI) lets us stay agnostic to the physical location of GPU devices
43. Distributed GPU implementation
Approach:
MPI master node delegates the heavy lifting to slaves across the network
Master node performs fast serial work, such as sampling new λ, comparing log-likelihoods, broadcasting gradients, etc.
Network traffic is kept to a minimum (see the sketch below)
Implemented for Greedy Coordinate Descent and Gradient Descent
Developed on server at USC Epigenome Center: 2
Tesla C2050s
45. Parallel algorithms for fitting the LASSO
Greedy coordinate descent (ref)
Same algorithm as CCD, except that in each sweep over the variables, only the j giving the greatest increase in logL is updated
No dependencies between subjects and variables, so massive parallelization across subjects AND variables
Ideal if you have a huge dataset and want a stringent type 1 error rate (you only care about a few variables)
Ayers and Cordell, Gen Epi 2010: Permute, and
pick largest λ that allows first “false” variable to
enter
ref: Wu, Lange: Annals Appl Stat 2008 Vol 2,No. 1
47. Overview of Greedy CD algorithm
Newton-Raphson kernel
Each threadblock maps to a block of 512 subjects (threads) for 1 variable
Each thread calculates its subject's contribution to the gradient and Hessian
Sum (reduction) across the 512 subjects
Sum (reduction) across subject blocks in a new kernel
Compute the log-likelihood change for each variable (as above).
Apply a max operator (log2 reduction) to select the variable with the greatest contribution to the likelihood.
Iterate until the likelihood increase is less than epsilon (first-stage kernel sketched below)
48. Evaluation on large dataset
GWAS data
6,806 subjects in a case control study of prostate
cancer
1,047,986 SNPs typed
Invoke approx. 7 billion threads per iteration
Total walltime for 1 GCD iteration (a sweep across all variables):
15 minutes for an optimized serial implementation split across 2 slave CPUs
5.8 seconds for the parallel implementation across 2 nVidia Tesla C2050 GPU devices
155× speedup
49. Parallel algorithms for fitting the LASSO
(Stochastic Mirror) Gradient Descent (ref)
Sometimes we are interested in tuning λ for, say, the best cross-validation error
Greedy descent seems awfully wasteful in that only one βj is updated per sweep
However, we can update all variables in parallel while cycling through subjects
Algorithm
Extremely simple:
For subject i: gradient $g_i = \frac{-y_i}{1 + \exp(x_i\,\beta\, y_i)}$
Update the β vector: $\beta_j \leftarrow \beta_j - \eta\, g_i\, x_{ij}$ for all j (see the sketch below)
η is a learning rate, set sufficiently small (e.g. 0.0001)
ref: Shalev-Shwartz, Tewari: Proc. 26th Intern. Conf. Machine Learning 2009
50. Gradient descent
Performance
Slow convergence compared to serial cyclic
coordinate descent, but far more scalable
For large lambdas, slower than greedy coordinate
descent
Computation-to-bandwidth ratio is not great
For 1 million SNPs, only about a 15× speedup; far more SNPs are needed
Technical issues
Must store genotypes in subject-major order to enable coalesced memory loads/stores
This makes SNP-level summaries like means and SDs difficult to compute.
Heterogeneous data types: floats (E, E×G), compressed chars (G, G×G)
Memory constrained: interactions can be computed on the fly with SNP-major order
51. Potential for robust variable selection:
Subsampling:
Applying the LASSO once overfits the data; model selection is inconsistent
Subsampling is preferable: Bootstrapping, stability
selection, x-fold cross validation
Number of replicates << number of samples <<
number of features
Bayesian variable selection:
If we assume the LASSO β's are conditionally independent,
the master node can (quickly) sample hyperparameters (e.g. λ) from a prior distribution