Using OpenCL to accelerate
     genomic analysis

        Gary K. Chen


        June 16, 2011
An outline

  OpenCL Introduction

  Copy number inference in tumors

  Data considerations

  Hidden Relatedness

  Variable Selection
Scientific Programming on GPGPU
devices

     nVidia and ATI are currently market leaders
         Very competitive in performance and price
         Impressive double-precision performance, though
         still about 4 times slower than 32-bit FP
         ATI FireStream 9370: 528 GFLOPS (FP64), 4GB GDDR5,
         $2,399
         nVidia Tesla C2050: 520 GFLOPS (FP64), 3GB GDDR5,
         $2,199
         Source: www.sabrepc.com
Future multi-core CPUs




     Intel’s 48 core SCC chip
     Potentially a more powerful solution for
     data-intensive computing, since it is not
     constrained by the PCI bus
An open-standards based development
platform
Same idea as CUDA, different terms
Data parallel coding
OpenCL Concepts
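The concept slides above survive only as their titles; their bodies were
images. As a rough stand-in, here is a minimal host-side sketch of the
OpenCL workflow those terms describe (platform, device, context, command
queue, program, kernel, NDRange). The toy kernel and all sizes are
hypothetical, not taken from the talk.

/* Minimal OpenCL host boilerplate: discover a device, build a kernel
   from source at runtime, and launch it over an NDRange. */
#include <CL/cl.h>

const char *kernel_src =
    "__kernel void scale(__global float *x, float a) {"
    "    size_t i = get_global_id(0);"      /* one work-item per element */
    "    x[i] *= a;"
    "}";

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Context and command queue tie the host to one device */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Programs are compiled at runtime: the portability story of OpenCL */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", &err);

    size_t n = 1024;                        /* hypothetical problem size */
    float a = 2.0f;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &a);

    /* NDRange launch: n work-items ("threads"); runtime picks group size */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);
    return 0;
}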
An outline

  OpenCL Introduction

  Copy number inference in tumors

  Data considerations

  Hidden Relatedness

  Variable Selection
Biology background
     DNA
         A string with a four letter alphabet: A,C,G,T
         Humans have two copies: one from mom, one from
         dad
         Most of the sequence between two strands is the
         same, except for a small proportion
     Example sequence: ATATTGC. We could have:

         A single nucleotide polymorphism (common point
         mutation): ATATAGC
          Copy number variants/aberrations
          (deletions, amplifications, translocations):
          AT–GC
         ATATTATTATTGC
SNP microarrays
What is observed

     Microarray output
         Probes are dyed, and microarrays scanned with
         CCD cameras
         X,Y: Intensities of A and B alleles (two possible
         variants)
         R = X+Y: Overall intensity
         LRR (log2 R ratio): Intensity relative to a standard
         intensity
         BAF (B allele frequency): Ratio of allelic intensity
         between A and B
Inferring CNVs from microarray output
Hidden Markov Model
     A formalized statistical model
           We want to use information from observables
           (LRR,BAF) to infer true state of nature (copy
           number, genotype)


    Table: Example hidden states from PennCNV software
   State   CN    possible genotypes
   1       0     Null
   2       1     A,B
   3       2     AA,AB,BB
   4       2     AA,BB
   5       3     AAA,AAB,ABB,BBB
   6       4     AAAA,AAAB,AABB,ABBB,BBBB
Copy number inference in tumors
     Inference is harder!
           1. When dissecting breast tissue, for example,
           stromal (normal cell) contamination is almost
           inevitable. Hence you are modeling a mixture of
           two or more cell populations
                Suppose you have a state assuming normal CN=2,
                tumor CN=4, α = 0.2
                e.g. r_i = α r_{i,n} + (1 − α) r_{i,t}
                expected mean intensity: 0.2(1) + 0.8(1.68) = 1.544
          2. Amplification events can be wilder than
          germline (e.g. blood) events, leading to greater
          copy number/genotype possibilities
          Combine issues 1) and 2) and you can get a huge
          search space
Expanded state space
           ID   BAC_n   CN_t   BAC_t    α      r̄        b̄
           0    0      2     0     0.3     1         0
           1    0      2     0     0.6     1         0
           2    1      2     1     0.3     1        0.5
           3    1      2     1     0.6     1        0.5
           4    2      2     2     0.3     1         1
           5    2      2     2     0.6     1         1
           6    0      1     0     0.3   0.65        0
           7    0      1     0     0.6    0.8        0
           8    0      1     1     0.3   0.65   0.538462
           9    0      1     1     0.6    0.8      0.25
          10    1      1     0     0.3   0.65   0.230769
          11    1      1     0     0.6    0.8     0.375
          12    1      1     1     0.3   0.65   0.769231
          13    1      1     1     0.6    0.8     0.625
          14    2      1     0     0.3   0.65   0.461538
          15    2      1     0     0.6    0.8      0.75
          16    2      1     1     0.3   0.65        1
          17    2      1     1     0.6    0.8        1
          18    0      3     0     0.3   1.35        0
          19    0      3     0     0.6    1.2        0
          20    1      3     1     0.3   1.35    0.37037
          21    1      3     1     0.6    1.2   0.416667
          22    1      3     2     0.3   1.35    0.62963
          23    1      3     2     0.6    1.2   0.583333
          24    2      3     3     0.3   1.35        1
          25    2      3     3     0.6    1.2        1
Algorithm
     Initialize
          Empirically estimate σ of BAF and LRR
          Compute emission matrix O for each state/obs
          from a Gaussian pdf
      Train: Expectation Maximization
           Forward-backward: computes posterior probs and
           overall likelihood
           Baum-Welch: compute MLE of transition
           probabilities in matrix T
     Traverse state path
          Viterbi (dynamic programming): walk the state
          path based on max-product
Parallel Forward Algorithm
     We compute the probability vector at observation
     t: f_{0:t} = f_{0:t−1} T O_t
     Each state (element of the m-state vector) can
     independently compute a sum-product
     Threadblocks map to states
     Threads calculate products in parallel, followed by
     a log_2(m) addition reduction
Technical issue: Underflow


     Tiny probabilities often have to be represented
     in log space (even for FP64)
     How do we deal with adding log probabilities?
         We usually exponentiate, add, then log
      Remedy
          Add an offset to each log value before exponentiating
          Subtract the offset from the log-space answer
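In scalar form, the remedy is the familiar log-sum-exp trick. A minimal
sketch (an illustration, not code from the talk), using the maximum log
probability as the offset:

/* Add probabilities stored in log space without underflowing:
   subtract the max before exponentiating, add it back at the end. */
#include <math.h>

double log_sum(const double *logp, int m) {
    double offset = logp[0];
    for (int i = 1; i < m; ++i)            /* offset = max log-probability */
        if (logp[i] > offset) offset = logp[i];
    double sum = 0.0;
    for (int i = 0; i < m; ++i)
        sum += exp(logp[i] - offset);      /* arguments are <= 0, so safe */
    return offset + log(sum);              /* restore the offset */
}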
Gridblocks: Forward Backward Calculation
Code: Computing products in parallel
Code: 2 Reductions: computing offset,
sum-product
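The two code slides above were images and did not survive extraction.
Below is a hypothetical OpenCL sketch of such a kernel, following the
mapping described earlier: one work-group per destination state j, one
work-item per source state i, products computed in parallel, then two
log_2(m) reductions (a max to find the offset, a sum for the
sum-product). Buffer names are assumptions; the work-group size is
assumed to equal the state count m and to be a power of two.

/* One forward step in log space:
   logf_next[j] = logsumexp_i( logf[i] + logT[i][j] ) + logO_t[j] */
#define MAX_STATES 512

__kernel void forward_step(__global const float *logf,    /* f_{0:t-1}, length m */
                           __global const float *logT,    /* m x m transitions   */
                           __global const float *logO_t,  /* emissions at obs t  */
                           __global float *logf_next,     /* f_{0:t}, length m   */
                           int m) {
    int j   = get_group_id(0);             /* destination state */
    int i   = get_local_id(0);             /* source state      */
    int lsz = get_local_size(0);           /* assumed == m      */
    __local float lp[MAX_STATES];
    __local float offset;

    lp[i] = logf[i] + logT[i * m + j];     /* log of one product term */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Reduction 1: max over i gives the offset */
    for (int s = lsz / 2; s > 0; s >>= 1) {
        if (i < s) lp[i] = fmax(lp[i], lp[i + s]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (i == 0) offset = lp[0];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Reduction 2: sum of exp(log term - offset) */
    lp[i] = exp(logf[i] + logT[i * m + j] - offset);
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int s = lsz / 2; s > 0; s >>= 1) {
        if (i < s) lp[i] += lp[i + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (i == 0)
        logf_next[j] = offset + log(lp[0]) + logO_t[j];
}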
Algorithm Improvements
     Examples:
     Re-scaling transition matrix (accounting for
     SNP spacing)
           Serial: O(2nm^2); Parallel: O(n)
      Forward-backward
           Serial: O(2nm^2); Parallel: O(n log_2(m))
      Viterbi
           Serial: O(nm^2); Parallel: O(n log_2(m))
      Normalizing constant (Baum-Welch)
           Serial: O(nm); Parallel: O(log_2(n))
      MLE of transition matrix (Baum-Welch)
           Serial: O(nm^2); Parallel: O(n)
Performance



      Table: One EM iteration on Chr 1 (41,263 SNPs)
       states    CPU       GPU       fold-speedup
        128      9.5m      37s           15x
        512      2h 35m    1m 44s       108x
An outline

  OpenCL Introduction

  Copy number inference in tumors

  Data considerations

  Hidden Relatedness

  Variable Selection
Storing data

     Global memory
         Relatively abundant, but slow
         However, even 4GB may be insufficient for modern
         datasets
     Genotype data
         Highly compressible
         We only care if a position differs from the
         canonical sequence
         Thus: AA,AB,BB,NULL are 4 possible genotypes
         Should be able to encode this into two bits, so 4
         genotypes per byte
Possible approaches
     Store as a float array
         +: Easy to implement
         -: Uses 16 times as much memory as needed!
     Store as an int array
         Allocate a local memory array of 256 rows, 4 cols
         for mapping all possible genotype 4-tuples
         +: Uses global memory efficiently, maximizes
         bandwidth
          -: You might not even have enough local memory,
          let alone any left for real work
      Store as a char array
          Right-bitshift pairs of bits, then AND-mask with 3
          +: Uses global memory efficiently, saves on local
          memory
          -: Threads load a minimum of 4 bytes per word, so
          you use only 25% of available bandwidth
One solution: custom container




      Idea:
              Designate each threadblock to handle 512
              genotypes
              First 32 threads: each loads a packedgeno_t
              element
      For each of the 32 threads:
              Loop four times, extracting each char
              Subloop four times, extracting each genotype via
              bitshift/mask
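A hypothetical OpenCL sketch of this container; the type name
packedgeno_t comes from the slide, while the layout, the work-group
size of 512, and all buffer names are assumptions.

/* Each work-group unpacks a 512-genotype tile: 32 packed elements x
   4 chars x 4 two-bit genotypes = 512 values in local memory. */
typedef struct { uchar geno[4]; } packedgeno_t;

__kernel void unpack_tile(__global const packedgeno_t *packed,
                          __local float *geno_tile,    /* 512 floats */
                          __global float *out) {
    int tid   = get_local_id(0);           /* work-group size assumed 512 */
    int block = get_group_id(0);

    if (tid < 32) {                        /* first 32 threads load */
        packedgeno_t p = packed[block * 32 + tid];   /* one 4-byte load */
        for (int c = 0; c < 4; ++c) {                /* each char */
            uchar b = p.geno[c];
            for (int g = 0; g < 4; ++g)              /* each genotype */
                geno_tile[tid * 16 + c * 4 + g] =
                    (float)((b >> (2 * g)) & 3);     /* shift, AND with 3 */
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    /* ... real work would use geno_tile here; copy out to illustrate */
    out[block * 512 + tid] = geno_tile[tid];
}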
Illustration
An outline

  OpenCL Introduction

  Copy number inference in tumors

  Data considerations

  Hidden Relatedness

  Variable Selection
Inferring Relatedness


     Inferring relatedness
     The human race is one large pedigree
     Individuals of the same ethnicity are expected
     to share more SNP alleles
     We can summarize this relationship through a
     correlation matrix called 'K'
Uses for the 'K' matrix

     Principal Components Analysis
           A singular value decomposition on 'K'
           K = V D V^T
           V contains orthogonal axes, facilitating population
           structure inference
      Estimating heritability
           In random effects models
           Y = µ + βX + γ^2 K + σ^2 I
           h^2 = γ^2 / (γ^2 + σ^2)
Example: Latino samples in LA
Computing K
    Essentially a matrix multiplication
         K̂_jk = (1/m) Σ_{i=1}^{m} (x_ij − 2f_i)(x_ik − 2f_i) / (4 f_i (1 − f_i))
         Or in other words: K = ZZ'
         Including more SNPs adds precision and picks up
         subtler relatedness
    Parallel code
        Carrying out matrix multiplication is
        straightforward on GPU
        Matrix multiplication is ideal for GPU: Approx.
        240x speedup.
        Because K is summed over SNPs, we can split
        genotype matrix by subsets of SNPs and run each
        K slice in parallel
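As an illustration, one K slice reduces to the kernel sketched below; a
production version would tile Z through local memory (which is what the
quoted ~240x matmul speedups rely on). The layout and all names are
assumptions.

/* Partial GRM for one slice of m_slice SNPs: with Z stored SNP-major
   (m_slice x n), the n x n partial matrix is Z'Z summed over the slice,
   where z_ij = (x_ij - 2 f_i) / sqrt(4 f_i (1 - f_i)). */
__kernel void k_slice(__global const float *Z,   /* m_slice x n, SNP-major */
                      __global float *K_part,    /* n x n partial matrix   */
                      int n, int m_slice) {
    int j = get_global_id(0);                    /* subject 1 */
    int k = get_global_id(1);                    /* subject 2 */
    if (j >= n || k >= n) return;
    float sum = 0.0f;
    for (int i = 0; i < m_slice; ++i)            /* sum over SNPs in slice */
        sum += Z[i * n + j] * Z[i * n + k];
    K_part[j * n + k] = sum;   /* host sums slices and divides by total m */
}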
An outline

  OpenCL Introduction

  Copy number inference in tumors

  Data considerations

  Hidden Relatedness

  Variable Selection
Variable Selection


     One goal in biomedical research is correlating
     DNA variation to disease phenotypes
     Genomics technology
         The number of subjects n remains about the same
         (cost of recruiting, sample preps, etc), while
         number of features p is exploding
          The rate at which data is generated per dollar
          surpasses Moore's Law
Regression

     Standard logistic regression
         The usual method for hypothesis testing of
         candidate predictors
          log(p/(1−p)) = βX, p being the probability of affection
          We apply Newton-Raphson scoring until f(β) is
          maximized
          Logistic regression simply fails when p > n
      L1 penalized regression, aka LASSO
          Idea: Fit the logistic regression model, but subject
          to a penalty parameter λ
          g(β) = f(β) − λ Σ_{j=1}^{p} |β_j|
Algorithms for fitting the LASSO
        One-dimensional Newton-Raphson at variable j:
             Cyclic Coordinate Descent
             Δβ_j = β_j^(new) − β_j = g′(β_j) / g″(β_j)
             g′(β_j) = Σ_{i=1}^{n} x_{i,j} y_i / (1 + exp(x_{i,j} β_j y_i)) − sgn(β_j) λ
             g″(β_j) = Σ_{i=1}^{n} x_{i,j}^2 exp(x_{i,j} β_j y_i) / (1 + exp(x_{i,j} β_j y_i))^2
       We cycle through each j until likelihood stops
       increasing within some tolerance
       Performs great, but only allows parallelization
       across samples
   ref: Genkin, Lewis, Madigan. Technometrics 2007, Vol. 49, No. 3
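In serial form, one coordinate update might look like the sketch below.
It follows the formulas as displayed above (labels y_i in {-1, +1}, only
beta_j's own contribution in the exponent); it is an illustration of the
update, not the cited implementation.

/* One Newton-Raphson step for the LASSO-penalized logistic model at
   variable j; xj is column j of the design matrix. The beta_j == 0
   subgradient case is glossed over in this sketch. */
#include <math.h>

double ccd_update(const double *xj, const double *y, int n,
                  double beta_j, double lambda) {
    double g1 = 0.0, g2 = 0.0;                /* g'(beta_j), g''(beta_j) */
    for (int i = 0; i < n; ++i) {
        double e = exp(xj[i] * beta_j * y[i]);
        g1 += xj[i] * y[i] / (1.0 + e);
        g2 += xj[i] * xj[i] * e / ((1.0 + e) * (1.0 + e));
    }
    g1 -= (beta_j > 0 ? lambda : -lambda);    /* - sgn(beta_j) * lambda */
    return beta_j + g1 / g2;                  /* delta = g'/g'' as above */
}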
Distributed GPU implementation


      If it is possible to parallelize across variables,
      it is worth splitting up the design matrix
     For really large dimensions, we can link up an
     arbitrary number of GPUs
     Message Passing Interface allows us to be
     agnostic to physical location of GPU devices
Distributed GPU implementation

     Approach:
         MPI master node delegates heavy lifting to slaves
         across network
          Master node performs the fast serial work, such as
          sampling new λ, comparing logLs, broadcasting
          gradients, etc.
         Network traffic is kept to a minimum
         Implemented for Greedy Coordinate Descent and
         Gradient Descent
         Developed on server at USC Epigenome Center: 2
         Tesla C2050s
Parallel algorithms for fitting the LASSO
       Greedy coordinate descent (ref)
             Same algorithm as CCD, except in each variable
             sweep, update only the j that gives the greatest
             increase in logL
            No dependencies between subjects and variables,
            massive parallelization across subjects AND
            variables
             Ideal if you have a huge dataset, and you want a
             stringent type 1 error rate (you only care about a
             few variables)
            Ayers and Cordell, Gen Epi 2010: Permute, and
            pick largest λ that allows first “false” variable to
            enter
  ref: Wu, Lange: Annals Appl Stat 2008 Vol 2,No. 1
Layout for greedy coordinate descent
implementation
Overview of Greedy CD algorithm
     Newton-Raphson kernel
           Each threadblock maps to a block of 512 subjects
           (threads) for 1 variable (see the sketch below)
          Each thread calculates subject’s contribution to
          gradient and hessian
          Sum (reduction) across 512 subjects
           Sum (reduction) across subject blocks in a new
           kernel
     Compute log-likelihood change for each
     variable (like above).
      Apply a max operator (log_2 reduction) to
      select the variable with the greatest contribution
      to the likelihood.
      Iterate until the likelihood increase is less
      than epsilon
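A hypothetical sketch of the Newton-Raphson kernel just described. For
clarity it reads an uncompressed, SNP-major float matrix (so that
consecutive subjects coalesce) rather than the packed container from
earlier; all buffer names are placeholders, and the follow-up kernel
that sums the per-block partials is not shown.

/* 2-D launch: group id 0 = subject block, group id 1 = variable j.
   Each work-item adds one subject's gradient/hessian contribution,
   then a log_2(512) reduction sums the block. */
#define BLOCK 512

__kernel void grad_hess(__global const float *x,     /* p x n, SNP-major */
                        __global const float *y,     /* labels, +1/-1    */
                        __global const float *beta,
                        __global float *grad_part,   /* per-block partials */
                        __global float *hess_part,
                        int n, int nblocks) {
    int blk = get_group_id(0);
    int j   = get_group_id(1);
    int tid = get_local_id(0);
    int i   = blk * BLOCK + tid;                     /* subject index */
    __local float lg[BLOCK], lh[BLOCK];

    float g = 0.0f, h = 0.0f;
    if (i < n) {
        float xij = x[(size_t)j * n + i];            /* coalesced read */
        float e   = exp(xij * beta[j] * y[i]);
        g = xij * y[i] / (1.0f + e);                 /* gradient term */
        h = xij * xij * e / ((1.0f + e) * (1.0f + e)); /* hessian term */
    }
    lg[tid] = g; lh[tid] = h;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int s = BLOCK / 2; s > 0; s >>= 1) {        /* 9-step reduction */
        if (tid < s) { lg[tid] += lg[tid + s]; lh[tid] += lh[tid + s]; }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (tid == 0) {
        grad_part[(size_t)j * nblocks + blk] = lg[0];
        hess_part[(size_t)j * nblocks + blk] = lh[0];
    }
}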
Evaluation on large dataset
     GWAS data
         6,806 subjects in a case control study of prostate
         cancer
         1,047,986 SNPs typed
     Invoke approx. 7 billion threads per iteration
     Total walltime for 1 GCD iteration (sweep
     across all variables)
         15 minutes on optimized serial implementation
         split across 2 slave CPUs
         5.8 seconds on parallel implementation across 2
         nVidia Tesla C2050 GPU devices
          155x speedup
Parallel algorithms for fitting the LASSO
       (Stochastic Mirror) Gradient Descent (ref)
              Sometimes we are interested in tuning λ for,
              say, the best cross-validation error
              Greedy descent seems awfully wasteful in that
              only one β_j is updated
              However, we can update all variables in parallel
              while cycling through subjects
       Algorithm
              Extremely simple:
              For subject i: gradient g_i = −y_i / (1 + exp(x_i β y_i))
              Update the β vector: β_j ← β_j − η g_i x_{i,j}
              η is a learning parameter, set sufficiently small
              (e.g. 0.0001)
   ref: Shalev-Shwartz, Tewari: Proc. 26th Intern. Conf. Machine
   Learning 2009
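A sketch of the per-subject update: once g_i for the current subject is
known, every work-item updates one coordinate of β in parallel. It shows
only the plain gradient step (the mirror-descent bookkeeping and the λ
penalty from the cited paper are omitted), and all names are
placeholders.

/* Update all p coordinates of beta for subject i in one pass. */
__kernel void sgd_update(__global float *beta,
                         __global const float *x_i, /* subject i's row, length p */
                         float g_i,    /* -y_i / (1 + exp(x_i . beta * y_i)) */
                         float eta,    /* learning rate, e.g. 1e-4 */
                         int p) {
    int j = get_global_id(0);
    if (j < p)
        beta[j] -= eta * g_i * x_i[j];
}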
Gradient descent
     Performance
         Slow convergence compared to serial cyclic
         coordinate descent, but far more scalable
         For large lambdas, slower than greedy coordinate
         descent
          Computation-to-bandwidth ratio is not great
         For 1 million SNPs, only about 15x speedup. Far
         more SNPs are needed
     Technical issues
          Must store genotypes in subject-major order to
          enable coalesced memory loads/stores
          This makes SNP-level summaries like means and SDs
          difficult to compute
          Heterogeneous data types: floats (E, ExG),
          compressed chars (G, GxG)
          Memory constrained: interactions can be computed
          on the fly with SNP-major storage
Potential for robust variable selection:

      Subsampling:
           Applying the LASSO once overfits the data; model
           selection is inconsistent
          Subsampling is preferable: Bootstrapping, stability
          selection, x-fold cross validation
          Number of replicates << number of samples <<
          number of features
      Bayesian variable selection:
           If we assume the β_LASSO are conditionally independent
           The master node can (quickly) sample hyperparameters
           (e.g. λ) from a prior distribution
