Using multi-core algorithms to speed up optimization

Gary K. Chen
Biostat Noon Seminar
March 23, 2011
An outline

    Introduction to high-performance computing
    Concepts
    Example 1: Hidden Markov Model Training
    Example 2: Regularized Logistic Regression
    Closing remarks
CPUs are not getting any faster

    Heat and power are the sole obstacles
        According to Intel: underclock a single core by 20 percent and
        you save half the power while sacrificing only 13 percent of
        the performance.
        Implication? Two cores at the same power have ~73% more
        performance: $(100 - 13) \times 2 / 100$
1. High performance computing clusters

    Coarse-grained, aka "embarrassingly parallel", problems
        1. Launch multiple instances of the program
        2. Compute summary statistics across log files
    Examples
        Monte Carlo simulations (power/specificity), GWAS scans,
        imputation, etc.
    Remarks
        Pros: maximizes throughput (CPUs kept busy), gentle learning curve
        Cons: doesn't address some interesting computational problems
Cluster Resource Example

    HPCC at USC
        94 teraflop cluster
        1,980 simultaneous processes running on the main queue
        Jobs are asynchronous; they can start and end in any order
    Portable Batch System
        Simply prepend some headers to your shell script describing how
        much memory you want, how long your job will run, etc.
2. High performance computing clusters

    Tightly-coupled parallel programs
    Message Passing Interface (MPI; a minimal sketch follows below)
        1. Programs are distributed across multiple physical hosts
        2. Each program executes the exact same code
        3. All processes can be synchronized at strategic points
    Remarks
        Pro: can run interesting algorithms like parallel tempered MCMC
        Con: developer is responsible for establishing a communication
        protocol
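A minimal MPI sketch of this model (my illustration, not code from the talk; run_chain is a hypothetical stand-in for real per-host work such as one tempered MCMC chain):

    #include <mpi.h>
    #include <stdio.h>

    /* hypothetical stand-in for per-host work, e.g. one tempered chain */
    double run_chain(int seed) { return (double)seed; }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this host's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of hosts */

        /* every rank executes this exact same code on its own host */
        double local = run_chain(rank);

        /* a strategic synchronization point: combine across all ranks */
        double sum;
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("mean across %d ranks: %f\n", size, sum / size);
        MPI_Finalize();
        return 0;
    }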
Exploiting multiple-core processors

    Fine-grained parallelism
        Suggests a much higher degree of inter-dependence between
        processes
        A "master" process executes the majority of the code base;
        "slave" processes are invoked to ease bottlenecks
        We hope to minimize the time spent in the master process
        Some Bayesian algorithms stand to benefit
Amdahl's Law

    $\text{speedup} = \dfrac{1}{(1 - P) + \frac{P}{N}}$

    where $P$ is the fraction of the program that can be parallelized
    and $N$ is the number of processors.
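A worked example (numbers mine, for illustration): if $P = 0.95$ of the runtime is parallelizable and $N = 100$ cores are available,

    $\text{speedup} = \dfrac{1}{(1 - 0.95) + 0.95/100} = \dfrac{1}{0.0595} \approx 16.8\times$

so the 5% serial fraction, not the core count, caps the speedup.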
Heterogeneous Computing

    [Figure]
Multi-core programming

    aka data-parallel programming
    Built in to common compilers (e.g. gcc)
        Very easy to get started!
        SSE (Streaming SIMD Extensions): each core can do vector
        operations
        OpenMP: parallel processing across multiple cores
        e.g. simply insert a "pragma omp for" directive and compile
        with gcc! (a minimal sketch follows below)
    CUDA/OpenCL
        CUDA is a proprietary C-based language endorsed by nVidia
        OpenCL: a standards-based implementation backed by the Khronos
        Group
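A minimal OpenMP sketch (my example, not from the slides), compiled with gcc -fopenmp; the single directive splits the loop's iterations across cores:

    #include <omp.h>

    /* scale-and-add: y <- y + a*x, parallelized across cores */
    void axpy(int n, double a, const double *x, double *y) {
        #pragma omp parallel for   /* the "omp for" directive above */
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];      /* iterations divided among threads */
    }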
OpenCL and CUDA

    CUDA
        Powerful libraries available to enrich productivity
        Thrust: C++ generics; cuBLAS: Level 1 and 2 parallel BLAS
        (a Thrust sketch follows below)
        Supported only on nVidia GPU devices
    OpenCL
        Compatible with nVidia and ATI GPU devices, as well as
        AMD/Intel CPUs
        Lags behind CUDA in libraries and tools
        Good to work with, given ATI hardware currently leads in value
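A small Thrust sketch (my example; gpu_sum is a hypothetical helper): the C++ generics hide the kernel launches behind STL-style algorithms:

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>

    float gpu_sum(const float *host_x, int n) {
        thrust::device_vector<float> d(host_x, host_x + n); // copy to GPU
        // parallel reduction on the device; no hand-written kernel needed
        return thrust::reduce(d.begin(), d.end(), 0.0f,
                              thrust::plus<float>());
    }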
A $60 HPC under your desk

    [Figure]
An outline (next: Concepts)
Threads and threadblocks

    Threads:
        Perform a very limited function, but do all the heavy lifting
        Are extremely lightweight, so you'll want to launch thousands
    Threadblocks:
        Developer assigns threads that can cooperate on a common task
        into threadblocks
        Threadblocks cannot communicate with one another and run in any
        order (asynchronously)
    (a minimal kernel sketch follows below)
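A minimal kernel sketch (my illustration): each of potentially thousands of lightweight threads handles one element, and the threadblocks covering the array can be scheduled in any order:

    __global__ void vec_add(const float *a, const float *b,
                            float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // global index
        if (i < n) c[i] = a[i] + b[i];                 // one element/thread
    }

    // launch enough 256-thread blocks to cover all n elements:
    // vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);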
Thread organization

    [Figure]
Memory hierarchy

    [Figure]
Kernels

    Warps/Wavefronts:
        An atomic set of threads (32 for nVidia, 64 for ATI)
        Instructions are executed in lock step across the set, each
        thread processing a distinct data element
        Developer is responsible for synchronizing across warps
        (see the sketch below)
    Kernels:
        Code that the developer writes, which can execute on a SIMD
        device
        Essentially C functions
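A sketch of cross-warp synchronization (my example): threads stage data in shared memory, and __syncthreads() ensures every warp in the block has written before any thread reads another warp's values:

    __global__ void block_reverse(float *x) {
        __shared__ float tile[256];             // shared by one block
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;     // assumes blockDim.x == 256
        tile[t] = x[base + t];                  // each warp writes its slice
        __syncthreads();                        // wait for ALL warps
        x[base + t] = tile[blockDim.x - 1 - t]; // safe to read across warps
    }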
An outline (next: Example 1: Hidden Markov Model Training)
Hidden Markov Models

    A staple in machine learning
    Many applications in statistical genetics, including imputation of
    untyped genotypes, local ancestry, and sequence alignment (e.g.
    protein family scoring)
Application to cancer tumor data

    Extending PennCNV
        Tissues are assumed to be a mixture of tumor/normal cells
        Tumors are assumed to be heterogeneous in copy number (CN)
        across cells, implying fractional copy number states
        PennCNV defines 6 hidden integer states for normal cells and
        does not infer allelic state
        We can make more precise estimates of both copy number and
        allelic state in tumors with little sacrifice in performance
        Copy number: $z = (1 - \alpha)\,z_{normal} + \alpha\,z_{tumor}$
        $z$ is fractional, whereas
        $z_{tumor} = I(z \le 2)\lfloor z \rfloor + I(z > 2)\lceil z \rceil$
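An illustrative calculation (numbers mine): a tissue that is half tumor ($\alpha = 0.5$) with $z_{tumor} = 3$ and $z_{normal} = 2$ gives the fractional copy number $z = (1 - 0.5)\cdot 2 + 0.5 \cdot 3 = 2.5$, which corresponds to the CNfrac = 2.5, CNtumor = 3 rows of the state space below.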
State Space

    state   CNfrac   BACnormal   CNtumor   BACtumor
    0       2        0           2         0
    1       2        1           2         1
    2       2        2           2         2
    3       0        0           0         0
    4       0        1           0         0
    5       0        2           0         0
    6       0.5      0           0         0
    7       0.5      1           0         0
    8       0.5      2           0         0
    9       1        0           1         0
    10      1        1           1         0
    11      1        1           1         1
    12      1        2           1         1
    13      1.5      0           1         0
    14      1.5      1           1         0
    15      1.5      1           1         1
    16      1.5      2           1         1
    17      2.5      0           3         0
    18      2.5      1           3         1
    19      2.5      1           3         2
    20      2.5      2           3         3
    21      3        0           4         0
    22      3        1           4         1
    23      3        1           4         2
    24      3        1           4         3
    25      3        2           4         4
    26      3.5      0           4         0
    27      3.5      1           4         1
    28      3.5      1           4         2
    29      3.5      1           4         3
    30      3.5      2           4         4
Training a Hidden Markov Model

    Objective: infer the probabilities of transitioning between any
    pair of states
    Apply the forward-backward and Baum-Welch algorithms
        A special case of the Expectation-Maximization (or, more
        generally, MM) family of algorithms
        Expectation step: forward-backward computes posterior
        probabilities based on estimated parameters
        Maximization step: Baum-Welch empirically estimates parameters
        by averaging across observations
Forward algorithm

    We compute the probability vector at observation t:
    $f_{0:t} = f_{0:t-1}\, T\, O_t$
    Each state (element of the m-state vector) can independently
    compute a sum-product
    Threadblocks map to states
    Threads calculate products in parallel, followed by a $\log_2(m)$
    addition reduction (see the kernel sketch below)
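A sketch of this mapping (my reconstruction, not the author's code; assumes m is a power of two and m <= 1024 so one block holds m threads):

    // one threadblock per destination state j; m threads per block
    __global__ void forward_step(const float *f_prev, const float *T,
                                 const float *emit_t, float *f_curr,
                                 int m) {
        extern __shared__ float prod[];        // m floats per block
        int j = blockIdx.x;                    // destination state
        int i = threadIdx.x;                   // source state
        prod[i] = f_prev[i] * T[i * m + j];    // products in parallel
        __syncthreads();
        for (int s = m / 2; s > 0; s >>= 1) {  // log2(m) addition reduction
            if (i < s) prod[i] += prod[i + s];
            __syncthreads();
        }
        if (i == 0) f_curr[j] = prod[0] * emit_t[j]; // apply emission O_t
    }

    // launch: forward_step<<<m, m, m * sizeof(float)>>>(f_prev, T,
    //                                                   emit, f_curr, m);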
Gridblock of threadblocks

    [Figure]
Speedups

    We implement 8 kernels. Examples:
    Re-scaling the transition matrix (for SNP spacing)
        Serial: $O(2nm^2)$; Parallel: $O(n)$
    Forward-backward
        Serial: $O(2nm^2)$; Parallel: $O(n \log_2 m)$
    Normalizing constant (Baum-Welch)
        Serial: $O(nm)$; Parallel: $O(\log_2 n)$
    MLE of transition matrix (Baum-Welch)
        Serial: $O(nm^2)$; Parallel: $O(n)$
Run time comparison

Table: 1 iteration of HMM training on Chr 1 (41,263 SNPs)

    states   CPU      GPU      fold-speedup
    128      9.5m     37s      15x
    512      2h 35m   1m 44s   108x
An outline (next: Example 2: Regularized Logistic Regression)
Regularized Regression

    Variable Selection
        For tractability, most GWAS analyses entail separate univariate
        tests of each variable (e.g. SNP, GxG, GxE)
        However, it is preferable to model all variables simultaneously
        to tease apart correlated variables
        This is problematic when p > n: parameters are unestimable, and
        matrix inversion becomes computationally intractable
Regularized Regression

    The LASSO method (Tibshirani, 1996)
        Seeded a cottage industry of related methods, e.g. Group LASSO,
        Elastic Net, MCP, NEG, Overlap LASSO, Graph LASSO
        Fundamentally solves the variable selection problem by
        introducing an L1 penalty to induce sparsity
    Limitations: these methods do not provide a mechanism for
    hypothesis testing (e.g. p-values)
Regularized Regression

    Bayesian methods
        Posterior inferences on β
        e.g. Bayesian LASSO, Bayesian Elastic Net
        Highly computational; scaling up to genome-wide data is not
        obvious
        MCMC is inherently serial, so the best option is to speed up
        the sampling chain itself
        Proposal: implement the key bottleneck, fitting β_LASSO to the
        data, on the GPU
Optimization

    For binomial logistic regression:
        $L(\beta) = \sum_{i=1}^{n} [y_i \log p_i + (1 - y_i)\log(1 - p_i)]$
        $p_i = \dfrac{e^{\mu + x_i^t \beta}}{1 + e^{\mu + x_i^t \beta}}$
        $\nabla L(\beta) = \sum_{i=1}^{n} [y_i - p_i(\beta)]\, x_i$
        $-d^2 L(\beta) = \sum_{i=1}^{n} p_i(\beta)[1 - p_i(\beta)]\, x_i x_i^t$
    For *penalized* regression:
        $f(\beta) = L(\beta) - \lambda \sum_{j=1}^{p} |\beta_j|$
    Find the global maximum by applying Newton-Raphson one variable at
    a time:
        $\beta_j^{m+1} = \beta_j^m + \dfrac{\sum_{i=1}^{n} [y_i - p_i(\beta^m)]\, x_{ij} - \lambda\,\mathrm{sgn}(\beta_j^m)}{\sum_{i=1}^{n} p_i(\beta^m)[1 - p_i(\beta^m)]\, x_{ij}^2}$
Overview of algorithm

    Newton-Raphson kernel (a sketch follows below)
        Each threadblock maps to a block of 512 subjects (threads) for
        1 variable
        Each thread calculates its subject's contribution to the
        gradient and Hessian
        Sum (reduction) across the 512 subjects
        Sum (reduction) across subject blocks in a new kernel
    Compute the log-likelihood change for each variable (as above)
    Apply a max operator (a log2 reduction) to select the variable with
    the greatest contribution to the likelihood
    Iterate until the likelihood increase is less than epsilon
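A sketch of the per-variable kernel (my reconstruction, not the author's code; x here is the single variable's column of covariate values):

    __global__ void newton_terms(const float *p, const float *y,
                                 const float *x, float *grad_part,
                                 float *hess_part, int n) {
        __shared__ float g[512], h[512];         // one block = 512 subjects
        int i = blockIdx.x * 512 + threadIdx.x;  // subject index
        float xi = (i < n) ? x[i] : 0.0f;        // pad past end with zeros
        float pi = (i < n) ? p[i] : 0.0f;
        float yi = (i < n) ? y[i] : 0.0f;
        g[threadIdx.x] = (yi - pi) * xi;             // gradient term
        h[threadIdx.x] = pi * (1.0f - pi) * xi * xi; // Hessian term
        __syncthreads();
        for (int s = 256; s > 0; s >>= 1) {      // reduce across 512 subjects
            if (threadIdx.x < s) {
                g[threadIdx.x] += g[threadIdx.x + s];
                h[threadIdx.x] += h[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {                  // one partial sum per block;
            grad_part[blockIdx.x] = g[0];        // a second kernel reduces
            hess_part[blockIdx.x] = h[0];        // across subject blocks
        }
    }

    // launch: newton_terms<<<(n + 511) / 512, 512>>>(p, y, x, gp, hp, n);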
Gridblock of threadblocks

    [Figure]
Consideration of datatypes

    Need to compress genotypes
        Why? Global memory is scarce and bandwidth is expensive
        A warp of 32 threads loads 32 words (containing 512 genotypes)
        into local memory
        (an unpacking sketch follows below)
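A sketch of the compression scheme (my illustration, assuming a 2-bit-per-genotype encoding, i.e. 16 genotypes per 32-bit word, so a warp's 32 words carry 512 genotypes):

    // extract the j-th genotype (a 2-bit code) from an array of packed words
    __device__ inline int unpack_genotype(const unsigned int *packed,
                                          int j) {
        unsigned int word = packed[j >> 4];     // 16 genotypes per word
        return (word >> ((j & 15) << 1)) & 3u;  // shift to the 2-bit field
    }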
Distributed GPU implementation

    For really large dimensions, we can link up an arbitrary number of
    GPUs
    MPI allows us to spread work across a cluster
    Developed on Epigraph: 2 Tesla C2050s
    Approach (sketched below)
        The MPI master node delegates heavy lifting to slaves across
        the network
        The master node performs fast serial code, such as sampling
        from the full conditional likelihood of any penalty parameter
        (e.g. λ)
        To minimize network traffic, slaves must maintain up-to-date
        copies of the data structures
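A sketch of the communication pattern (my reconstruction): because slaves keep current copies of the data, each iteration only moves a tiny update plus reduced sums:

    #include <mpi.h>

    typedef struct { int j; double beta_j; } Update; /* hypothetical message */

    /* master (rank 0) broadcasts the accepted update; every slave applies
       it to its local copy (assumes a homogeneous cluster) */
    void broadcast_update(Update *u) {
        MPI_Bcast(u, sizeof(Update), MPI_BYTE, 0, MPI_COMM_WORLD);
    }

    /* each slave reduces its GPU's partial sums, then combines over MPI */
    double global_sum(double local_sum) {
        double total;
        MPI_Allreduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        return total;
    }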
Evaluation on large dataset

    GWAS data
        6,806 African American subjects in a case-control study of
        prostate cancer
        1,047,986 SNPs typed
    Elapsed walltime for 1 LASSO iteration (sweep across all variables)
        15 minutes for the optimized serial implementation across 2
        slave CPUs
        5.8 seconds for the parallel implementation across 2 nVidia
        Tesla C2050 GPU devices
        A 155x speedup
An outline (next: Closing remarks)
Conclusion

    Multicore programming is not a panacea
        Insufficient parallelism leads to an inferior implementation
        Graph algorithms *generally* do not map well to SIMD
        architectures
    Programming effort
        Expect to spend at least 90% of your time debugging a black box
        Is it worth it? Human time > computer time?
        For generic problems (matrix multiplication, sorting),
        absolutely
        OpenCL is a bit more verbose than CUDA, but is more portable
Potential Future Work

    Reconstructing Bayesian networks
        Compute the joint probability for each possible topology
        Code the graph as a sparse adjacency matrix
    Approximate Bayesian Computation
        Sample θ from some assumed prior distribution
        Generate a dataset conditional on θ
        Examine how close the fake data is to the real data
Tomorrow's clusters will require heterogeneous programming
Tianhe-1A

    World's fastest supercomputer
        4.7 petaflops (quadrillion floating point operations/sec)
        14,336 Xeon CPUs, 7,168 Tesla M2050s
    According to nVidia
        A CPU-only equivalent: 50k CPUs and twice the floor space
        A CPU-only equivalent: 12 megawatts, compared to 4.04 megawatts
        $88 million to build, $20 million for annual energy costs
Thanks to

    Kai: ideas for the CNV analysis
    Duncan, Wei: discussions on the LASSO
    Tim, Zack: access to Epigraph
    Alex, James: lively HPC discussions/debates

(Slides) Task scheduling algorithm for multicore processor system for minimiz...
 

Ähnlich wie Multi-core programming talk for weekly biostat seminar

Computer Architecture Seminar
Computer Architecture SeminarComputer Architecture Seminar
Computer Architecture Seminar
Naman Kumar
 
Interprocedural Constant Propagation
Interprocedural Constant PropagationInterprocedural Constant Propagation
Interprocedural Constant Propagation
james marioki
 

Ähnlich wie Multi-core programming talk for weekly biostat seminar (20)

Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
 
Lecture8 - From CBR to IBk
Lecture8 - From CBR to IBkLecture8 - From CBR to IBk
Lecture8 - From CBR to IBk
 
Calibration of Deployment Simulation Models - A Multi-Paradigm Modelling Appr...
Calibration of Deployment Simulation Models - A Multi-Paradigm Modelling Appr...Calibration of Deployment Simulation Models - A Multi-Paradigm Modelling Appr...
Calibration of Deployment Simulation Models - A Multi-Paradigm Modelling Appr...
 
Kinetics reaction scheme_v1.2
Kinetics reaction scheme_v1.2Kinetics reaction scheme_v1.2
Kinetics reaction scheme_v1.2
 
ICST11.ppt
ICST11.pptICST11.ppt
ICST11.ppt
 
Independent tasks scheduling based on genetic
Independent tasks scheduling based on geneticIndependent tasks scheduling based on genetic
Independent tasks scheduling based on genetic
 
Profit based unit commitment for GENCOs using Parallel PSO in a distributed c...
Profit based unit commitment for GENCOs using Parallel PSO in a distributed c...Profit based unit commitment for GENCOs using Parallel PSO in a distributed c...
Profit based unit commitment for GENCOs using Parallel PSO in a distributed c...
 
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement ProblemEvolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
 
Computer Architecture Seminar
Computer Architecture SeminarComputer Architecture Seminar
Computer Architecture Seminar
 
Interprocedural Constant Propagation
Interprocedural Constant PropagationInterprocedural Constant Propagation
Interprocedural Constant Propagation
 
Shanghai Automotive - Application of Process Automation and Optimisation
Shanghai Automotive - Application of Process Automation and OptimisationShanghai Automotive - Application of Process Automation and Optimisation
Shanghai Automotive - Application of Process Automation and Optimisation
 
V35 59
V35 59V35 59
V35 59
 
Early Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic ComputingEarly Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic Computing
 
Coca1
Coca1Coca1
Coca1
 
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
 
Genetic Algorithms and Genetic Programming for Multiscale Modeling
Genetic Algorithms and Genetic Programming for Multiscale ModelingGenetic Algorithms and Genetic Programming for Multiscale Modeling
Genetic Algorithms and Genetic Programming for Multiscale Modeling
 
Kinetics reaction scheme_v1 3
Kinetics reaction scheme_v1 3Kinetics reaction scheme_v1 3
Kinetics reaction scheme_v1 3
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
 
Enery efficient data prefetching
Enery efficient data prefetchingEnery efficient data prefetching
Enery efficient data prefetching
 
Implementation of area optimized low power multiplication and accumulation
Implementation of area optimized low power multiplication and accumulationImplementation of area optimized low power multiplication and accumulation
Implementation of area optimized low power multiplication and accumulation
 

Mehr von USC

Pathway talk for IGES 2009 Hawaii
Pathway talk for IGES 2009 HawaiiPathway talk for IGES 2009 Hawaii
Pathway talk for IGES 2009 Hawaii
USC
 
Kinship adjusted armitage trend test for ENDGAME meeting 2008
Kinship adjusted armitage trend test for ENDGAME meeting 2008Kinship adjusted armitage trend test for ENDGAME meeting 2008
Kinship adjusted armitage trend test for ENDGAME meeting 2008
USC
 
Integration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingIntegration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modeling
USC
 
Analysis update for GENEVA meeting 2011
Analysis update for GENEVA meeting 2011Analysis update for GENEVA meeting 2011
Analysis update for GENEVA meeting 2011
USC
 
OpenCL applications in genomics
OpenCL applications in genomicsOpenCL applications in genomics
OpenCL applications in genomics
USC
 

Mehr von USC (6)

Haplotyping and genotype imputation using Graphics Processing Units
Haplotyping and genotype imputation using Graphics Processing UnitsHaplotyping and genotype imputation using Graphics Processing Units
Haplotyping and genotype imputation using Graphics Processing Units
 
Pathway talk for IGES 2009 Hawaii
Pathway talk for IGES 2009 HawaiiPathway talk for IGES 2009 Hawaii
Pathway talk for IGES 2009 Hawaii
 
Kinship adjusted armitage trend test for ENDGAME meeting 2008
Kinship adjusted armitage trend test for ENDGAME meeting 2008Kinship adjusted armitage trend test for ENDGAME meeting 2008
Kinship adjusted armitage trend test for ENDGAME meeting 2008
 
Integration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingIntegration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modeling
 
Analysis update for GENEVA meeting 2011
Analysis update for GENEVA meeting 2011Analysis update for GENEVA meeting 2011
Analysis update for GENEVA meeting 2011
 
OpenCL applications in genomics
OpenCL applications in genomicsOpenCL applications in genomics
OpenCL applications in genomics
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Multi-core programming talk for weekly biostat seminar

  • 4. 1. High performance computing clusters
Coarse-grained, aka “embarrassingly parallel”, problems:
1. Launch multiple instances of the program
2. Compute summary statistics across the log files
Examples: Monte Carlo simulations (power/specificity), GWAS scans, imputation, etc.
Remarks:
Pros: maximizes throughput (CPUs are kept busy), gentle learning curve
Cons: doesn't address some interesting computational problems
  • 5. Cluster Resource Example
HPCC at USC:
A 94-teraflop cluster
1,980 simultaneous processes running on the main queue
Jobs are asynchronous; they can start and end in any order
Portable Batch System:
Simply prepend some headers to your shell script describing how much memory you want, how long your job will run, etc.
  • 6. 2. High performance computing clusters
Tightly-coupled parallel programs
Message Passing Interface (MPI):
1. Programs are distributed across multiple physical hosts
2. Each program executes the exact same code
3. All processes can be synchronized at strategic points
Remarks:
Pro: can run interesting algorithms like parallel tempered MCMC
Con: the developer is responsible for establishing a communication protocol
  • 7. Exploiting multi-core processors
Fine-grained parallelism:
Suggests a much higher degree of inter-dependence between each process
A “master” process executes the majority of the code base; “slave” processes are invoked to ease bottlenecks
We hope to minimize the time spent in the master process
Some Bayesian algorithms stand to benefit
  • 8. Amdahl's Law
The maximum speedup from parallelizing a fraction P of a program's runtime across N cores is
$S(N) = \dfrac{1}{(1-P) + P/N}$
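As a quick worked example (my own numbers, not from the slide): with 90% of the runtime parallelizable (P = 0.9) on N = 8 cores,

    \[
      S(8) \;=\; \frac{1}{0.1 + 0.9/8} \;=\; \frac{1}{0.2125} \;\approx\; 4.7,
      \qquad
      \lim_{N\to\infty} S(N) \;=\; \frac{1}{1-P} \;=\; 10.
    \]

The serial fraction caps the speedup at 10x no matter how many cores are added.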
  • 9. Heterogeneous Computing
[figure]
  • 10. Multi-core programming
aka data-parallel programming
Built in to common compilers (e.g. gcc); very easy to get started!
SSE (Streaming SIMD Extensions): each core can do vector operations
OpenMP: parallel processing across multiple cores; simply insert a “pragma omp for” directive and compile with gcc (see the sketch below)
CUDA/OpenCL:
CUDA is a proprietary C-based language endorsed by nVidia
OpenCL is a standards-based implementation backed by the Khronos Group
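A minimal OpenMP sketch of the kind the slide alludes to (my own illustration, not code from the talk); compile with gcc -fopenmp:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const int n = 1000000;
        static double x[1000000];
        double sum = 0.0;

        /* The pragma splits the loop iterations across all available cores. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] = 0.5 * i;

        /* A reduction clause safely accumulates per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i];

        printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }

One directive per loop is the entire parallelization effort, which is why the slide calls this the gentlest entry point.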
  • 11. OpenCL and CUDA
CUDA:
Powerful libraries are available to enrich productivity
Thrust: C++ generics; cuBLAS: Level 1 and 2 parallel BLAS
Supported only on nVidia GPU devices
OpenCL:
Compatible with nVidia and ATI GPU devices, as well as AMD/Intel CPUs
Lags behind CUDA in libraries and tools
Good to work with, given that ATI hardware currently leads in value
  • 12. A $60 HPC under your desk
[figure]
  • 13. An outline: Introduction to high-performance computing; Concepts; Example 1: Hidden Markov Model Training; Example 2: Regularized Logistic Regression; Closing remarks
  • 14. Threads and threadblocks
Threads:
Perform a very limited function, but do all the heavy lifting
Are extremely lightweight, so you'll want to launch thousands
Threadblocks:
The developer assigns threads that can cooperate on a common task into threadblocks
Threadblocks cannot communicate with one another and run in any order (asynchronously)
(A minimal kernel-launch sketch follows.)
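To make the thread/threadblock hierarchy concrete, here is a minimal CUDA sketch (my illustration, not code from the talk): each of the thousands of threads handles one array element, recovering its global index from its block and thread IDs.

    __global__ void scale(float *x, float a, int n) {
        /* Each thread computes one global index from its block/thread IDs. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              /* guard: the last block may be partially full */
            x[i] *= a;
    }

    /* Host side: launch enough 256-thread blocks to cover all n elements:
       scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);                       */

Because the blocks are independent, the hardware is free to schedule them in any order, which is exactly the asynchrony the slide describes.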
  • 15. Thread organization
[figure]
  • 16. Memory hierarchy
[figure]
  • 17. Kernels
Warps/wavefronts:
Describe an atomic set of threads (32 for nVidia, 64 for ATI)
Instructions are executed in lock step across the set, each thread processing a distinct data element
The developer is responsible for synchronizing across warps
Kernels:
Code that the developer writes, which can execute on a SIMD device
Essentially C functions (a reduction sketch follows)
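Synchronizing across the warps of one threadblock is done with __syncthreads(); a standard log2-step sum reduction in shared memory (again my own sketch, not the talk's code) shows the pattern that the later HMM and regression kernels rely on:

    #define BLOCK 256

    __global__ void block_sum(const float *x, float *partial, int n) {
        __shared__ float cache[BLOCK];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        cache[threadIdx.x] = (i < n) ? x[i] : 0.0f;
        __syncthreads();                 /* all warps must finish loading */

        /* Halve the number of active threads each step: log2(BLOCK) steps. */
        for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            partial[blockIdx.x] = cache[0];  /* one partial sum per block */
    }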
  • 18. An outline: Introduction to high-performance computing; Concepts; Example 1: Hidden Markov Model Training; Example 2: Regularized Logistic Regression; Closing remarks
  • 19. Hidden Markov Models
A staple in machine learning
Many applications in statistical genetics, including imputation of untyped genotypes, local ancestry, and sequence alignment (e.g. protein family scoring)
  • 20. Application to cancer tumor data
Extending PennCNV:
Tissues are assumed to be a mixture of tumor/normal cells
Tumors are assumed to be heterogeneous in copy number (CN) across cells, implying fractional copy number states
PennCNV defines 6 hidden integer states for normal cells and does not infer allelic state
We can make more precise estimates of both copy numbers and allelic state in tumors with little sacrifice in performance
Copy number: $z = (1-\alpha)\,z_{\mathrm{normal}} + \alpha\,z_{\mathrm{tumor}}$, where z is fractional and $z_{\mathrm{tumor}} = I(z \le 2)\,\lfloor z \rfloor + I(z > 2)\,\lceil z \rceil$
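As a worked example with numbers of my own choosing (consistent with the state table on the next slide): a 50/50 tumor/normal mixture (α = 0.5) in which tumor cells carry three copies gives

    \[
      z \;=\; (1-\alpha)\,z_{\mathrm{normal}} + \alpha\,z_{\mathrm{tumor}}
        \;=\; 0.5 \times 2 \;+\; 0.5 \times 3 \;=\; 2.5,
    \]

which is the fractional state with CNfrac = 2.5 and CNtumor = 3 in the table.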
  • 21. State Space

    state  CNfrac  BACnormal  CNtumor  BACtumor
      0    2       0          2        0
      1    2       1          2        1
      2    2       2          2        2
      3    0       0          0        0
      4    0       1          0        0
      5    0       2          0        0
      6    0.5     0          0        0
      7    0.5     1          0        0
      8    0.5     2          0        0
      9    1       0          1        0
     10    1       1          1        0
     11    1       1          1        1
     12    1       2          1        1
     13    1.5     0          1        0
     14    1.5     1          1        0
     15    1.5     1          1        1
     16    1.5     2          1        1
     17    2.5     0          3        0
     18    2.5     1          3        1
     19    2.5     1          3        2
     20    2.5     2          3        3
     21    3       0          4        0
     22    3       1          4        1
     23    3       1          4        2
     24    3       1          4        3
     25    3       2          4        4
     26    3.5     0          4        0
     27    3.5     1          4        1
     28    3.5     1          4        2
     29    3.5     1          4        3
     30    3.5     2          4        4
  • 22. Training a Hidden Markov Model
Objective: infer the probabilities of transitioning between any pair of states
Apply the forward-backward and Baum-Welch algorithms
A special case of the Expectation-Maximization (or, more generally, MM) family of algorithms:
Expectation step: forward-backward computes posterior probabilities based on the estimated parameters
Maximization step: Baum-Welch empirically estimates parameters by averaging across observations
  • 23. Forward algorithm
We compute the probability vector at observation t: $f_{0:t} = f_{0:t-1}\, T\, O_t$
Each state (element of the m-state vector) can independently compute a sum-product
Threadblocks map to states
Threads calculate the products in parallel, followed by a log2(m) addition reduction (sketched below)
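A sketch of how one forward step could map onto the GPU under the layout the slide describes (one threadblock per destination state, one thread per source state); the names and memory layout are my assumptions, not the talk's actual kernel:

    /* One block per destination state j; m threads cooperate on the
       sum-product  f_t[j] = O_t[j] * sum_i f_prev[i] * T[i][j].      */
    __global__ void forward_step(const float *f_prev, const float *T,
                                 const float *O_t, float *f_t, int m) {
        extern __shared__ float prod[];        /* m floats per block  */
        int j = blockIdx.x;                    /* destination state   */
        int i = threadIdx.x;                   /* source state        */
        prod[i] = f_prev[i] * T[i * m + j];    /* products in parallel */
        __syncthreads();

        /* log2(m) addition reduction (m assumed a power of two). */
        for (int stride = m / 2; stride > 0; stride >>= 1) {
            if (i < stride) prod[i] += prod[i + stride];
            __syncthreads();
        }
        if (i == 0) f_t[j] = O_t[j] * prod[0];
    }

    /* Launch: forward_step<<<m, m, m * sizeof(float)>>>(...);  */

With m = 512 states this fits within a single threadblock on the hardware discussed later (Tesla C2050), matching the serial-to-parallel complexity drop claimed on the speedups slide.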
  • 24. Gridblock of threadblocks
[figure]
  • 25. Speedups
We implement 8 kernels. Examples:
Re-scaling the transition matrix (for SNP spacing): serial O(2nm²); parallel O(n)
Forward-backward: serial O(2nm²); parallel O(n log2(m))
Normalizing constant (Baum-Welch): serial O(nm); parallel O(log2(n))
MLE of the transition matrix (Baum-Welch): serial O(nm²); parallel O(n)
  • 26. Run time comparison

Table: 1 iteration of HMM training on Chr 1 (41,263 SNPs)

    states   CPU       GPU      fold-speedup
    128      9.5m      37s      15x
    512      2h 35m    1m 44s   108x
  • 27. An outline: Introduction to high-performance computing; Concepts; Example 1: Hidden Markov Model Training; Example 2: Regularized Logistic Regression; Closing remarks
  • 28. Regularized Regression
Variable selection:
For tractability, most GWAS analyses entail separate univariate tests of each variable (e.g. SNP, GxG, GxE)
However, it is preferable to model all variables simultaneously to tease out correlated variables
This is problematic when p > n: parameters are unestimable, and matrix inversion becomes computationally intractable
  • 29. Regularized Regression
The LASSO method (Tibshirani, 1996):
Seeded a cottage industry of related methods, e.g. Group LASSO, Elastic Net, MCP, NEG, Overlap LASSO, Graph LASSO
Fundamentally solves the variable selection problem by introducing an L1 norm to invoke sparsity
Limitation: does not provide a mechanism for hypothesis testing (e.g. p-values)
  • 30. Regularized Regression
Bayesian methods:
Posterior inferences on β, e.g. the Bayesian LASSO, Bayesian Elastic Net
Highly computational; scaling up to genome-wide scale is not obvious
MCMC is inherently serial, so the best option is to speed up the sampling chain
Proposal: implement the key bottleneck on the GPU: fitting β_LASSO to the data
  • 31. Optimization
For binomial logistic regression:

    $L(\beta) = \sum_{i=1}^n \left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right]$, with
    $p_i = \dfrac{e^{\mu + x_i^t\beta}}{1 + e^{\mu + x_i^t\beta}}$

    $\nabla L(\beta) = \sum_{i=1}^n [y_i - p_i(\beta)]\, x_i$

    $-d^2 L(\beta) = \sum_{i=1}^n p_i(\beta)[1 - p_i(\beta)]\, x_i x_i^t$

For *penalized* regression:

    $f(\beta) = L(\beta) - \lambda \sum_{j=1}^p |\beta_j|$

Find the global maximum by applying Newton-Raphson one variable at a time:

    $\beta_j^{m+1} = \beta_j^m + \dfrac{\sum_{i=1}^n [y_i - p_i(\beta^m)]\,x_{ij} - \lambda\,\mathrm{sgn}(\beta_j^m)}{\sum_{i=1}^n p_i(\beta^m)[1 - p_i(\beta^m)]\,x_{ij}^2}$

(The step is added, not subtracted, because the denominator is the positive quantity −d²L defined above.)
  • 32. Overview of algorithm
Newton-Raphson kernel:
Each threadblock maps to a block of 512 subjects (threads) for one variable
Each thread calculates its subject's contribution to the gradient and Hessian
Sum (reduction) across the 512 subjects
Sum (reduction) across subject blocks in a new kernel
Compute the log-likelihood change for each variable (as above)
Apply a max operator (a log2 reduction) to select the variable with the greatest contribution to the likelihood
Iterate repeatedly until the likelihood increase is less than epsilon
(A sketch of the per-subject kernel follows.)
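A simplified sketch of the first two steps (one 512-thread block per variable and subject-block, each thread contributing one subject's gradient and Hessian term). The names, and the dense float genotype array, are my simplifications; the talk's implementation packs genotypes, as the next slides describe:

    #define SUBJECTS_PER_BLOCK 512

    /* Block (v, b) handles variable v and subjects [b*512, (b+1)*512). */
    __global__ void grad_hess(const float *x,   /* n-by-p, row = subject */
                              const float *y, const float *p,
                              int n, int pvar,
                              float *grad_partial, float *hess_partial) {
        __shared__ float g[SUBJECTS_PER_BLOCK], h[SUBJECTS_PER_BLOCK];
        int v = blockIdx.x;                                      /* variable */
        int i = blockIdx.y * SUBJECTS_PER_BLOCK + threadIdx.x;   /* subject  */
        float xij = (i < n) ? x[(size_t)i * pvar + v] : 0.0f;
        float r   = (i < n) ? (y[i] - p[i])            : 0.0f;
        float w   = (i < n) ? (p[i] * (1.0f - p[i]))   : 0.0f;
        g[threadIdx.x] = r * xij;        /* gradient contribution */
        h[threadIdx.x] = w * xij * xij;  /* Hessian contribution  */
        __syncthreads();

        /* log2(512) reduction across the subjects in this block. */
        for (int s = SUBJECTS_PER_BLOCK / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) {
                g[threadIdx.x] += g[threadIdx.x + s];
                h[threadIdx.x] += h[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {          /* one partial per subject block */
            grad_partial[v * gridDim.y + blockIdx.y] = g[0];
            hess_partial[v * gridDim.y + blockIdx.y] = h[0];
        }
    }

A second, smaller kernel would then reduce the per-block partials and form the coordinate update from the previous slide.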
  • 33. Gridblock of threadblocks
[figure]
  • 34. Consideration of datatypes
Need to compress genotypes
Why? Global memory is scarce and bandwidth is expensive
A warp of 32 threads loads 32 words (containing 512 genotypes) into local memory
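Thirty-two 32-bit words holding 512 genotypes works out to 2 bits per genotype; a device-side unpacking sketch of that encoding (my own illustration; the 0/1/2-plus-missing coding is my assumption, not stated in the talk):

    /* 2 bits per genotype: 16 genotypes per 32-bit word, so a warp of 32
       threads loading 32 words brings 512 genotypes into local memory. */
    __device__ inline int unpack_genotype(const unsigned int *packed, int g) {
        unsigned int word = packed[g >> 4];   /* word holding genotype g  */
        int shift = (g & 15) * 2;             /* bit position within word */
        return (word >> shift) & 3;           /* 0, 1, 2 (3 = missing?)   */
    }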
  • 35. Distributed GPU implementation
For really large dimensions, we can link up an arbitrary number of GPUs
MPI allows us to spread the work across a cluster
Developed on Epigraph: 2 Tesla C2050s
Approach:
The MPI master node delegates the heavy lifting to slaves across the network
The master node performs fast serial code, such as sampling from the full conditional likelihood of any penalty parameter (e.g. λ)
Network traffic is minimized, so slaves must maintain up-to-date copies of the data structures
(A minimal MPI skeleton follows.)
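A minimal master/slave skeleton of this pattern (my own sketch; the variable names and division of work are illustrative, not the talk's code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double beta[1000] = {0};                  /* shared model state */
        struct { double val; int idx; } local, best;

        /* Each slave scans its own slice of variables on its GPU and
           reports the best likelihood improvement it found.           */
        local.val = 0.0;  local.idx = rank;       /* placeholder work   */

        /* Master learns which variable (and owner) improved f the most. */
        MPI_Reduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
                   0, MPI_COMM_WORLD);

        /* Master updates beta (and e.g. samples lambda), then broadcasts
           so every slave keeps an up-to-date copy of the state.         */
        MPI_Bcast(beta, 1000, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("best update owned by rank %d\n", best.idx);
        MPI_Finalize();
        return 0;
    }

Broadcasting only the small updated state, rather than the genotype data, is what keeps the network traffic minimal.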
  • 36. [figure]
  • 37. Evaluation on large dataset
GWAS data:
6,806 African American subjects in a case-control study of prostate cancer
1,047,986 SNPs typed
Elapsed walltime for 1 LASSO iteration (a sweep across all variables):
15 minutes for the optimized serial implementation across 2 slave CPUs
5.8 seconds for the parallel implementation across 2 nVidia Tesla C2050 GPU devices
A 155x speed-up
  • 38. [figure]
  • 39. An outline: Introduction to high-performance computing; Concepts; Example 1: Hidden Markov Model Training; Example 2: Regularized Logistic Regression; Closing remarks
  • 40. Conclusion
Multi-core programming is not a panacea:
Insufficient parallelism leads to an inferior implementation
Graph algorithms *generally* do not map well to SIMD architectures
Programming effort:
Expect to spend at least 90% of your time debugging a black box
Is it worth it? Is human time > computer time?
For generic problems (matrix multiplication, sorting), absolutely
OpenCL is a bit more verbose than CUDA, but is more portable
  • 41. Potential Future Work
Reconstructing Bayesian networks:
Compute the joint probability for each possible topology
Code the graph as a sparse adjacency matrix
Approximate Bayesian Computation:
Sample θ from some assumed prior distribution
Generate a dataset conditional on θ
Examine how close the fake data is to the real data
  • 42. Tomorrow's clusters will require heterogeneous programming
[figure]
  • 43. Tianhe-1A
World's fastest supercomputer
4.7 petaflops (quadrillion floating-point operations/sec)
14,336 Xeon CPUs, 7,168 Tesla M2050s
According to nVidia, a CPU-only equivalent would need:
50k CPUs and twice the floor space
12 megawatts, compared to 4.04 megawatts
$88 million to build, $20 million in annual energy costs
  • 44. Thanks to
Kai: ideas for the CNV analysis
Duncan, Wei: discussions on the LASSO
Tim, Zack: access to Epigraph
Alex, James: lively HPC discussions/debates