BOLT: A C++ TEMPLATE LIBRARY
FOR HSA

Ben Sander
AMD
Senior Fellow
MOTIVATION
§ Improve developer productivity
   –  Optimized library routines for common GPU operations
   –  Works with open standards (OpenCL™ and C++ AMP)
   –  Distributed as open source


§ Make GPU programming as easy as CPU programming
   –  Resemble familiar C++ Standard Template Library
   –  Customizable via C++ template parameters
   –  Leverage high-performance shared virtual memory

§ Optimize for HSA
   –  Single source base for GPU and CPU
   –  Platform Load Balancing



3 | BOLT | June 2012
AGENDA


§ Introduction and Motivation
§ Bolt Code Examples for C++ AMP and OpenCL™
§ ISV Proof Point
§ Single source code base for CPU and GPU
§ Platform Load Balancing
§ Summary




4 | BOLT | June 2012
SIMPLE BOLT EXAMPLE
      #include <bolt/sort.h>
      #include <vector>
      #include <algorithm>
      #include <cstdlib>   // rand

      int main()
      {
          // generate random data (on host)
          std::vector<int> a(1000000);
          std::generate(a.begin(), a.end(), rand);

          // sort, run on best device
          bolt::sort(a.begin(), a.end());
      }


§ Interface similar to familiar C++ Standard Template Library
§ No explicit mention of C++ AMP or OpenCL™ (or GPU!)
   –  More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™
§ Direct use of host data structures (e.g. std::vector)
§ bolt::sort implicitly runs on the platform
   –  Runtime automatically selects CPU or GPU (or both)

 5 | BOLT | June 2012
BOLT FOR C++ AMP : USER-SPECIFIED FUNCTOR
#include <bolt/transform.h>
#include <vector>

struct SaxpyFunctor
{
   float _a;
   SaxpyFunctor(float a) : _a(a) {}

   // restrict(cpu,amp): callable from both host code and C++ AMP kernels
   float operator() (const float &xx, const float &yy) restrict(cpu,amp)
   {
      return _a * xx + yy;
   }
};

int main() {
   SaxpyFunctor s(100);
   std::vector<float> x(1000000); // initialization not shown
   std::vector<float> y(1000000); // initialization not shown
   std::vector<float> z(1000000);

   // z = a * x + y, element by element
   bolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
}




6 | BOLT | June 2012
BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA

 #include <bolt/transform.h>
 #include <vector>

 int main()
 {
    const float a=100;
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);

    // saxpy with C++ Lambda
    bolt::transform(x.begin(), x.end(), y.begin(), z.begin(),
        [=] (float xx, float yy) restrict(cpu, amp) {
            return a * xx + yy;
        });
 }



§ Functor (“a * xx + yy”) now specified inline
§ Can capture variables from surrounding scope (“a”) – eliminate boilerplate class
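
For comparison, the same SAXPY can be written against the C++ Standard Library, which is the interface Bolt resembles. The sketch below is a CPU-only analogue using std::transform and does not call Bolt. The restrict(cpu, amp) qualifier is left out because it is specific to C++ AMP, and the helper names saxpy_functor and saxpy_lambda are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Same functor as on the slide, minus the C++ AMP restrict qualifier.
struct SaxpyFunctor
{
    float _a;
    explicit SaxpyFunctor(float a) : _a(a) {}

    float operator()(const float& xx, const float& yy) const
    {
        return _a * xx + yy;
    }
};

// z = a * x + y using the named functor (hypothetical helper name).
std::vector<float> saxpy_functor(float a, const std::vector<float>& x,
                                 const std::vector<float>& y)
{
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(), SaxpyFunctor(a));
    return z;
}

// Same computation with a C++11 lambda that captures `a` by value,
// so no boilerplate class is needed.
std::vector<float> saxpy_lambda(float a, const std::vector<float>& x,
                                const std::vector<float>& y)
{
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [=](float xx, float yy) { return a * xx + yy; });
    return z;
}
```

Both helpers take the same iterator-based call shape as the bolt::transform calls on the slides, which suggests that moving existing STL code to Bolt is mostly a change of namespace plus the restrict qualifier.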


7 | BOLT | June 2012
BOLT FOR OPENCL™

         #include <clbolt/sort.h>
         #include <vector>
         #include <algorithm>
         #include <cstdlib>   // rand

         int main()
         {
             // generate random data (on host)
             std::vector<int> a(1000000);
             std::generate(a.begin(), a.end(), rand);

             // sort, run on best device
             clbolt::sort(a.begin(), a.end());
         }


§ Interface similar to familiar C++ Standard Template Library
§ clbolt uses OpenCL™ below the API level
     –  Host data copied or mapped to the GPU
     –  First call to clbolt::sort will generate and compile a kernel
§ More advanced use cases allow the programmer to supply a kernel in OpenCL™
8 | BOLT | June 2012
BOLT FOR OPENCL™ : USER-SPECIFIED FUNCTOR

#include <clbolt/transform.h>
#include <vector>

BOLT_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor
{
   float _a;
   SaxpyFunctor(float a) : _a(a) {};

   float operator() (const float &xx, const float &yy)
   {
      return _a * xx + yy;
   };
};
);

void main2() {
   SaxpyFunctor s(100);
   std::vector<float> x(1000000); // initialization not shown
   std::vector<float> y(1000000); // initialization not shown
   std::vector<float> z(1000000);

   clbolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
};

§ Challenge: OpenCL™ split-source model
   –  Host code in C or C++
   –  OpenCL™ code specified in strings
§ Solution:
   –  BOLT_FUNCTOR macro creates both host-side and string versions of “SaxpyFunctor” class definition
       §  Class name (“SaxpyFunctor”) stored in TypeName trait
       §  OpenCL™ kernel code (SaxpyFunctor class def) stored in ClCode trait.
   –  Clbolt function implementation
       §  Can retrieve traits from class name
       §  Uses TypeName and ClCode to construct a customized transform kernel
       §  First call to clbolt::transform compiles the kernel
   –  Advanced users can directly create ClCode trait
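
The following is a minimal sketch of how a macro can produce both a host-side class and a string copy of the same source, using preprocessor stringification. It is not Bolt's actual BOLT_FUNCTOR implementation. The macro name SKETCH_FUNCTOR, the trait definitions, and build_transform_kernel_source are assumptions made for illustration; only the trait names TypeName and ClCode come from the slide.

```cpp
#include <cassert>
#include <string>

// Primary templates for the two traits; the macro specializes them per functor.
// The trait names follow the slide, but these definitions are assumptions.
template <typename T> struct TypeName;
template <typename T> struct ClCode;

// Hypothetical macro: emits the class definition for the host compiler and
// stringifies the same tokens so they can be pasted into kernel source.
#define SKETCH_FUNCTOR(NAME, ...)                                   \
    __VA_ARGS__                                                     \
    template <> struct TypeName<NAME> {                             \
        static std::string get() { return #NAME; }                  \
    };                                                              \
    template <> struct ClCode<NAME> {                               \
        static std::string get() { return #__VA_ARGS__; }           \
    };

SKETCH_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {}

    float operator()(const float& xx, const float& yy) const
    {
        return _a * xx + yy;
    }
};
)

// A library routine could then assemble kernel source from the two traits.
// The kernel text here is illustrative only and is never compiled.
template <typename F>
std::string build_transform_kernel_source()
{
    assert(!TypeName<F>::get().empty());
    return ClCode<F>::get() +
           "\n__kernel void transform_" + TypeName<F>::get() + "(/* ... */) {}\n";
}
```

The host compiler sees an ordinary struct, while ClCode<SaxpyFunctor>::get() returns the same definition as text with whitespace collapsed. This is why such a functor has to be self-contained: anything it references must also be available as source when the kernel string is compiled at runtime.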
 9 | BOLT | June 2012
BOLT: C++ AMP VS. OPENCL™

BOLT for C++ AMP                                               BOLT for OpenCL™
§  C++ template library for HSA                               §  C++ template library for HSA
    –  Developer can customize data types and operations          –  Developer can customize data types and operations
    –  Provide library of optimized routines for AMD GPUs.        –  Provide library of optimized routines for AMD GPUs.
§  C++ Host Language                                          §  C++ Host Language
§  Kernels marked with “restrict(cpu, amp)”                   §  Kernels marked with “BOLT_FUNCTOR” macro
§  Kernels written in C++ AMP kernel language                 §  Kernels written in OpenCL™ kernel language
    –  Restricted set of C++                                      –  Subset of C99, with extensions (e.g. vectors, builtins)
§  Kernels compiled at compile-time                           §  Kernels compiled at runtime, on first call
                                                                  –  Some compile errors shown on first call
§  C++ Lambda Syntax Supported                                §  C++11 Lambda Syntax NOT supported
§  Functors may contain array_view                            §  Functors may not contain pointers
§  Parameters can use host data structures (e.g. std::vector) §  Parameters can use host data structures (e.g. std::vector)
§  Parameters can be array or array_view types                §  Parameters can be cl::Buffer or cl_buffer types
§  Use “bolt” namespace                                       §  Use “clbolt” namespace

10 | BOLT | June 2012
BOLT : WHAT’S NEW?

§ Optimized template library routines for common GPU functions
    –  For OpenCL™ and C++ AMP, across multiple platforms
§ Direct interfaces to host memory structures (e.g. std::vector)
    –  Leverage HSA unified address space and zero-copy memory
    –  C++ AMP array and cl::Buffer also supported if memory already on device
§ Bolt submits to the entire platform rather than a specific device
    –  Runtime automatically selects the device
    –  Provides opportunities for load-balancing
    –  Provides optimal CPU path if no GPU is available.
    –  Overriding the automatic choice to target a specific accelerator is supported
    –  Enables developers to fearlessly move to the GPU
§ Bolt will contain new APIs optimized for HSA Devices
    –  Multi-device bolt::pipeline, bolt::parallel_filter

11 | BOLT | June 2012
EXEMPLARY ISV PROOF-POINT

§ “Hessian” kernel from “MotionDSP Ikena”
    –  Commercially available video enhancement software
    –  Optimized for CPU and GPU
§ Basic Hessian Algorithm
    –  Two input images I and W
    –  Transform, followed by reduce (“transform_reduce”)
        §  For each pixel in image, compute 14 float coefficients
        §  Sum the coefficients for all the pixels; final result is 14 floats
    –  Complex, computationally intense, real-world algorithm
§ Developed multiple implementations of Hessian kernel
    –  CPU, GPU, Bolt

Hessian Algorithm Pseudo Code:

    // x,y are coordinates of pixel to transform
    // Pixel difference:
    It = W(y, x) - I(y, x);
    // average left/right pixels:
    Ix = 0.5f * ( W(y, x+1) - W(y, x-1) );
    // average top/bottom pixels:
    Iy = 0.5f * ( W(y+1, x) - W(y-1, x) );
    X = x dist of this pixel from center
    Y = y dist of this pixel from center
    …
    // Compute for each pixel:
    H[ 0] = (Ix*X+Iy*Y) * (Ix*X+Iy*Y)
    H[ 1] = (Ix*X-Iy*Y) * (Ix*X+Iy*Y)
    H[ 2] = (Ix*X-Iy*Y) * (Ix*X-Iy*Y)
    H[ 3] = (Ix       ) * (Ix*X+Iy*Y)
    H[ 4] = (Ix       ) * (Ix*X-Iy*Y)
    H[ 5] = (Ix       ) * (Ix       )
    H[ 6] = (Iy       ) * (Ix*X+Iy*Y)
    H[ 7] = (Iy       ) * (Ix*X-Iy*Y)
    H[ 8] = (Iy       ) * (Ix       )
    H[ 9] = (Iy       ) * (Iy       )
    H[10] = (It       ) * (Ix*X+Iy*Y)
    H[11] = (It       ) * (Ix*X-Iy*Y)
    H[12] = (It       ) * (Ix       )
    H[13] = (It       ) * (Iy       )
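
To make the transform-then-reduce shape concrete, here is a serial C++ sketch of the pseudo code. It is not the MotionDSP or Bolt implementation. The Image type, the choice of image center, and the skipping of border pixels are assumptions, and the part of the pseudo code elided by "…" is not reproduced.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal row-major float image; the type and its members are assumptions.
struct Image
{
    std::size_t width;
    std::size_t height;
    std::vector<float> pixels;

    float operator()(std::size_t y, std::size_t x) const
    {
        return pixels[y * width + x];
    }
};

typedef std::array<float, 14> Coeffs;

// "Transform" step: the 14 coefficients for one pixel, in slide order.
Coeffs hessian_pixel(const Image& I, const Image& W, std::size_t y, std::size_t x)
{
    const float It = W(y, x) - I(y, x);
    const float Ix = 0.5f * (W(y, x + 1) - W(y, x - 1));
    const float Iy = 0.5f * (W(y + 1, x) - W(y - 1, x));
    // Assumption: the image center is at ((width - 1) / 2, (height - 1) / 2).
    const float X = static_cast<float>(x) - 0.5f * static_cast<float>(W.width - 1);
    const float Y = static_cast<float>(y) - 0.5f * static_cast<float>(W.height - 1);
    const float p = Ix * X + Iy * Y;
    const float m = Ix * X - Iy * Y;
    Coeffs h = {{ p * p,   m * p,   m * m,   Ix * p,  Ix * m,  Ix * Ix, Iy * p,
                  Iy * m,  Iy * Ix, Iy * Iy, It * p,  It * m,  It * Ix, It * Iy }};
    return h;
}

// "Reduce" step: sum the per-pixel coefficients into a single set of 14 floats.
// Assumption: border pixels are skipped so x-1, x+1, y-1, y+1 stay in range.
Coeffs hessian_transform_reduce(const Image& I, const Image& W)
{
    assert(I.width == W.width && I.height == W.height);
    Coeffs sum;
    sum.fill(0.0f);
    for (std::size_t y = 1; y + 1 < W.height; ++y) {
        for (std::size_t x = 1; x + 1 < W.width; ++x) {
            const Coeffs h = hessian_pixel(I, W, y, x);
            for (std::size_t k = 0; k < h.size(); ++k) {
                sum[k] += h[k];
            }
        }
    }
    return sum;
}
```

The per-pixel function has no dependence on other pixels' results and the reduction is a plain elementwise sum, which is what lets a library routine such as transform_reduce choose how to distribute the work.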



12 | BOLT | June 2012
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS

                                                                              (Exemplary ISV “Hessian” Kernel)
[Chart: lines of code (left axis "LOC", 0 to 350), broken down into Copy-back, Algorithm, Launch, Copy, Compile and Init, together with relative performance (right axis, 0 to 35.00), for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP and HSA Bolt]


13 | The Programmer’s Guide to a Universe of Possibility         | June 12, 2012
PERFORMANCE PORTABILITY - INTRODUCTION


§ For many algorithms, core operation same between CPU and GPU
    –  See sort, saxpy, hessian examples
    –  Same Core Operation
    –  Differences in how data is routed to the core operation


§ Bolt hides the device-specific routing details inside the library function implementation
    –  GPU implementations:
        §  GPU-friendly data strides
        §  Launch enough threads to hide memory latency
        §  Group Memory and work-group communication
    –  CPU implementations:
        §  CPU-friendly data strides
        §  Launch enough threads to use all cores


14 | BOLT | June 2012
PERFORMANCE PORTABILITY – RESULTS

CPU Performance vs Programming Model
(Exemplary ISV "Hessian" Kernel)

[Chart: relative performance ("Rel Perf", axis 0.00 to 4.50) for Serial CPU, TBB CPU, OpenCL (CPU) and HSA Bolt (CPU)]





15 | BOLT | June 2012
PERFORMANCE PORTABILITY – WHAT’S NEW ?

§ New GPU programming models are close to CPU programming models
   –  C++ AMP : Single-source, (restricted) C++11 kernel language, high-quality debugger/profiler, etc
§ Shared Virtual Memory in HSA
   –  Removes tedious copies between address spaces
   –  Will allow use of complex pointer-containing data structures
§ Fewer performance cliffs in modern GPU architectures (e.g. AMD GCN)
   –  Reduce need for GPU-specific optimizations in core operation
   –  Example: 14:7:1 Bandwidth Ratio for Group:Cache:Global Memory
§ Autovectorization
   –  Modern compilers include auto-vectorization support
   –  Restrictions of GPU programming models facilitate vectorization
§ Finally, Bolt functors can provide device-specific implementations if needed




16 | BOLT | June 2012
HSA LOAD BALANCING : KEY FEATURES AND OBSERVATIONS

§  High-performance shared virtual memory
    –  Developers no longer have to worry about data location (i.e. device vs host)


§  HSA platforms have tightly integrated CPU and GPU
    –  GPU better at wide vector parallelism, extracting memory bandwidth, latency hiding
    –  CPU better at fine-grained vector parallelism, cache-sensitive code, control-flow


§  Bolt Abstractions
    –  Provides insight into the characteristics of the algorithm
        §  Reduce vs Transform vs parallel_filter
    –  Abstraction above the details of a “kernel launch”
        §  Don’t need to specify device, workgroup shape, work-items, number of kernels, etc
        §  Runtime may optimize these for the platform


§  Bolt has access to both optimized CPU and GPU implementations, at the same time
    –  Let’s use both!

17 | BOLT | June 2012
EXAMPLES OF HSA LOAD-BALANCING

    Example                   Description                                    Exemplary Use Cases

    Data Size                 Run large data sizes on GPU, small on CPU      Same call-site used for varying data sizes.

    Reduction                 Run initial reduction phases on GPU, run       Any reduction operation.
                              final stages on CPU

    Border/Edge               Run wide center regions on GPU, run            Image processing.
    Optimization              border regions on CPU.

    Platform Super-Device     Distribute workgroups to available             Kernel has similar performance/energy on
                              processing units on the entire platform.       CPU and GPU.

    Heterogeneous Pipeline    Run a pipelined series of user-defined         Video processing pipeline.
                              stages. Stages can be CPU-only, GPU-only,
                              or CPU or GPU.

    Parallel_filter           GPU scans all candidates and rejects early     Haar detector, word search, audio search.
                              mismatches; CPU more deeply evaluates
                              the survivors.
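
The "Data Size" row can be illustrated with a small dispatch sketch. Everything here is an assumption made for illustration: the names, the fixed threshold of 65536 elements, and the use of std::sort on both paths (this sketch has no GPU backend). A real runtime would presumably tune the crossover point per platform.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Device { Cpu, Gpu };

// Hypothetical policy: below the threshold, launch overhead is assumed to
// outweigh the GPU's throughput advantage, so the CPU is chosen.
inline Device choose_device(std::size_t n, std::size_t gpu_threshold)
{
    assert(gpu_threshold > 0);
    return n >= gpu_threshold ? Device::Gpu : Device::Cpu;
}

// One call site for all data sizes; returns the device that was selected.
inline Device sort_on_best_device(std::vector<int>& v,
                                  std::size_t gpu_threshold = 65536)
{
    const Device d = choose_device(v.size(), gpu_threshold);
    if (d == Device::Gpu) {
        std::sort(v.begin(), v.end());   // placeholder for a GPU sort
    } else {
        std::sort(v.begin(), v.end());   // CPU path
    }
    return d;
}
```

The calling code never names a device, so the same call can be routed differently as the input grows or as the policy is retuned.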
  


18 | BOLT | June 2012
HETEROGENEOUS PIPELINE
§ Mimics a traditional manufacturing assembly line
    –  Developer supplies a series of pipeline stages
    –  Each stage processes its input token, passes an output token to the next stage
    –  Stages can be either CPU-only, GPU-only, or CPU/GPU
§ CPU/GPU tasks are dynamically scheduled
    –  Use queue depth and estimated execution time to drive scheduling decision
    –  Adapt to variation in target hardware or system utilization
    –  Data location not an issue in HSA
    –  Leverage single source code
§ GPU kernels scheduled asynchronously
    –  Completion invokes next stage of the pipeline
§ Simple Video Pipeline Example:
    Video Decode (CPU-only)  ->  Video Processing (CPU/GPU)  ->  Video Render (GPU-only)
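
A scheduling decision driven by queue depth and estimated execution time could look roughly like the sketch below. This is not the Bolt or HSA runtime scheduler. All types, field names and time estimates are made up for illustration, and the cost model (backlog plus stage time) is an assumption.

```cpp
#include <cassert>
#include <vector>

enum class Affinity { CpuOnly, GpuOnly, Either };
enum class Device { Cpu, Gpu };

// Per-device bookkeeping used by the scheduler (all fields are assumptions).
struct DeviceQueue
{
    int depth;            // tasks already waiting on this device
    double ms_per_task;   // estimated execution time of one queued task
};

struct Stage
{
    const char* name;
    Affinity affinity;
    double est_cpu_ms;    // estimated time of this stage on the CPU
    double est_gpu_ms;    // estimated time of this stage on the GPU
};

// Estimated completion time if the stage were appended to queue q.
inline double finish_time(const DeviceQueue& q, double stage_ms)
{
    assert(q.depth >= 0);
    return q.depth * q.ms_per_task + stage_ms;
}

// Honor fixed affinities; for CPU/GPU stages pick the earlier finish time.
inline Device schedule(const Stage& s, const DeviceQueue& cpu, const DeviceQueue& gpu)
{
    if (s.affinity == Affinity::CpuOnly) return Device::Cpu;
    if (s.affinity == Affinity::GpuOnly) return Device::Gpu;
    return finish_time(gpu, s.est_gpu_ms) <= finish_time(cpu, s.est_cpu_ms)
               ? Device::Gpu
               : Device::Cpu;
}

// The three-stage video pipeline from the slide, with made-up time estimates.
inline std::vector<Stage> video_pipeline()
{
    std::vector<Stage> stages;
    stages.push_back(Stage{"Video Decode", Affinity::CpuOnly, 4.0, 0.0});
    stages.push_back(Stage{"Video Processing", Affinity::Either, 8.0, 2.0});
    stages.push_back(Stage{"Video Render", Affinity::GpuOnly, 0.0, 3.0});
    return stages;
}
```

With both queues idle the processing stage goes to the GPU, but once the GPU queue is deep enough the same stage moves to the CPU. This is the adaptation to system utilization that the slide describes, and it is only safe to do freely because data location is not an issue under shared virtual memory.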


19 | BOLT | June 2012
CASCADE DEPTH ANALYSIS
[Chart: cascade depth, axis 0 to 25, with legend bands 0-5, 5-10, 10-15, 15-20 and 20-25]




20 | The Programmer’s Guide to a Universe of Possibility   | June 12, 2012
PARALLEL_FILTER
§  Target applications with a “Filter” pattern
    –  Filter out a small number of results from a large initial pool of candidates
    –  Initial phases best run on GPU:
        §  Large data sets (too big for caches), wide vector, high-bandwidth
    –  Tail phases best run on CPU
        §  Smaller data sets (may fit in cache), divergent control flow, fine-grained vector width
    –  Examples: Haar detector, word search, acoustic search
§  Developer specifies:
    –  Execution Grid
    –  Iteration state type and initial value
    –  Filter function
        §  Accepts a point to process and the current iteration state
        §  Return True to continue processing or False to exit
§  BOLT / HSA Runtime
    –  Automatically hands off work between CPU and GPU
    –  Balances work by adjusting the split point between GPU and CPU

21 | BOLT | June 2012
SUMMARY

 § Bolt: C++ Template Library
     –  Optimized GPU and HSA Library routines
     –  Customizable via templates
     –  For both OpenCL™ and C++ AMP


 § Enjoy the unique advantages of the HSA Platform
     –  High-performance shared virtual memory
     –  Tightly integrated CPU and GPU

 § Enable advanced HSA features
     –  A single source base for CPU and GPU
     –  Platform load balancing across CPU and GPU




22 | BOLT | June 2012
BACKUP




23 | BOLT | June 2012
BENCHMARK CONFIGURATION INFORMATION


§ Slides 13, 15
    –  AMD A10-5800K APU with Radeon™ HD Graphics
        §  CPU: 4 cores, 3800 MHz (4200 MHz Turbo)
        §  GPU: AMD Radeon™ HD 7660D, 6 compute units, 800 MHz
        §  4 GB RAM

    –  Software:
        §  Windows 7 Professional SP1 (64-bit OS)
        §  AMD OpenCL™ 1.2 AMD-APP (937.2)
        §  Microsoft Visual Studio 11 Beta




24 | BOLT | June 2012
Disclaimer & Attribution
          The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
          and typographical errors.

          The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
          to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
          differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
          obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
          make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

          NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
          RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
          INFORMATION.

          ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
          DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
          OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
          EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

          AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
          this presentation are for informational purposes only and may be trademarks of their respective owners.

          [For AMD-speakers only] © 2012 Advanced Micro Devices, Inc.
          [For non-AMD speakers only] The contents of this presentation were provided by individual(s) and/or company listed on the title
          page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions.
          Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.



25 | BOLT | June 2012

Weitere ähnliche Inhalte

Was ist angesagt?

PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...AMD Developer Central
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerAMD Developer Central
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterAMD Developer Central
 
HSAemu a Full System Emulator for HSA
HSAemu a Full System Emulator for HSA HSAemu a Full System Emulator for HSA
HSAemu a Full System Emulator for HSA HSA Foundation
 
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...AMD Developer Central
 
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...AMD Developer Central
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...AMD Developer Central
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...AMD Developer Central
 
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoAMD Developer Central
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauAMD Developer Central
 
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...AMD Developer Central
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...AMD Developer Central
 
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahAMD Developer Central
 
C11/C++11 Memory model. What is it, and why?
C11/C++11 Memory model. What is it, and why?C11/C++11 Memory model. What is it, and why?
C11/C++11 Memory model. What is it, and why?Mikael Rosbacke
 
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosMM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosAMD Developer Central
 

Bolt C++ Standard Template Library for HSA by Ben Sander, AMD

  • 1.
  • 2. BOLT: A C++ TEMPLATE LIBRARY FOR HSA
    Ben Sander, AMD Senior Fellow
  • 3. MOTIVATION
    § Improve developer productivity
       –  Optimized library routines for common GPU operations
       –  Works with open standards (OpenCL™ and C++ AMP)
       –  Distributed as open source
    § Make GPU programming as easy as CPU programming
       –  Resemble familiar C++ Standard Template Library
       –  Customizable via C++ template parameters
       –  Leverage high-performance shared virtual memory
    § Optimize for HSA
       –  Single source base for GPU and CPU
       –  Platform Load Balancing
  • 4. AGENDA
    § Introduction and Motivation
    § Bolt Code Examples for C++ AMP and OpenCL™
    § ISV Proof Point
    § Single source code base for CPU and GPU
    § Platform Load Balancing
    § Summary
  • 5. SIMPLE BOLT EXAMPLE

        #include <bolt/sort.h>
        #include <vector>
        #include <algorithm>
        #include <cstdlib>   // rand

        int main()
        {
            // generate random data (on host)
            std::vector<int> a(1000000);
            std::generate(a.begin(), a.end(), rand);

            // sort, run on best device
            bolt::sort(a.begin(), a.end());
        }

    § Interface similar to familiar C++ Standard Template Library
    § No explicit mention of C++ AMP or OpenCL™ (or GPU!)
       –  More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™
    § Direct use of host data structures (i.e. std::vector)
    § bolt::sort implicitly runs on the platform
       –  Runtime automatically selects CPU or GPU (or both)
  • 6. BOLT FOR C++ AMP : USER-SPECIFIED FUNCTOR

        #include <bolt/transform.h>
        #include <vector>

        struct SaxpyFunctor
        {
            float _a;
            SaxpyFunctor(float a) : _a(a) {}

            // callable on both the CPU and a C++ AMP accelerator
            float operator() (const float &xx, const float &yy) restrict(cpu,amp)
            {
                return _a * xx + yy;
            }
        };

        int main()
        {
            SaxpyFunctor s(100);
            std::vector<float> x(1000000); // initialization not shown
            std::vector<float> y(1000000); // initialization not shown
            std::vector<float> z(1000000);

            bolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
        }
  • 7. BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA

        #include <bolt/transform.h>
        #include <vector>

        int main()
        {
            const float a = 100;
            std::vector<float> x(1000000); // initialization not shown
            std::vector<float> y(1000000); // initialization not shown
            std::vector<float> z(1000000);

            // saxpy with C++ Lambda
            bolt::transform(x.begin(), x.end(), y.begin(), z.begin(),
                [=] (float xx, float yy) restrict(cpu, amp) {
                    return a * xx + yy;
                });
        }

    § Functor (“a * xx + yy”) now specified inline
    § Can capture variables from surrounding scope (“a”), which eliminates the boilerplate class
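For readers without a C++ AMP toolchain, the functor and lambda forms from the two slides above can be compared on the CPU using only the standard library. This is a minimal sketch, not Bolt code: `saxpy_functor` and `saxpy_lambda` are hypothetical helper names, `std::transform` stands in for `bolt::transform`, and the `restrict(cpu, amp)` qualifier is omitted because it is specific to C++ AMP.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Functor form: the scalar is stored as a member, as in SaxpyFunctor on the slide.
struct SaxpyFunctor {
    float a;
    explicit SaxpyFunctor(float a_) : a(a_) {}
    float operator()(float xx, float yy) const { return a * xx + yy; }
};

// Hypothetical helper: z = a*x + y using the functor.
std::vector<float> saxpy_functor(float a, const std::vector<float>& x,
                                 const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(), SaxpyFunctor(a));
    return z;
}

// Hypothetical helper: same computation, with the scalar captured by a lambda.
std::vector<float> saxpy_lambda(float a, const std::vector<float>& x,
                                const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [a](float xx, float yy) { return a * xx + yy; });
    return z;
}
```

Both helpers produce the same output; the lambda version simply moves the state (`a`) from a class member into the capture list.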
  • 8. BOLT FOR OPENCL™

        #include <clbolt/sort.h>
        #include <vector>
        #include <algorithm>
        #include <cstdlib>   // rand

        int main()
        {
            // generate random data (on host)
            std::vector<int> a(1000000);
            std::generate(a.begin(), a.end(), rand);

            // sort, run on best device
            clbolt::sort(a.begin(), a.end());
        }

    § Interface similar to familiar C++ Standard Template Library
    § clbolt uses OpenCL™ below the API level
       –  Host data copied or mapped to the GPU
       –  First call to clbolt::sort will generate and compile a kernel
    § More advanced use cases allow the programmer to supply a kernel in OpenCL™
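The slide states only that the first call generates and compiles a kernel. One plausible way to get that behaviour is a cache keyed by kernel name, sketched below. Everything here is an assumption for illustration: `KernelCache` and `CompiledKernel` are hypothetical names, the "compile" step just stores the source string, and nothing is implied about how Bolt manages kernels internally.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for a compiled kernel; a real implementation would hold an
// OpenCL program or kernel handle instead of the source text.
struct CompiledKernel {
    std::string source;
};

class KernelCache {
public:
    // Returns the kernel for `name`, "compiling" it only on first use.
    const CompiledKernel& get(const std::string& name, const std::string& source) {
        assert(!name.empty());
        std::map<std::string, CompiledKernel>::iterator it = cache_.find(name);
        if (it == cache_.end()) {
            ++compile_count_;  // the expensive step happens once per kernel name
            it = cache_.insert(std::make_pair(name, CompiledKernel{source})).first;
        }
        return it->second;
    }

    int compile_count() const { return compile_count_; }

private:
    std::map<std::string, CompiledKernel> cache_;
    int compile_count_ = 0;
};
```

With a cache like this, the compile cost is paid on the first call for each distinct kernel and later calls only pay for a lookup.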
  • 9. BOLT FOR OPENCL™ : USER-SPECIFIED FUNCTOR

        #include <clbolt/transform.h>
        #include <vector>

        BOLT_FUNCTOR(SaxpyFunctor,
        struct SaxpyFunctor
        {
            float _a;
            SaxpyFunctor(float a) : _a(a) {};

            float operator() (const float &xx, const float &yy)
            {
                return _a * xx + yy;
            };
        };
        );

        void main2()
        {
            SaxpyFunctor s(100);
            std::vector<float> x(1000000); // initialization not shown
            std::vector<float> y(1000000); // initialization not shown
            std::vector<float> z(1000000);

            clbolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
        };

    § Challenge: OpenCL™ split-source model
       –  Host code in C or C++
       –  OpenCL™ code specified in strings
    § Solution:
       –  BOLT_FUNCTOR macro creates both host-side and string versions of “SaxpyFunctor” class definition
          §  Class name (“SaxpyFunctor”) stored in TypeName trait
          §  OpenCL™ kernel code (SaxpyFunctor class def) stored in ClCode trait.
       –  Clbolt function implementation
          §  Can retrieve traits from class name
          §  Uses TypeName and ClCode to construct a customized transform kernel
          §  First call to clbolt::transform compiles the kernel
       –  Advanced users can directly create ClCode trait
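The dual host/string trick described above can be reproduced with a variadic macro and the preprocessor's stringizing operator. The sketch below is not Bolt's actual BOLT_FUNCTOR definition: `SKETCH_FUNCTOR` is a hypothetical macro, and the `TypeName` and `ClCode` templates are assumed shapes for the traits named on the slide.

```cpp
#include <cassert>
#include <string>

// Assumed trait templates, named after the traits mentioned on the slide.
// Only the specializations generated by the macro are ever defined.
template <typename T> struct TypeName;
template <typename T> struct ClCode;

// Emits the class definition once as ordinary host code and once as a string.
// A variadic macro is used so commas in the class body do not split arguments.
#define SKETCH_FUNCTOR(T, ...)                                  \
    __VA_ARGS__                                                 \
    template <> struct TypeName<T> {                            \
        static std::string get() { return #T; }                 \
    };                                                          \
    template <> struct ClCode<T> {                              \
        static std::string get() { return #__VA_ARGS__; }       \
    };

SKETCH_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor {
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator()(const float& xx, const float& yy) const { return _a * xx + yy; }
};
)

// Small self-check: the host-side class is usable, and the traits carry
// the class name and its source text for a later kernel build.
inline void check_sketch_functor() {
    SaxpyFunctor s(2.0f);
    assert(s(3.0f, 1.0f) == 7.0f);
    assert(TypeName<SaxpyFunctor>::get() == "SaxpyFunctor");
    assert(ClCode<SaxpyFunctor>::get().find("struct SaxpyFunctor") != std::string::npos);
}
```

A library function templated on the functor type can then look up `TypeName<F>` and `ClCode<F>` and splice both into the kernel source it compiles on first call.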
  • 10. BOLT: C++ AMP VS. OPENCL™

    BOLT for C++ AMP
    §  C++ template library for HSA
       –  Developer can customize data types and operations
       –  Provide library of optimized routines for AMD GPUs.
    §  C++ Host Language
    §  Kernels marked with “restrict(cpu, amp)”
    §  Kernels written in C++ AMP kernel language
       –  Restricted set of C++
    §  Kernels compiled at compile-time
    §  C++ Lambda Syntax Supported
    §  Functors may contain array_view
    §  Parameters can use host data structures (i.e. std::vector)
    §  Parameters can be array or array_view types
    §  Use “bolt” namespace

    BOLT for OpenCL™
    §  C++ template library for HSA
       –  Developer can customize data types and operations
       –  Provide library of optimized routines for AMD GPUs.
    §  C++ Host Language
    §  Kernels marked with “BOLT_FUNCTOR” macro
    §  Kernels written in OpenCL™ kernel language
       –  Subset of C99, with extensions (i.e. vectors, builtins)
    §  Kernels compiled at runtime, on first call
       –  Some compile errors shown on first call
    §  C++11 Lambda Syntax NOT supported
    §  Functors may not contain pointers
    §  Parameters can use host data structures (i.e. std::vector)
    §  Parameters can be cl::Buffer or cl_buffer types
    §  Use “clbolt” namespace
  • 11. BOLT : WHAT’S NEW?
      § Optimized template library routines for common GPU functions
         –  For OpenCL™ and C++ AMP, across multiple platforms
      § Direct interfaces to host memory structures (e.g. std::vector)
         –  Leverage HSA unified address space and zero-copy memory
         –  C++ AMP array and cl::Buffer also supported if memory already on device
      § Bolt submits to the entire platform rather than a specific device
         –  Runtime automatically selects the device
         –  Provides opportunities for load-balancing
         –  Provides an optimized CPU path if no GPU is available
         –  Overriding the selection to target a specific accelerator is supported
         –  Enables developers to fearlessly move to the GPU
      § Bolt will contain new APIs optimized for HSA Devices
         –  Multi-device bolt::pipeline, bolt::parallel_filter
  • 12. EXEMPLARY ISV PROOF-POINT
      § “Hessian” kernel from “MotionDSP Ikena”
         –  Commercially available video enhancement software
         –  Optimized for CPU and GPU
      § Basic Hessian Algorithm
         –  Two input images I and W
         –  Transform, followed by reduce (“transform_reduce”)
            §  For each pixel in image, compute 14 float coefficients
            §  Sum the coefficients for all the pixels; final result is 14 floats
         –  Complex, computationally intense, real-world algorithm
      § Developed multiple implementations of Hessian kernel
         –  CPU, GPU, Bolt

      Hessian Algorithm Pseudo Code:
         // x,y are coordinates of pixel to transform
         // Pixel difference:
         It = W(y, x) - I(y, x);
         // Horizontal gradient: half the difference of the left/right neighbors:
         Ix = 0.5f * ( W(y, x+1) - W(y, x-1) );
         // Vertical gradient: half the difference of the top/bottom neighbors:
         Iy = 0.5f * ( W(y+1, x) - W(y-1, x) );
         X = x dist of this pixel from center
         Y = y dist of this pixel from center
         …
         // Compute for each pixel:
         H[ 0] = (Ix*X+Iy*Y) * (Ix*X+Iy*Y)
         H[ 1] = (Ix*X-Iy*Y) * (Ix*X+Iy*Y)
         H[ 2] = (Ix*X-Iy*Y) * (Ix*X-Iy*Y)
         H[ 3] = (Ix       ) * (Ix*X+Iy*Y)
         H[ 4] = (Ix       ) * (Ix*X-Iy*Y)
         H[ 5] = (Ix       ) * (Ix       )
         H[ 6] = (Iy       ) * (Ix*X+Iy*Y)
         H[ 7] = (Iy       ) * (Ix*X-Iy*Y)
         H[ 8] = (Iy       ) * (Ix       )
         H[ 9] = (Iy       ) * (Iy       )
         H[10] = (It       ) * (Ix*X+Iy*Y)
         H[11] = (It       ) * (Ix*X-Iy*Y)
         H[12] = (It       ) * (Ix       )
         H[13] = (It       ) * (Iy       )
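The transform-then-reduce structure of the pseudo code can be sketched as a serial C++ reference. The sketch makes several assumptions that the slide does not state: `Image`, `hessianAt` and `hessianSum` are hypothetical names, X and Y are taken to be signed offsets from the image center, border pixels are skipped because their neighbors are missing, and whatever the slide elides with “…” is simply left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Coeffs = std::array<float, 14>;

// Hypothetical row-major, single-channel image type.
struct Image {
    std::size_t width, height;
    std::vector<float> pixels;
    float operator()(std::size_t y, std::size_t x) const { return pixels[y * width + x]; }
};

// Transform step: the 14 coefficients of one interior pixel (x, y).
Coeffs hessianAt(const Image& I, const Image& W, std::size_t x, std::size_t y) {
    const float It = W(y, x) - I(y, x);
    const float Ix = 0.5f * (W(y, x + 1) - W(y, x - 1));
    const float Iy = 0.5f * (W(y + 1, x) - W(y - 1, x));
    // Assumption: X and Y are signed offsets from the image center.
    const float X = static_cast<float>(x) - 0.5f * static_cast<float>(W.width - 1);
    const float Y = static_cast<float>(y) - 0.5f * static_cast<float>(W.height - 1);
    const float p = Ix * X + Iy * Y;
    const float m = Ix * X - Iy * Y;
    return Coeffs{{p * p,  m * p,  m * m,  Ix * p,  Ix * m,  Ix * Ix, Iy * p,
                   Iy * m, Iy * Ix, Iy * Iy, It * p, It * m,  It * Ix, It * Iy}};
}

// Reduce step: element-wise sum of the coefficients over all interior pixels.
Coeffs hessianSum(const Image& I, const Image& W) {
    assert(I.width == W.width && I.height == W.height);
    Coeffs sum{};  // value-initialized to 14 zeros
    for (std::size_t y = 1; y + 1 < W.height; ++y) {
        for (std::size_t x = 1; x + 1 < W.width; ++x) {
            const Coeffs h = hessianAt(I, W, x, y);
            for (std::size_t k = 0; k < sum.size(); ++k) sum[k] += h[k];
        }
    }
    return sum;
}
```

Keeping the per-pixel transform separate from the sum mirrors the “transform_reduce” split on the slide: only `hessianAt` is algorithm-specific, and the summation loop is the part a library could parallelize.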
  • 13. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS (Exemplary ISV “Hessian” Kernel)
      [Figure: lines of code (LOC axis, 0 to 350), broken down into Init, Compile, Copy, Launch, Algorithm and Copy-back, together with relative performance (axis 0 to 35), for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP and HSA Bolt. The plotted values are not recoverable from the transcript.]
      13 | The Programmer’s Guide to a Universe of Possibility | June 12, 2012
  • 14. PERFORMANCE PORTABILITY - INTRODUCTION
      § For many algorithms, the core operation is the same on CPU and GPU
         –  See sort, saxpy, hessian examples
         –  Same core operation
         –  Differences in how data is routed to the core operation
      § Bolt hides the device-specific routing details inside the library function implementation
         –  GPU implementations:
            §  GPU-friendly data strides
            §  Launch enough threads to hide memory latency
            §  Group Memory and work-group communication
         –  CPU implementations:
            §  CPU-friendly data strides
            §  Launch enough threads to use all cores
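To make the distinction between the core operation and its routing concrete, here is a small sketch in which one functor is driven by two index mappings. The function names are hypothetical, and the outer loops run sequentially; they only stand in for CPU worker threads and GPU lanes, so this shows the access patterns rather than real parallel execution.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-friendly routing: each worker owns one contiguous chunk, so a core
// streams through neighboring elements.
template <typename Op>
void transformChunked(const std::vector<float>& in, std::vector<float>& out,
                      std::size_t workers, Op op) {
    assert(workers > 0 && in.size() == out.size());
    const std::size_t chunk = (in.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {  // stands in for one CPU thread
        const std::size_t begin = std::min(w * chunk, in.size());
        const std::size_t end = std::min(begin + chunk, in.size());
        for (std::size_t i = begin; i < end; ++i) out[i] = op(in[i]);
    }
}

// GPU-friendly routing: lane l handles l, l + lanes, l + 2 * lanes, ... so
// adjacent lanes touch adjacent elements in the same step.
template <typename Op>
void transformStrided(const std::vector<float>& in, std::vector<float>& out,
                      std::size_t lanes, Op op) {
    assert(lanes > 0 && in.size() == out.size());
    for (std::size_t l = 0; l < lanes; ++l) {  // stands in for one GPU lane
        for (std::size_t i = l; i < in.size(); i += lanes) out[i] = op(in[i]);
    }
}
```

Both drivers produce the same output for the same `op`; only the order in which elements reach the core operation differs, which is the part the slide says the library hides.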
  • 15. PERFORMANCE PORTABILITY – RESULTS
      [Figure: CPU Performance vs Programming Model (Exemplary ISV “Hessian” Kernel). Relative performance (axis 0.00 to 4.50) for Serial CPU, TBB CPU, OpenCL (CPU) and HSA Bolt (CPU). The plotted values are not recoverable from the transcript.]
  • 16. PERFORMANCE PORTABILITY – WHAT’S NEW ?
      § New GPU programming models are close to CPU programming models
         –  C++ AMP : Single-source, (restricted) C++11 kernel language, high-quality debugger/profiler, etc.
      § Shared Virtual Memory in HSA
         –  Removes tedious copies between address spaces
         –  Will allow use of complex pointer-containing data structures
      § Fewer performance cliffs in modern GPU architectures (e.g. AMD GCN)
         –  Reduce need for GPU-specific optimizations in core operation
         –  Example: 14:7:1 Bandwidth Ratio for Group:Cache:Global Memory
      § Autovectorization
         –  Modern compilers include auto-vectorization support
         –  Restrictions of GPU programming models facilitate vectorization
      § Finally, Bolt functors can provide device-specific implementations if needed
  • 17. HSA LOAD BALANCING : KEY FEATURES AND OBSERVATIONS
      §  High-performance shared virtual memory
         –  Developers no longer have to worry about data location (i.e. device vs host)
      §  HSA platforms have tightly integrated CPU and GPU
         –  GPU better at wide vector parallelism, extracting memory bandwidth, latency hiding
         –  CPU better at fine-grained vector parallelism, cache-sensitive code, control-flow
      §  Bolt Abstractions
         –  Provides insight into the characteristics of the algorithm
            §  Reduce vs Transform vs parallel_filter
         –  Abstraction above the details of a “kernel launch”
            §  Don’t need to specify device, workgroup shape, work-items, number of kernels, etc.
            §  Runtime may optimize these for the platform
      §  Bolt has access to both optimized CPU and GPU implementations, at the same time
         –  Let’s use both!
  • 18. EXAMPLES OF HSA LOAD-BALANCING
      § Data Size
         –  Description: Run large data sizes on GPU, small on CPU
         –  Exemplary use cases: Same call-site used for varying data sizes
      § Reduction
         –  Description: Run initial reduction phases on GPU, run final stages on CPU
         –  Exemplary use cases: Any reduction operation
      § Border/Edge Optimization
         –  Description: Run wide center regions on GPU, run border regions on CPU
         –  Exemplary use cases: Image processing
      § Platform Super-Device
         –  Description: Distribute workgroups to available processing units on the entire platform
         –  Exemplary use cases: Kernel has similar performance/energy on CPU and GPU
      § Heterogeneous Pipeline
         –  Description: Run a pipelined series of user-defined stages. Stages can be CPU-only, GPU-only, or CPU or GPU
         –  Exemplary use cases: Video processing pipeline
      § Parallel_filter
         –  Description: GPU scans all candidates and rejects early mismatches; CPU more deeply evaluates the survivors
         –  Exemplary use cases: Haar detector, word search, audio search
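The “Data Size” and “Reduction” entries can be combined in one small sketch. Everything here runs on the CPU; the comments mark which phase a load-balancing runtime might place on which device. `partialSums`, `balancedSum`, `blockSize` and `smallInputCutoff` are hypothetical names and parameters, not part of any Bolt interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Phase 1 (the wide part a runtime might place on the GPU): one partial sum
// per fixed-size block.
std::vector<float> partialSums(const std::vector<float>& data, std::size_t blockSize) {
    assert(blockSize > 0);
    std::vector<float> partials;
    for (std::size_t begin = 0; begin < data.size(); begin += blockSize) {
        const std::size_t end = std::min(begin + blockSize, data.size());
        partials.push_back(std::accumulate(
            data.begin() + static_cast<std::ptrdiff_t>(begin),
            data.begin() + static_cast<std::ptrdiff_t>(end), 0.0f));
    }
    return partials;
}

// Data-size rule plus two-phase reduction. Small inputs skip phase 1 and are
// summed directly; larger inputs get the wide phase first, and the short
// final stage (a good fit for the CPU) folds the few partial sums.
float balancedSum(const std::vector<float>& data, std::size_t blockSize,
                  std::size_t smallInputCutoff) {
    if (data.size() < smallInputCutoff) {
        return std::accumulate(data.begin(), data.end(), 0.0f);
    }
    const std::vector<float> partials = partialSums(data, blockSize);
    return std::accumulate(partials.begin(), partials.end(), 0.0f);
}
```

The call site is the same for every input size, which matches the use case listed for the “Data Size” entry; only the cutoff decides which path runs.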
  • 19. HETEROGENEOUS PIPELINE
      § Mimics a traditional manufacturing assembly line
         –  Developer supplies a series of pipeline stages
         –  Each stage processes its input token, passes an output token to the next stage
         –  Stages can be either CPU-only, GPU-only, or CPU/GPU
      § CPU/GPU tasks are dynamically scheduled
         –  Use queue depth and estimated execution time to drive scheduling decision
         –  Adapt to variation in target hardware or system utilization
         –  Data location not an issue in HSA
         –  Leverage single source code
      § GPU kernels scheduled asynchronously
         –  Completion invokes next stage of the pipeline
      § Simple Video Pipeline Example:
         Video Decode (CPU-only) → Video Processing (CPU/GPU) → Video Render (GPU-only)
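One way to read “use queue depth and estimated execution time to drive scheduling decision” is as a comparison of estimated finish times. The sketch below assumes exactly that heuristic; the types and the formula are illustrative guesses and are not taken from Bolt or the HSA runtime.

```cpp
#include <cassert>
#include <cstddef>

enum class Affinity { CpuOnly, GpuOnly, CpuOrGpu };
enum class Target { Cpu, Gpu };

// Hypothetical snapshot of one device's work queue.
struct DeviceState {
    std::size_t queueDepth;          // tasks already waiting on this device
    double estimatedSecondsPerTask;  // running estimate for this stage
};

// Assumed heuristic: everything already queued runs first, then the new task.
double estimatedFinish(const DeviceState& d) {
    assert(d.estimatedSecondsPerTask >= 0.0);
    return static_cast<double>(d.queueDepth + 1) * d.estimatedSecondsPerTask;
}

// Pinned stages go to their device; flexible stages go wherever the
// estimated finish time is lower (ties go to the GPU).
Target schedule(Affinity affinity, const DeviceState& cpu, const DeviceState& gpu) {
    if (affinity == Affinity::CpuOnly) return Target::Cpu;
    if (affinity == Affinity::GpuOnly) return Target::Gpu;
    return estimatedFinish(gpu) <= estimatedFinish(cpu) ? Target::Gpu : Target::Cpu;
}
```

In the video example, decode would be scheduled with `Affinity::CpuOnly`, render with `Affinity::GpuOnly`, and the processing stage with `Affinity::CpuOrGpu`, so only the middle stage moves as queues fill up.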
  • 20. CASCADE DEPTH ANALYSIS
      [Figure: Cascade Depth. Depth axis from 0 to 25, with legend bands 0-5, 5-10, 10-15, 15-20 and 20-25. The plotted values are not recoverable from the transcript.]
  • 21. PARALLEL_FILTER
      §  Target applications with a “Filter” pattern
         –  Filter out a small number of results from a large initial pool of candidates
         –  Initial phases best run on GPU:
            §  Large data sets (too big for caches), wide vector, high-bandwidth
         –  Tail phases best run on CPU:
            §  Smaller data sets (may fit in cache), divergent control flow, fine-grained vector width
         –  Examples: Haar detector, word search, acoustic search
      §  Developer specifies:
         –  Execution Grid
         –  Iteration state type and initial value
         –  Filter function
            §  Accepts a point to process and the current iteration state
            §  Returns True to continue processing or False to exit
      §  BOLT / HSA Runtime
         –  Automatically hands off work between CPU and GPU
         –  Balances work by adjusting the split point between GPU and CPU
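The three developer-supplied pieces (grid, iteration state, filter function) and the runtime's split point can be sketched serially. `parallelFilterSketch` and its signature are hypothetical, since the slide does not show the bolt::parallel_filter interface. The sketch also assumes that a point is accepted when the filter returns true for every iteration up to `maxDepth`, and both phases run on the CPU here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Filter is callable as bool(std::size_t point, State& state): true means
// "keep processing this point", false means "reject it".
template <typename State, typename Filter>
std::vector<std::size_t> parallelFilterSketch(std::size_t gridSize, State initialState,
                                              Filter filter, int splitDepth, int maxDepth) {
    assert(0 <= splitDepth && splitDepth <= maxDepth);

    // Early phase (the part a runtime might place on the GPU): every point
    // in the grid runs the first splitDepth iterations.
    std::vector<std::size_t> survivors;
    std::vector<State> survivorStates;
    for (std::size_t point = 0; point < gridSize; ++point) {
        State state = initialState;
        bool keep = true;
        for (int depth = 0; depth < splitDepth && keep; ++depth) keep = filter(point, state);
        if (keep) {
            survivors.push_back(point);
            survivorStates.push_back(state);
        }
    }

    // Tail phase (the part a runtime might place on the CPU): only the
    // survivors continue, each from its saved iteration state.
    std::vector<std::size_t> accepted;
    for (std::size_t k = 0; k < survivors.size(); ++k) {
        bool keep = true;
        for (int depth = splitDepth; depth < maxDepth && keep; ++depth)
            keep = filter(survivors[k], survivorStates[k]);
        if (keep) accepted.push_back(survivors[k]);
    }
    return accepted;
}

// Toy filter for illustration: stage s keeps a point only if it is divisible
// by s + 2. The iteration state is the stage counter.
bool divisibleByStage(std::size_t point, int& stage) {
    const bool keep = point % static_cast<std::size_t>(stage + 2) == 0;
    ++stage;
    return keep;
}
```

Because each survivor carries its own iteration state across the hand-off, moving `splitDepth` changes only where the work runs, not which points are accepted. That property is what would let a runtime adjust the split point freely.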
  • 22. SUMMARY
      § Bolt: C++ Template Library
         –  Optimized GPU and HSA Library routines
         –  Customizable via templates
         –  For both OpenCL™ and C++ AMP
      § Enjoy the unique advantages of the HSA Platform
         –  High-performance shared virtual memory
         –  Tightly integrated CPU and GPU
      § Enable advanced HSA features
         –  A single source base for CPU and GPU
         –  Platform load balancing across CPU and GPU
  • 23. BACKUP
  • 24. BENCHMARK CONFIGURATION INFORMATION
      § Slides 13 and 15
         –  AMD A10-5800K APU with Radeon™ HD Graphics
            §  CPU: 4 cores, 3800 MHz (4200 MHz Turbo)
            §  GPU: AMD Radeon™ HD 7660D, 6 compute units, 800 MHz
            §  4 GB RAM
         –  Software:
            §  Windows 7 Professional SP1 (64-bit OS)
            §  AMD OpenCL™ 1.2 AMD-APP (937.2)
            §  Microsoft Visual Studio 11 Beta
  • 25. Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. [For AMD-speakers only] © 2012 Advanced Micro Devices, Inc. [For non-AMD speakers only] The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied. 