accULL: An User-directed Approach to Heterogeneous Programming

Ruymán Reyes, Iván López-Rodríguez, Juan J. Fumero, Francisco de Sande

Dept. E.I.O. y Computación, Univ. de La Laguna, 38271 La Laguna, Spain

International Workshop on Heterogeneous Architectures and Computing
Leganés, July 13 2012

Outline

    1   Heterogeneous Architectures
    2   accULL: An Early OpenACC Implementation
    3   Results
    4   Conclusions and Future Work

Introduction

    The rise of GPUs: impressive results

GPUs

    Successfully used for general-purpose computing (GPGPU)

Heterogeneous Architectures

    But ...
    It is not easy!

Heterogeneous Architectures

    A GPU is not a CPU
        GPUs are inherently SIMD processors
        CPUs and GPUs tackle the processing of tasks differently
        CPUs excel at serial processing
        GPUs are better suited to applications that demand high floating-point throughput at lower power consumption

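The contrast between SIMD-style and serial processing can be illustrated with two small loops. This is only a sketch: the function names `saxpy` and `prefix_sum` are chosen here for illustration and do not come from the slides.

```c
#include <stddef.h>

/* Data-parallel loop: iteration i reads and writes only element i, so all
 * iterations are independent and can be spread over SIMD lanes or GPU
 * threads. */
void saxpy(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}

/* Serial loop: iteration i needs the value produced by iteration i - 1
 * (a loop-carried dependency), so this formulation has to run in order. */
void prefix_sum(size_t n, float *x)
{
    for (size_t i = 1; i < n; ++i)
        x[i] += x[i - 1];
}
```

The first loop is the kind of work a GPU handles well. The second, written this way, favors a fast serial core.
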
Parallel Languages: MPI (distributed memory) and OpenMP (shared memory)

    They cannot be used to program GPUs

    New programming models are required...

GPGPU Programming

    [Figure: today's GPGPU software stack]

CUDA from NVIDIA

    Pros: performance, easier than OpenCL
    Con: only for NVIDIA hardware

    CUDA Code Example
        __global__ void mmkernel(float *a, float *b, float *c, int n,
                                 int m, int p) {
            int i = blockIdx.x * 32 + threadIdx.x;
            int j = blockIdx.y;
            float sum = 0.0f;
            for (int k = 0; k < p; ++k) sum += b[i + n * k] * c[k + p * j];
            a[i + n * j] = sum;
        }

GPGPU Programming

    OpenCL: Open Computing Language
        A framework developed by the Khronos Group
        A standard
        OpenCL programs execute across heterogeneous platforms: CPUs + GPUs + other processors
        Pros: can be used with any device, it is a standard
        Cons: more complex than CUDA, immature

GPGPU Programming

    Common Problems
        1   The programmer needs to know low-level details of the architecture
        2   Source codes need to be rewritten:
                One version for CPU
                A different version for GPU
        3   Good performance requires a great effort in parameter tuning
        4   CUDA and OpenCL are new and complex for non-experts

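To make the "one version for CPU, a different version for GPU" point concrete, here is a sketch of what a CPU counterpart of the `mmkernel` CUDA example could look like. The name `mm_cpu` is hypothetical, and the sketch assumes that `m` is the number of columns of `a` (the CUDA kernel receives `m` but never uses it, so this is a guess).

```c
/* CPU counterpart of the mmkernel CUDA example. The two outer loops take
 * the place of the thread grid: i corresponds to
 * blockIdx.x * 32 + threadIdx.x and j to blockIdx.y.
 * Assumption: a is n x m, b is n x p, c is p x m, stored column by column,
 * which matches the index expressions used in the kernel. */
void mm_cpu(float *a, const float *b, const float *c, int n, int m, int p)
{
    for (int j = 0; j < m; ++j) {
        for (int i = 0; i < n; ++i) {
            float sum = 0.0f;
            for (int k = 0; k < p; ++k)
                sum += b[i + n * k] * c[k + p * j];
            a[i + n * j] = sum;
        }
    }
}
```

The arithmetic is the same in both versions. What changes is how the iteration space is expressed, and that is the part the programmer has to rewrite and tune by hand.
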
GPGPU Programming

    Our claim: new models and tools are needed if we want to spread the use of GPUs in HPC

    Is there anything new on the horizon?
        hiCUDA
        PGI accelerator model
        CAPS HMPP
        OpenACC

GPGPU Programming

    hiCUDA
        Translates each directive into a CUDA call
        It is able to use the GPU shared memory
        Only works with NVIDIA devices
        The programmer still needs to know hardware details

    hiCUDA Code Example:
        ...
        #pragma hicuda global alloc c[*][*] copyin

        #pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
            #pragma hicuda loop_partition over_tblock over_thread
            for (i = 0; i < N; i++) {
            #pragma hicuda loop_partition over_tblock over_thread
            for (j = 0; j < N; j++) {
                double sum = 0.0;
            ...

GPGPU Programming

    PGI accelerator model
        It is a higher level (directive-based) approach
        Fortran and C are supported
        Precursor to OpenACC

    PGI Accelerator Model Code Example:
        #pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
        {
            #pragma acc region
            {
                #pragma acc loop independent
                for (j = 0; j < n; j++)
                {
                    #pragma acc loop independent
                    for (i = 0; i < l; i++) {
                        double sum = 0.0;
                        for (k = 0; k < m; k++) {
                            sum += b[i + k * l] * c[k + j * m];
                        }
                        a[i + j * l] = sum;
                    }
                }
            }
        }

GPGPU Programming

    OpenACC: introduced in November 2011 at SuperComputing 2011
        A directive-based language
        Aims to become a standard
        Supported by Cray, NVIDIA, PGI and CAPS
        A single source code for CPU and GPU
        Platform independent
        Easier for beginners

GPGPU Programming

    [Figure: OpenACC code example]

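The code shown on this slide is not part of the transcript. As a stand-in, the following is a minimal sketch of the OpenACC style. It is not the authors' example, and the function name `vec_add` is made up for illustration. It uses only the `kernels` and `loop` directives with `copyin`/`copyout` data clauses.

```c
/* Illustrative OpenACC sketch, not the example from the slide.
 * An OpenACC compiler offloads the loop to the accelerator. Any other C
 * compiler ignores the unknown pragmas and runs the loop on the CPU,
 * which is the "single source code for CPU/GPU" idea. */
void vec_add(int n, const float *a, const float *b, float *c)
{
    #pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }
}
```

The data clauses state which arrays travel to and from the device, so the programmer writes no explicit allocation or transfer calls.
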
accULL: Our OpenACC implementation

    accULL is a framework developed to support OpenACC programs

    accULL = YaCF + Frangollo

    It is a two-layer implementation: Compiler + RunTime Library

YaCF: the compiler

    YaCF (Yet Another Compiler Framework) is the compiler framework we have developed
    Some features:
        It is a source-to-source (StS) compiler
        Written in Python from scratch with an object-oriented approach
        Receives C99 as input
        It is able to generate CUDA/OpenCL kernels from annotated code
        A driver for compiling OpenACC directives has been added
        YaCF translates the directives into Frangollo calls
        A public-domain development

Frangollo: the RunTime

    Frangollo is a runtime that supports execution on heterogeneous platforms
        1   Encapsulates the hardware issues
        2   Is able to run on NVIDIA devices using CUDA
        3   Is able to manage a wider range of devices using OpenCL

Frangollo: the RunTime

    [Figure: compilation flow]

Frangollo: the RunTime

    Its responsibilities
        1   Manages the memory
        2   Initializes the devices
        3   Launches the kernels
    It makes programmers' lives easier!

Frangollo: Memory Management

    [Figure: a program workflow]

Frangollo: Structure

    Interface layer: a door to Frangollo
        Some functions in the C interface:
            registerVar
            launchKernel
            getNumDevices

    Abstract layer
        Frangollo uses a class hierarchy
        All classes in this layer are abstract

    Device layer
        Encapsulates all target-language-related functions
        New platforms could be added in the future

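The three layers can be pictured with a small mock-up. This is not Frangollo code. The slides only name `registerVar`, `launchKernel` and `getNumDevices`, so every signature below, the `sketch_` prefix, the `device_ops` table and the host-memory backend are assumptions made for illustration. Frangollo uses a class hierarchy, and a table of function pointers stands in for it here so the sketch stays in plain C.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Abstract layer (hypothetical): operations every backend must provide. */
typedef struct device_ops {
    void *(*alloc)(size_t bytes);
    void  (*copy_in)(void *dev, const void *host, size_t bytes);
    void  (*copy_out)(void *host, const void *dev, size_t bytes);
    void  (*run)(void (*kernel)(float *, size_t), void *dev, size_t n);
    void  (*release)(void *dev);
} device_ops;

/* Device layer (hypothetical): a stand-in backend that uses host memory.
 * A real backend would wrap CUDA or OpenCL calls in these functions. */
static void *host_alloc(size_t bytes) { return malloc(bytes); }
static void host_copy_in(void *dev, const void *host, size_t bytes) { memcpy(dev, host, bytes); }
static void host_copy_out(void *host, const void *dev, size_t bytes) { memcpy(host, dev, bytes); }
static void host_run(void (*kernel)(float *, size_t), void *dev, size_t n) { kernel((float *)dev, n); }
static void host_release(void *dev) { free(dev); }

static const device_ops backend = {
    host_alloc, host_copy_in, host_copy_out, host_run, host_release
};

/* Interface layer (hypothetical signatures): the plain C entry points that
 * compiler-generated code would call. */
typedef struct sketch_var {
    float *host;   /* original host array */
    void  *dev;    /* device-side copy */
    size_t n;      /* number of elements */
} sketch_var;

int sketch_getNumDevices(void) { return 1; }

/* Allocates device storage for a host array and copies the data in. */
int sketch_registerVar(sketch_var *v, float *host, size_t n)
{
    v->host = host;
    v->n = n;
    v->dev = backend.alloc(n * sizeof(float));
    if (v->dev == NULL)
        return -1;
    backend.copy_in(v->dev, host, n * sizeof(float));
    return 0;
}

/* Runs a kernel on the device-side copy of the variable. */
void sketch_launchKernel(sketch_var *v, void (*kernel)(float *, size_t))
{
    backend.run(kernel, v->dev, v->n);
}

/* Copies the result back to the host array and frees the device copy. */
void sketch_releaseVar(sketch_var *v)
{
    backend.copy_out(v->host, v->dev, v->n * sizeof(float));
    backend.release(v->dev);
    v->dev = NULL;
}

/* Example kernel used to exercise the mock runtime. */
void double_all(float *x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x[i] *= 2.0f;
}
```

In this arrangement, code above the interface layer never mentions CUDA or OpenCL. Supporting a new platform means supplying another `device_ops` table.
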
Platforms

    M1: A desktop computer
        Intel Core i7 930 processor (2.80 GHz)
        1 MB of L2 cache, 8 MB of L3 cache, shared by the four cores
        4 GB RAM
        2 GPU devices attached:
            Tesla C1060 with 3 GB memory (M1a)
            Tesla C2050 (Fermi) with 4 GB memory (M1b)
            Accelerator platform is CUDA 4.0

        M1a/M1b mimic the scenario of an average OpenACC developer
        She can purchase a GPU card and plug it into her desktop computer
        It is a relatively cheap platform

Platforms

    M2: A cluster node
        2 quad-core Intel Xeon E5410 (2.25 GHz) processors
        24 GB memory
        A Fermi C2050 card attached, with 448 CUDA cores and 4 GB memory
        Accelerator platform: CUDA 4.0

        M2 is a node of a common multinode cluster
        Today's clusters combine multicore processors and GPU devices, so we can take advantage of OpenACC
        This kind of compute node has higher acquisition and maintenance costs than M1

Platforms

    M3: A second cluster
        M3 is a shared memory system
        4 Intel Xeon E7 4850 CPUs
        2.50 MB L2 cache and 24 MB L3 cache (for all its 10 cores)
        6 GB of memory per core
        Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU

        M3 showcases an alternative use of OpenCL
        There are implementations of OpenCL targeting shared memory systems
        Using CPU-targeted OpenCL platforms along with OpenACC represents an interesting alternative to OpenMP programming

Some of our Experiments

    Blocked Matrix Multiplication (M×M)

    Rodinia Benchmark
        The Rodinia Benchmark suite comprises compute-heavy applications
        It covers a wide range of applications
        OpenMP, CUDA and OpenCL versions are available for most of the codes in the suite
        From them, we have selected:
            Needleman-Wunsch (NW)
            HotSpot (HS)
            Speckle Reducing Anisotropic Diffusion (SRAD)

Matrix Multiplication

    Sketch of M×M in OpenACC
        /* Row-major storage: a is L x N, b is L x M, c is M x N */
        #pragma acc kernels name("mxm") copy(a[L*N]) \
                                        copyin(b[L*M], c[M*N] ...)
        {
        #pragma acc loop private(i, j) collapse(2)
        for (i = 0; i < L; i++)
            for (j = 0; j < N; j++)
                a[i * N + j] = 0.0;
        /* Iterate over blocks */
        for (ii = 0; ii < L; ii += tile_size)
          for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
              /* Iterate inside a block */
              #pragma acc loop collapse(2) private(i, j, k)
              for (j = jj; j < min(N, jj + tile_size); j++)
                for (i = ii; i < min(L, ii + tile_size); i++)
                  for (k = kk; k < min(M, kk + tile_size); k++)
                    a[i * N + j] += (b[i * M + k] * c[k * N + j]);
            }
        }

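The blocking scheme is easier to check without directives. The following is a plain C sketch of tiled matrix multiplication, assuming row-major storage with `a` of size L×N, `b` of size L×M and `c` of size M×N. The names `mxm_tiled`, `min_sz` and `tile` are illustrative, and `tile` is assumed to be greater than zero.

```c
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Computes a = b * c with loop tiling.
 * a is L x N, b is L x M, c is M x N, all stored row by row. */
void mxm_tiled(float *a, const float *b, const float *c,
               size_t L, size_t M, size_t N, size_t tile)
{
    for (size_t i = 0; i < L * N; ++i)
        a[i] = 0.0f;

    /* Iterate over blocks */
    for (size_t ii = 0; ii < L; ii += tile)
        for (size_t jj = 0; jj < N; jj += tile)
            for (size_t kk = 0; kk < M; kk += tile)
                /* Iterate inside a block; min_sz clips the last, partial tile */
                for (size_t i = ii; i < min_sz(L, ii + tile); ++i)
                    for (size_t j = jj; j < min_sz(N, jj + tile); ++j) {
                        float sum = a[i * N + j];
                        for (size_t k = kk; k < min_sz(M, kk + tile); ++k)
                            sum += b[i * M + k] * c[k * N + j];
                        a[i * N + j] = sum;
                    }
}
```

Each element of `a` accumulates one partial sum per `kk` block, so the result matches the untiled triple loop while each block's working set stays small.
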
Matrix Multiplication

    [Figure: floating-point performance for M×M on M2]
    [Figure: floating-point performance comparison between OpenMP, accULL, PGI and hiCUDA on M1]
    [Figure: comparison between the OpenMP-gcc implementation and Frangollo+OpenCL on M3 (shared memory system, 40 cores)]

Needleman-Wunsch

    [Figure: performance comparison of NW on M1b]
    accULL performs worse than the native versions

    [Figure: performance comparison of NW on M3 (shared memory, 40 cores)]
    The OpenMP versions outperform their OpenCL counterparts

HotSpot

    [Figure: performance comparison of different implementations, showing efficiency relative to native CUDA code on M1]
    In this case, accULL performs similarly to hiCUDA

    [Figure: speedup comparison with native CUDA code on M1b (Fermi)]
    [Figure: efficiency with respect to Intel OpenMP on M3 (shared memory, 40 cores)]

SRAD

    [Figure: speedup over the OpenMP implementation on M1b]
    [Figure: speedup over the OpenMP implementation on M3]

Conclusions I

    accULL
        First OpenACC implementation with support for both CUDA and OpenCL
        It supports most of the standard
        We validate accULL using codes from widely available benchmarks, on GPUs and CPUs
        It meets the requirements of a non-expert developer

Conclusions II

    accULL
        YaCF can be used as a fast-prototyping tool to explore optimizations
        Frangollo can be detached from YaCF and combined with a production-ready compiler
        Some issues can be tackled within Frangollo, independently of the compiler:
            Memory allocation
            Kernel scheduling
            Data splitting
            Overlapping of computation and communications
            Parallel reduction implementation

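As an illustration of the last item, here is a sketch of a pairwise (tree) reduction, a common way to structure sums for parallel hardware. The slides do not say which scheme Frangollo uses, so this is an assumption, and `tree_sum` is a made-up name.

```c
#include <stddef.h>

/* Pairwise (tree) sum. Overwrites x and returns the total.
 * Within one pass of the outer loop, the inner-loop iterations touch
 * disjoint pairs of elements, so they could run in parallel. About
 * log2(n) passes are needed. */
float tree_sum(float *x, size_t n)
{
    if (n == 0)
        return 0.0f;
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];
    return x[0];
}
```

Because the runtime owns kernel launches and device memory, a scheme like this can be changed inside the runtime without changing the code the compiler generates.
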
Future work


Heterogeneous
Architectures
                   There are plenty of opportunities to improve performance
accULL: An Early
OpenACC
Implementation
                       To implement 2D arrays as cudaMatrix or OCLImages to
Results                improve non-contiguous memory access
Conclusions and
Future Work
                       To complete the implementation of the asynchronous calls for
                       better performance
                       Multi-GPU support
                       To explore different possibilities of integration with MPI
                       Integration of Frangollo with a production-ready compiler




                                                                                      64 / 66
Future work


                   There are plenty of opportunities to improve performance
                       To implement 2D arrays as cudaMatrix or OCLImages to
                       improve non-contiguous memory access
                       To complete the implementation of the asynchronous calls for
                       better performance
                       Multi-GPU support
                       To explore different possibilities of integration with MPI
                       Integration of Frangollo with a production-ready compiler
                       New backend for FPGAs
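
Multi-GPU support would need, among other things, a way to divide an iteration space or an array among the available devices. The block below is a minimal sketch of one possible balanced split, written in plain C. It is an assumption for illustration, not the accULL design: the type `chunk_t` and the function `split_range` are hypothetical names, and the number of devices is passed in as an ordinary parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: a half-open index range [begin, end). */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Hypothetical helper, not part of accULL or Frangollo.
 * Returns the range of indices assigned to `device` when n elements are
 * divided among num_devices devices. Chunk sizes differ by at most one
 * element; the first (n % num_devices) devices get the extra element. */
chunk_t split_range(size_t n, size_t num_devices, size_t device)
{
    assert(num_devices > 0 && device < num_devices);

    size_t base = n / num_devices;   /* minimum chunk size */
    size_t rem  = n % num_devices;   /* leftover elements to spread */
    size_t extra_before = device < rem ? device : rem;

    chunk_t c;
    c.begin = device * base + extra_before;
    c.end   = c.begin + base + (device < rem ? 1 : 0);
    return c;
}
```

With 10 elements and 3 devices this gives the ranges [0, 4), [4, 7) and [7, 10). Each range could then be transferred to and processed on its own device.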




Thank you for your attention!


                           accULL: An User-directed Approach to
                               Heterogeneous Programming

                            http://accull.wordpress.com/

                       This work has been partially supported by the EU (FEDER),
                        the Spanish MEC (contracts TIN2008-06570-C04-03 and
                        TIN2011-24598), HPC-EUROPA2 and the Canary Islands
                                          Government, ACIISI


                                                                         F. de Sande
                                                                         fsande@ull.es

accULL (HAC Leganés)

  • 1. Heterogeneous Architectures accULL: An User-directed Approach to accULL: An Early OpenACC Heterogeneous Programming Implementation Results Conclusions and Future Work Ruym´n Reyes a Iv´n L´pez-Rodr´ a o ıguez Juan J. Fumero Francisco de Sande 1 Dept. E.I.O. y Computaci´n, o Univ. de La Laguna, 38271–La Laguna, Spain International Workshop on Heterogeneous Architectures and Computing Legan´s, July 13 2012 e 1 / 66
  • 2. Outline Heterogeneous Architectures accULL: An Early 1 Heterogeneous Architectures OpenACC Implementation Results Conclusions and 2 accULL: An Early OpenACC Implementation Future Work 3 Results 4 Conclusions and Future Work 2 / 66
  • 3. Outline Heterogeneous Architectures accULL: An Early 1 Heterogeneous Architectures OpenACC Implementation Results Conclusions and 2 accULL: An Early OpenACC Implementation Future Work 3 Results 4 Conclusions and Future Work 3 / 66
  • 4. Introduction The irruption of GPUs: Impressive Results Heterogeneous Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work 4 / 66
  • 5. GPUs Successfully used for general purpose computing (GPGPU) Heterogeneous Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work 5 / 66
  • 6. Heterogeneous Architectures Heterogeneous Architectures But ... accULL: An Early OpenACC Implementation Results Conclusions and Future Work It is not Easy! 6 / 66
  • 7. Heterogeneous Architectures A GPU is not a CPU Heterogeneous Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work GPUs are inherently SIMD processors CPUs and GPUs tackle the processing of tasks differently CPUs excel at serial processing GPUs are better at handling applications that require high floating point calculations and lower power consumption 7 / 66
  • 8. Parallel Languages: MPI (DM) and OpenMP (SM) Heterogeneous They are not valid for programming GPUs Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work New programming models are required... 8 / 66
  • 9. GPGPU Programming Heterogeneous Nowadays Software Stack: Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work 9 / 66
  • 10. CUDA from NVIDIA Heterogeneous Architectures Pros: Performance, Easier accULL: An Early OpenACC than OpenCL Implementation Results Con: Only for NVIDIA Conclusions and hardware Future Work CUDA Code Example 1 __global__ v o i d mmkernel ( f l o a t ∗ a , f l o a t ∗ b , f l o a t ∗ c , i n t n , 2 int m , int p) { 3 i n t i = blockIdx . x ∗32 + threadIdx . x ; 4 i n t j = blockIdx . y ; 5 f l o a t sum = 0 . 0 f ; 6 f o r ( i n t k = 0 ; k < p ; ++k ) sum += b [ i+n∗k ] ∗ c [ k+p∗j ] ; 7 a [ i+n∗j ] = sum ; 8 } 10 / 66
  • 11. GPGPU Programming OpenCL: Open Computing Language Heterogeneous Architectures A framework developed by the Khronos Group accULL: An Early A standard OpenACC Implementation OpenCL programs execute across heterogeneous platforms: Results CPUs + GPUs + other processors Conclusions and Future Work Pros: can be used with any device, it is a standard Cons: more complex than CUDA, inmature 11 / 66
  • 12. GPGPU Programming Common Problems Heterogeneous 1 The programmer needs to know low-level details of the Architectures architecture accULL: An Early OpenACC Implementation Results Conclusions and Future Work 12 / 66
  • 13. GPGPU Programming Heterogeneous Architectures accULL: An Early OpenACC Common Problems Implementation 1 The programmer needs to know low-level details of the Results architecture Conclusions and Future Work 2 Source codes need to be rewritten: One version for CPU A different version for GPU 3 Good performance requires a great effort in parameter tunning 4 CUDA and OpenCL are new and complex for non-experts 13 / 66
  • 14. GPGPU Programming Heterogeneous Architectures accULL: An Early Our Claim: New models and tools are needed if we want OpenACC Implementation to widespread the use of GPUs in HPC Results Conclusions and Future Work Is there anything new in the horizon? hiCUDA PGI accelerator model CAPS HMPP OpenACC 14 / 66
  • 15. GPGPU Programming hiCUDA Heterogeneous Translates each directive into a CUDA call Architectures It is able to use the GPU Shared Memory accULL: An Early OpenACC Implementation Only works with NVIDIA devices Results The programmer still needs to know hardware details Conclusions and Future Work hiCUDA Code Example: 1 ... 2 # pragma h i c u d a g l o b a l a l l o c c [ ∗ ] [ ∗ ] copyin 4 # pragma h i c u d a k e r n e l mxm t b l o c k (N/ 1 6 ,N/ 1 6 ) t h r e a d ( 1 6 , 1 6 ) 5 #pragma h i c u d a loop _partit ion over_tblock over_thread 6 f o r ( i = 0 ; i < N ; i++ ) { 7 #pragma h i c u d a loop _partit ion over_tblock over_thread 8 f o r ( j = 0 ; j < N ; j++) { 9 d o u b l e sum = 0 . 0 ; 10 ... 15 / 66
  • 16. GPGPU Programming PGI accelerator model It is a higher level (directive-based) approach Heterogeneous Architectures Fortran and C are supported accULL: An Early OpenACC Precursor to OpenACC Implementation Results Conclusions and PGI Accelerator Model Code Example: Future Work 1 # pragma a c c d a t a c o p y i n ( b [ 0 : n∗ l ] , c [ 0 :m∗ l ] ) copy ( a [ 0 : n∗m] ) 2 { 3 #pragma a c c r e g i o n 4 { 5 #pragma a c c l o o p independent 6 f o r ( j = 0 ; j < n ; j++) 7 { 8 #pragma a c c l o o p independent 9 f o r ( i = 0 ; i < l ; i++ ) { 10 d o u b l e sum = 0 . 0 ; 11 f o r ( k = 0 ; k < m ; k++ ) { 12 sum += b [ i+k∗l ] ∗ c [ k+j∗m ] ; 13 } 14 a [ i+j∗l ] = sum ; 15 } 16 } 16 / 66
  • 17. GPGPU Programming Heterogeneous Architectures accULL: An Early OpenACC: introduced last November in OpenACC Implementation SuperComputing’2011 Results A directive based language Conclusions and Future Work Aim to be standard Supported by: Cray, NVIDIA, PGI and CAPS A single source code for CPU/GPU Platform independent Easier for beginners 17 / 66
  • 18. GPGPU Programming OpenACC Code Example: Heterogeneous Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work 18 / 66
  • 19. Outline Heterogeneous Architectures accULL: An Early 1 Heterogeneous Architectures OpenACC Implementation Results Conclusions and 2 accULL: An Early OpenACC Implementation Future Work 3 Results 4 Conclusions and Future Work 19 / 66
  • 20. accULL: Our OpenACC implementation Heterogeneous Architectures accULL: An Early accULL is a framework developed to support OpenACC OpenACC Implementation programs Results Conclusions and Future Work 20 / 66
  • 21. accULL: Our OpenACC implementation Heterogeneous Architectures accULL = YaCF + Frangollo accULL: An Early OpenACC Implementation It is a two-layer based implementation: Results Compiler + RunTime Library Conclusions and Future Work 21 / 66
  • 22. YaCF: the compiler Heterogeneous YaCF (Yet Another Compiler Framework) is the compiler Architectures accULL: An Early framework we have developed OpenACC Implementation Some features: Results It is a StS compiler Conclusions and Future Work Written in Python from scratch with an OO approach Receives C99 as input It is able to generate CUDA/OpenCL kernels from an annotated code A driver for compiling OpenACC directives has been added YaCF translates the directives into Frangollo calls A public-domain development 22 / 66
  • 23. Frangollo: the RunTime Heterogeneous Architectures accULL: An Early OpenACC Implementation Frangollo Results It is a RunTime to support the execution over heterogeneous Conclusions and Future Work platforms 1 Encapsulates the hardware issues 2 Is able to run in NVIDIA devices using CUDA 3 Is able to manage a wider range of devices using OpenCL 23 / 66
  • 24. Frangollo: the RunTime Compilation flow Heterogeneous Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work 24 / 66
  • 25. Frangollo: the RunTime Heterogeneous Architectures accULL: An Early OpenACC Implementation Its Responsibilities Results Conclusions and 1 Manages the memory Future Work 2 Initializes the devices 3 Launches the kernels 25 / 66
  • 26. Frangollo: the RunTime Heterogeneous Architectures accULL: An Early OpenACC Implementation Its Responsibilities Results Conclusions and 1 Manages the memory Future Work 2 Initializes the devices 3 Launches the kernels Makes programmers’ life easier! 26 / 66
  • 27. Frangollo: Memory Management A program workflow Heterogeneous Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work 27 / 66
  • 28. Frangollo: Structure Heterogeneous Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work Interface layer: A door to Frangollo Some functions in the C interface: registerVar launchKernel getNumDevices 28 / 66
  • 29. Frangollo: Structure Heterogeneous Architectures accULL: An Early OpenACC Implementation Results Conclusions and Future Work Abstract layer Frangollo uses a class-hierarchy All classes in this layer are abstracts 29 / 66
  • 30. Frangollo: Structure Device layer Heterogeneous Architectures accULL: An Early OpenACC Implementation Encapsulates all target Results language related functions Conclusions and Future Work New platforms could be added in the future 30 / 66
  • 31. Outline Heterogeneous Architectures accULL: An Early 1 Heterogeneous Architectures OpenACC Implementation Results Conclusions and 2 accULL: An Early OpenACC Implementation Future Work 3 Results 4 Conclusions and Future Work 31 / 66
  • 32. Platforms M1: A Desktop computer Heterogeneous Architectures Intel Core i7 930 processor (2.80 GHz) accULL: An Early OpenACC 1MB of L2 cache, 8MB of L3 cache, shared by the four cores Implementation 4 GB RAM Results Conclusions and 2 GPU devices attached: Future Work Tesla C1060 with 3Gb memory (M1a) Tesla C2050 (Fermi) with 4GB memory (M1b) Accelerator platform is CUDA 4.0 M1a/ M1b mimic the scenario of an OpenACC average developer She can purchase a GPU card and plug in it into her desktop computer It features a relatively cheap platform 32 / 66
  • 33. Platforms M2: A cluster node Heterogeneous Architectures M2: 2 quad core Intel Xeon E5410 (2.25GHz) processors accULL: An Early OpenACC Implementation 24 GB memory Results Attached a Fermi C2050 card with 448 multiprocessors and 4 Conclusions and GB memory Future Work Accelerator platform: CUDA 4.0 M2 is a node of a common multinode cluster Nowadays clusters combine multicore processors and GPU devices, so we can take advantage of OpenACC This kind of compute node has higher acquisition and maintenance costs than M1 33 / 66
  • 34. Platforms M3: A second cluster Heterogeneous Architectures M3 is a shared memory system accULL: An Early 4 Intel Xeon E7 4850 CPU OpenACC Implementation 2.50MB L2 cache and 24MB L3 cache (for all its 10 cores) Results Conclusions and 6GB of memory per core Future Work Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU M3 showcases an alternative use of OpenCL There are implementations of OpenCL targeting shared memory systems Using CPU-targeted OpenCL platforms along with OpenACC represents an interesting alternative to OpenMP programming 34 / 66
  • 35. Some of our Experiments. Blocked Matrix Multiplication (M×M). Rodinia Benchmark: the Rodinia Benchmark suite comprises compute-heavy applications and covers a wide range of applications; OpenMP, CUDA and OpenCL versions are available for most of the codes in the suite. From them, we have selected: Needleman-Wunsch (NW), HotSpot (HS), and Speckle Reducing Anisotropic Diffusion (SRAD).
  • 36. Matrix Multiplication. Sketch of M×M in OpenACC (a is L×N, b is L×M, c is M×N, all stored row-major):
        #pragma acc kernels name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N] ...)
        {
          /* Zero the result matrix */
          #pragma acc loop private(i, j) collapse(2)
          for (i = 0; i < L; i++)
            for (j = 0; j < N; j++)
              a[i * N + j] = 0.0;
          /* Iterate over blocks */
          for (ii = 0; ii < L; ii += tile_size)
            for (jj = 0; jj < N; jj += tile_size)
              for (kk = 0; kk < M; kk += tile_size) {
                /* Iterate inside a block */
                #pragma acc loop collapse(2) private(i, j, k)
                for (j = jj; j < min(N, jj + tile_size); j++)
                  for (i = ii; i < min(L, ii + tile_size); i++)
                    for (k = kk; k < min(M, kk + tile_size); k++)
                      a[i * N + j] += b[i * M + k] * c[k * N + j];
              }
        }
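A host-only reference version of the same tiling scheme can be handy for checking the output of an accelerated kernel. The following is a minimal sketch, not code from accULL: the function name `mxm_blocked` and the helper `min_sz` are hypothetical, the matrices are assumed to be row-major (a is L×N, b is L×M, c is M×N), and no OpenACC directives are used, so it runs on the CPU only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of accULL): smaller of two sizes. */
static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Computes a = b * c tile by tile.
 * Assumed layout: a is L x N, b is L x M, c is M x N, all row-major. */
void mxm_blocked(float *a, const float *b, const float *c,
                 size_t L, size_t M, size_t N, size_t tile)
{
    assert(tile > 0);

    /* Zero the result matrix. */
    for (size_t i = 0; i < L * N; i++)
        a[i] = 0.0f;

    /* Iterate over blocks. */
    for (size_t ii = 0; ii < L; ii += tile)
        for (size_t jj = 0; jj < N; jj += tile)
            for (size_t kk = 0; kk < M; kk += tile)
                /* Iterate inside a block; min_sz clips partial tiles. */
                for (size_t i = ii; i < min_sz(L, ii + tile); i++)
                    for (size_t j = jj; j < min_sz(N, jj + tile); j++)
                        for (size_t k = kk; k < min_sz(M, kk + tile); k++)
                            a[i * N + j] += b[i * M + k] * c[k * N + j];
}
```

Calling `mxm_blocked(a, b, c, 2, 3, 2, 2)` on a 2×3 and a 3×2 matrix exercises the partial-tile path, because the inner dimension (3) is not a multiple of the tile size.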
  • 37. Matrix Multiplication. Floating-point performance for M×M on M2 (figure).
  • 38. Matrix Multiplication. Floating-point performance comparison between OpenMP, accULL, PGI and hiCUDA on M1 (figure).
  • 39. Matrix Multiplication. Comparison between the OpenMP-gcc implementation and Frangollo+OpenCL on M3 (shared memory system, 40 cores) (figure).
  • 40. Needleman-Wunsch. Performance comparisons of NW on M1b (figure). accULL performs worse than the native versions.
  • 41. Needleman-Wunsch. Performance comparisons of NW on M3 (shared memory, 40 cores) (figure). The OpenMP versions outperform their OpenCL counterparts.
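For readers unfamiliar with the benchmark, Needleman-Wunsch fills a dynamic-programming table in which each cell depends on its upper, left and upper-left neighbours. The sketch below is a plain sequential illustration under simplifying assumptions, not the Rodinia code: the function name `nw_score` is hypothetical, and a flat match/mismatch score with a linear gap penalty stands in for the substitution matrix a real benchmark would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Returns the optimal global alignment score of s and t.
 * match/mismatch are added per aligned pair; gap (usually negative)
 * is added for every gap position. */
int nw_score(const char *s, const char *t, int match, int mismatch, int gap)
{
    size_t n = strlen(s), m = strlen(t), w = m + 1;
    int *h = malloc((n + 1) * w * sizeof *h);
    assert(h != NULL);

    /* First column and first row: alignments against gaps only. */
    for (size_t i = 0; i <= n; i++) h[i * w] = (int)i * gap;
    for (size_t j = 0; j <= m; j++) h[j] = (int)j * gap;

    /* Each cell depends on its upper-left, upper and left neighbours. */
    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int sub  = (s[i - 1] == t[j - 1]) ? match : mismatch;
            int diag = h[(i - 1) * w + (j - 1)] + sub;
            int up   = h[(i - 1) * w + j] + gap;
            int left = h[i * w + (j - 1)] + gap;
            h[i * w + j] = max3(diag, up, left);
        }
    }

    int score = h[n * w + m];
    free(h);
    return score;
}
```

Because of these dependencies, only cells on the same anti-diagonal are independent of each other, which limits the parallelism available at each step.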
  • 42. HotSpot. Performance comparison of different implementations, showing efficiency over native CUDA code on M1 (figure). In this case, accULL performs similarly to hiCUDA.
  • 43. HotSpot. Speedup comparison with native CUDA code on M1b (Fermi) (figure).
  • 44. HotSpot. Efficiency with respect to Intel-OpenMP on M3 (shared memory, 40 cores) (figure).
  • 45. SRAD. Speedup over the OpenMP implementation on M1b (figure).
  • 46. SRAD. Speedup over the OpenMP implementation on M3 (figure).
  • 47. Outline: 1 Heterogeneous Architectures; 2 accULL: An Early OpenACC Implementation; 3 Results; 4 Conclusions and Future Work.
  • 51. Conclusions I. accULL: the first OpenACC implementation with support for both CUDA and OpenCL. It supports most of the standard. We validated accULL with codes from widely available benchmarks, on both GPUs and CPUs. It meets the requirements of a non-expert developer.
  • 59. Conclusions II. accULL: YaCF can be used as a fast-prototyping tool to explore optimizations. Frangollo can be detached from YaCF and combined with a production-ready compiler. Some issues can be tackled within Frangollo independently of the compiler: memory allocation, kernel scheduling, data splitting, overlapping of computation and communications, and the parallel reduction implementation.
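To illustrate why the reduction strategy is a runtime-level concern, here is a minimal host-side sketch of a pairwise (tree) sum, the access pattern commonly used for reductions on GPUs. It is not Frangollo's implementation, and the function name `tree_reduce_sum` is hypothetical. Each pass of the inner loop touches disjoint pairs, so on a device those additions could be assigned to separate threads.

```c
#include <assert.h>
#include <stddef.h>

/* Sums buf[0..n-1] in place using a pairwise (tree) pattern and returns
 * the total. The contents of buf are overwritten with partial sums. */
double tree_reduce_sum(double *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    if (n == 0)
        return 0.0;

    /* After the pass with a given stride, buf[i] holds the sum of a
     * block of up to 2 * stride original elements starting at i. */
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* Iterations of this loop are independent of each other. */
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            buf[i] += buf[i + stride];
    }
    return buf[0];
}
```

The number of passes grows with log2(n) rather than n, which is the property a parallel backend would exploit; the order of floating-point additions differs from a left-to-right loop, so results may differ in the last bits for general inputs.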
  • 65. Future work. There are plenty of opportunities to improve performance: implementing 2D arrays as cudaMatrix or OCLImages to improve non-contiguous memory access; completing the implementation of the asynchronous calls for better performance; multi-GPU support; exploring different possibilities of integration with MPI; integrating Frangollo with a production-ready compiler; and a new backend for FPGAs.
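As an illustration of what asynchronous calls make possible, the sketch below splits an array into chunks and, when built with an OpenACC compiler, alternates the chunks between two asynchronous queues so that transfers for one chunk may overlap with computation on another. This is a sketch under assumptions rather than accULL code: the function name `scale_chunks` and the two-queue scheme are hypothetical, only the standard `async` clause and `wait` directive are used, and whether any overlap actually happens depends on the compiler and device. Without OpenACC (the `_OPENACC` macro undefined), the directives are skipped and the code runs sequentially on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Multiplies every element of x[0..n-1] by f, one chunk at a time. */
void scale_chunks(float *x, size_t n, size_t chunk, float f)
{
    assert(chunk > 0);

    for (size_t start = 0; start < n; start += chunk) {
        size_t len = (n - start < chunk) ? n - start : chunk;
        float *p = x + start;

#ifdef _OPENACC
        /* Alternate between two queues (an illustrative choice). */
        int queue = (int)((start / chunk) % 2);
#pragma acc parallel loop copy(p[0:len]) async(queue)
#endif
        for (size_t i = 0; i < len; i++)
            p[i] *= f;
    }

#ifdef _OPENACC
    /* Block until all queued transfers and kernels have finished. */
#pragma acc wait
#endif
}
```

The same chunking is also a simple form of the data splitting mentioned in the conclusions: each chunk is an independent unit of work that a runtime could place on a different queue or device.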
  • 66. Thank you for your attention! accULL: An User-directed Approach to Heterogeneous Programming. http://accull.wordpress.com/ This work has been partially supported by the EU (FEDER), the Spanish MEC (contracts TIN2008-06570-C04-03 and TIN2011-24598), HPC-EUROPA2 and the Canary Islands Government, ACIISI. F. de Sande, fsande@ull.es