accULL (HAC Leganés)

accULL: An User-directed Approach to Heterogeneous Programming

Ruymán Reyes, Iván López-Rodríguez, Juan J. Fumero, Francisco de Sande

Dept. E.I.O. y Computación,
Univ. de La Laguna, 38271–La Laguna, Spain

International Workshop on Heterogeneous
Architectures and Computing
Leganés, July 13 2012

Outline

1 Heterogeneous Architectures
2 accULL: An Early OpenACC Implementation
3 Results
4 Conclusions and Future Work

Introduction

The rise of GPUs: impressive results

GPUs

Successfully used for general-purpose computing (GPGPU)

Heterogeneous Architectures

But... it is not easy!

Heterogeneous Architectures

A GPU is not a CPU
GPUs are inherently SIMD processors
CPUs and GPUs tackle the processing of tasks differently
    CPUs excel at serial processing
    GPUs are better suited to applications that need high floating-point throughput at lower power consumption

Parallel Languages: MPI (DM) and OpenMP (SM)

They are not valid for programming GPUs
New programming models are required...

GPGPU Programming

[Figure: Today's GPGPU software stack]

CUDA from NVIDIA

Pros: performance, easier than OpenCL
Con: only for NVIDIA hardware

CUDA Code Example

    __global__ void mmkernel(float *a, float *b, float *c, int n,
                             int m, int p) {
        /* One thread per output element (i, j); blocks are 32 threads wide */
        int i = blockIdx.x * 32 + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0f;
        /* Dot product of row i of b with column j of c (column-major) */
        for (int k = 0; k < p; ++k) sum += b[i + n * k] * c[k + p * j];
        a[i + n * j] = sum;
    }
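
A CPU mirror of the kernel is handy for checking results. The sketch below is not part of the original slides: it assumes the kernel is launched on a grid of (n/32, m) blocks of 32 threads, so that i covers [0, n) and j covers [0, m). The name mm_reference is hypothetical.

```c
/* CPU reference for mmkernel: each (i, j) iteration does the work of one
   CUDA thread. Matrices are column-major, as in the kernel:
   a is n x m, b is n x p, c is p x m. */
void mm_reference(float *a, const float *b, const float *c,
                  int n, int m, int p)
{
    for (int j = 0; j < m; ++j) {        /* plays the role of blockIdx.y */
        for (int i = 0; i < n; ++i) {    /* blockIdx.x * 32 + threadIdx.x */
            float sum = 0.0f;
            for (int k = 0; k < p; ++k)
                sum += b[i + n * k] * c[k + p * j];
            a[i + n * j] = sum;
        }
    }
}
```

Multiplying by an identity matrix c should return b unchanged, which makes a quick sanity check.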

GPGPU Programming

OpenCL: Open Computing Language
A framework developed by the Khronos Group
A standard
OpenCL programs execute across heterogeneous platforms:
    CPUs + GPUs + other processors
Pros: can be used with any device, it is a standard
Cons: more complex than CUDA, immature

GPGPU Programming

Common Problems
1 The programmer needs to know low-level details of the architecture
2 Source codes need to be rewritten:
    One version for CPU
    A different version for GPU
3 Good performance requires a great effort in parameter tuning
4 CUDA and OpenCL are new and complex for non-experts

GPGPU Programming

Our Claim: New models and tools are needed if we want to spread the use of GPUs in HPC

Is there anything new on the horizon?
    hiCUDA
    PGI accelerator model
    CAPS HMPP
    OpenACC

GPGPU Programming

hiCUDA
Translates each directive into a CUDA call
It is able to use the GPU Shared Memory
Only works with NVIDIA devices
The programmer still needs to know hardware details

hiCUDA Code Example:

    ...
    #pragma hicuda global alloc c[*][*] copyin

    #pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
    #pragma hicuda loop_partition over_tblock over_thread
    for (i = 0; i < N; i++) {
    #pragma hicuda loop_partition over_tblock over_thread
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            ...

GPGPU Programming

PGI accelerator model
It is a higher level (directive-based) approach
Fortran and C are supported
Precursor to OpenACC

PGI Accelerator Model Code Example:

    #pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
    {
        #pragma acc region
        {
            #pragma acc loop independent
            for (j = 0; j < n; j++)
            {
                #pragma acc loop independent
                for (i = 0; i < l; i++) {
                    double sum = 0.0;
                    for (k = 0; k < m; k++) {
                        sum += b[i + k * l] * c[k + j * m];
                    }
                    a[i + j * l] = sum;
                }
            }
        }
    }

GPGPU Programming

OpenACC: introduced in November 2011 at SuperComputing 2011
A directive-based language
Aims to become a standard
Supported by: Cray, NVIDIA, PGI and CAPS
A single source code for CPU/GPU
Platform independent
Easier for beginners

GPGPU Programming

[Listing: OpenACC code example]
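
To give a flavour of the directive style, here is a minimal OpenACC sketch written for this transcript; it is not the listing shown on the slide. It uses the standard kernels and loop directives with explicit data clauses. The function name saxpy_acc is hypothetical.

```c
/* y = alpha * x + y.
   With an OpenACC compiler the loop is offloaded to the accelerator;
   a plain C compiler ignores the pragmas and runs the loop serially. */
void saxpy_acc(int n, float alpha, const float *x, float *y)
{
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
#pragma acc loop independent
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }
}
```

The same source serves both CPU and GPU builds, which is the "single source code" point made on the previous slide.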

accULL: Our OpenACC implementation

accULL is a framework developed to support OpenACC programs
accULL = YaCF + Frangollo
It is a two-layer implementation:
    Compiler + RunTime Library

YaCF: the compiler

YaCF (Yet Another Compiler Framework) is the compiler framework we have developed
Some features:
    It is a source-to-source (StS) compiler
    Written in Python from scratch with an OO approach
    Receives C99 as input
    It is able to generate CUDA/OpenCL kernels from annotated code
    A driver for compiling OpenACC directives has been added
    YaCF translates the directives into Frangollo calls
    A public-domain development

Frangollo: the RunTime

Frangollo is a RunTime that supports execution over heterogeneous platforms
1 Encapsulates the hardware issues
2 Is able to run on NVIDIA devices using CUDA
3 Is able to manage a wider range of devices using OpenCL


[Figure: Compilation flow]


Frangollo: Its Responsibilities
1 Manages the memory
2 Initializes the devices
3 Launches the kernels
Makes programmers' lives easier!

Frangollo: Memory Management

[Figure: A program workflow]
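
The memory-management role of a runtime like this can be pictured as keeping a device copy in step with a host variable. The sketch below is an assumption about the general idea, not Frangollo's actual code: the device buffer is simulated with malloc, and the names mirrored_buf, buf_copyin and buf_copyout_and_free are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* A host variable paired with its (simulated) device copy. */
typedef struct {
    void  *host;
    void  *device;   /* stands in for device memory; plain malloc here */
    size_t bytes;
} mirrored_buf;

/* Allocate the device copy and transfer host -> device (copyin). */
int buf_copyin(mirrored_buf *b, void *host, size_t bytes)
{
    b->device = malloc(bytes);
    if (b->device == NULL)
        return -1;
    b->host = host;
    b->bytes = bytes;
    memcpy(b->device, host, bytes);
    return 0;
}

/* Transfer device -> host (copyout) and release the device copy. */
void buf_copyout_and_free(mirrored_buf *b)
{
    memcpy(b->host, b->device, b->bytes);
    free(b->device);
    b->device = NULL;
}
```

Until the copy-out step runs, changes made on the device side are not visible on the host, which is why the runtime has to track when each transfer is due.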

Frangollo: Structure

Interface layer: A door to Frangollo
Some functions in the C interface:
    registerVar
    launchKernel
    getNumDevices
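
The slide gives only the function names, so the following is a host-only mock meant to show how compiler-generated code might call such an interface. Every signature, the kernel_fn type, the MAX_VARS limit and the run_scale wrapper are assumptions made for illustration, not Frangollo's real API.

```c
#include <stddef.h>

#define MAX_VARS 16   /* hypothetical limit for this mock */

/* Hypothetical kernel type: receives the registered variables and a size. */
typedef void (*kernel_fn)(void **args, int n);

typedef struct {
    void  *host;
    size_t bytes;
} var_entry;

static var_entry g_vars[MAX_VARS];
static int g_num_vars = 0;

/* Mock: report a single device (the host itself). */
int getNumDevices(void) { return 1; }

/* Mock: record a host variable and return its handle, or -1 if full. */
int registerVar(void *host, size_t bytes)
{
    if (g_num_vars >= MAX_VARS)
        return -1;
    g_vars[g_num_vars].host = host;
    g_vars[g_num_vars].bytes = bytes;
    return g_num_vars++;
}

/* Mock: resolve handles to pointers and run the kernel on the host. */
int launchKernel(kernel_fn kernel, const int *var_ids, int nvars, int n)
{
    void *args[MAX_VARS];
    if (nvars < 0 || nvars > MAX_VARS)
        return -1;
    for (int v = 0; v < nvars; ++v) {
        if (var_ids[v] < 0 || var_ids[v] >= g_num_vars)
            return -1;
        args[v] = g_vars[var_ids[v]].host;
    }
    kernel(args, n);
    return 0;
}

/* What a translator might emit for a loop that doubles every element. */
static void scale_kernel(void **args, int n)
{
    float *x = (float *)args[0];
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;
}

int run_scale(float *x, int n)
{
    int id = registerVar(x, (size_t)n * sizeof *x);
    if (id < 0)
        return -1;
    return launchKernel(scale_kernel, &id, 1, n);
}
```

The point of the design is that the translated program only sees handles and a launch call, so the runtime behind the interface is free to pick CUDA or OpenCL.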


Abstract layer
Frangollo uses a class hierarchy
All classes in this layer are abstract


Device layer
Encapsulates all target-language-related functions
New platforms could be added in the future
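
The split between an abstract layer and per-platform device code can be imitated in C with a table of function pointers. This is a sketch of the pattern only, under the assumption that each backend exposes the same small set of operations; device_backend, find_backend and the mock functions are hypothetical and do not call CUDA or OpenCL.

```c
#include <stddef.h>
#include <string.h>

/* Abstract view of a device: every backend fills in the same operations. */
typedef struct {
    const char *name;
    int (*init)(void);
    int (*launch)(float *data, int n);
} device_backend;

/* Mock operations; a real backend would wrap CUDA or OpenCL calls here. */
static int mock_init(void) { return 0; }

static int mock_launch(float *data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] += 1.0f;
    return 0;
}

/* Adding a platform means adding one row to this table. */
static const device_backend g_backends[] = {
    { "cuda",   mock_init, mock_launch },
    { "opencl", mock_init, mock_launch },
};

/* Look a backend up by name; returns NULL if it is not registered. */
const device_backend *find_backend(const char *name)
{
    size_t count = sizeof g_backends / sizeof g_backends[0];
    for (size_t i = 0; i < count; ++i)
        if (strcmp(g_backends[i].name, name) == 0)
            return &g_backends[i];
    return NULL;
}
```

Code above the table never names a platform directly, which is what lets new platforms be added later without touching the callers.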

Platforms

M1: A Desktop computer
    Intel Core i7 930 processor (2.80 GHz)
    1 MB of L2 cache, 8 MB of L3 cache, shared by the four cores
    4 GB RAM
    2 GPU devices attached:
        Tesla C1060 with 3 GB memory (M1a)
        Tesla C2050 (Fermi) with 4 GB memory (M1b)
    Accelerator platform is CUDA 4.0

M1a/M1b mimic the scenario of an average OpenACC developer
    She can purchase a GPU card and plug it into her desktop computer
    It is a relatively cheap platform

Platforms

M2: A cluster node
    2 quad-core Intel Xeon E5410 (2.25 GHz) processors
    24 GB memory
    A Fermi C2050 card attached, with 448 CUDA cores and 4 GB memory
    Accelerator platform: CUDA 4.0

M2 is a node of a common multinode cluster
    Nowadays clusters combine multicore processors and GPU devices, so we can take advantage of OpenACC
    This kind of compute node has higher acquisition and maintenance costs than M1

Platforms

M3: A second cluster
    M3 is a shared memory system
    4 Intel Xeon E7 4850 CPUs
    2.50 MB L2 cache and 24 MB L3 cache (for all its 10 cores)
    6 GB of memory per core
    Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU

M3 showcases an alternative use of OpenCL
    There are implementations of OpenCL targeting shared memory systems
    Using CPU-targeted OpenCL platforms along with OpenACC represents an interesting alternative to OpenMP programming

Some of our Experiments

Blocked Matrix Multiplication (M×M)

Rodinia Benchmark
    The Rodinia Benchmark suite comprises compute-heavy applications
    It covers a wide range of applications
    OpenMP, CUDA and OpenCL versions are available for most of the codes in the suite
    From them, we have selected:
        Needleman-Wunsch (NW)
        HotSpot (HS)
        Speckle Reducing Anisotropic Diffusion (SRAD)
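
For readers unfamiliar with Needleman-Wunsch, the core of the benchmark is a dynamic-programming recurrence over two sequences. The sketch below is a simplified serial version written for this transcript, with a flat match/mismatch/gap scoring scheme assumed for brevity; the Rodinia code itself is organised differently. The name nw_score is hypothetical.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Optimal global alignment score of s and t.
   Returns INT_MIN if memory allocation fails. */
int nw_score(const char *s, const char *t, int match, int mismatch, int gap)
{
    size_t n = strlen(s), m = strlen(t);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *cur = malloc((m + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) {
        free(prev);
        free(cur);
        return INT_MIN;
    }

    /* First row: aligning a prefix of t against nothing costs only gaps. */
    for (size_t j = 0; j <= m; ++j)
        prev[j] = (int)j * gap;

    for (size_t i = 1; i <= n; ++i) {
        cur[0] = (int)i * gap;
        for (size_t j = 1; j <= m; ++j) {
            int diag = prev[j - 1] +
                       (s[i - 1] == t[j - 1] ? match : mismatch);
            cur[j] = max3(diag, prev[j] + gap, cur[j - 1] + gap);
        }
        /* Only two rows are kept; swap them for the next iteration. */
        int *tmp = prev;
        prev = cur;
        cur = tmp;
    }

    int score = prev[m];
    free(prev);
    free(cur);
    return score;
}
```

Each cell depends on its left, upper and upper-left neighbours, so only cells on the same anti-diagonal can be computed in parallel, which makes the code a demanding test for a directive-based compiler.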

Matrix Multiplication

Sketch of M×M in OpenACC

    /* Row-major layout: a is L x N, b is L x M, c is M x N */
    #pragma acc kernels name("mxm") copy(a[L*N]) \
                        copyin(b[L*M], c[M*N] ...)
    {
        #pragma acc loop private(i, j) collapse(2)
        for (i = 0; i < L; i++)
            for (j = 0; j < N; j++)
                a[i * N + j] = 0.0;
        /* Iterate over blocks */
        for (ii = 0; ii < L; ii += tile_size)
            for (jj = 0; jj < N; jj += tile_size)
                for (kk = 0; kk < M; kk += tile_size) {
                    /* Iterate inside a block */
                    #pragma acc loop collapse(2) private(i, j, k)
                    for (j = jj; j < min(N, jj + tile_size); j++)
                        for (i = ii; i < min(L, ii + tile_size); i++)
                            for (k = kk; k < min(M, kk + tile_size); k++)
                                a[i * N + j] += (b[i * M + k] * c[k * N + j]);
                }
    }
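
The tiling scheme can be tried out on the CPU before any directives are added. The following is a plain C sketch, not the authors' benchmark code: it assumes row-major storage with a of size L×N, b of size L×M and c of size M×N, and the names matmul_blocked, matmul_naive and min_int are hypothetical.

```c
static int min_int(int a, int b) { return a < b ? a : b; }

/* Straightforward triple loop, used as the reference result. */
void matmul_naive(double *a, const double *b, const double *c,
                  int L, int M, int N)
{
    for (int i = 0; i < L; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < M; ++k)
                sum += b[i * M + k] * c[k * N + j];
            a[i * N + j] = sum;
        }
}

/* Same product computed tile by tile to improve data reuse. */
void matmul_blocked(double *a, const double *b, const double *c,
                    int L, int M, int N, int tile)
{
    if (tile <= 0)
        tile = 1;   /* guard against a non-positive tile size */

    for (int idx = 0; idx < L * N; ++idx)
        a[idx] = 0.0;

    /* Iterate over blocks */
    for (int ii = 0; ii < L; ii += tile)
        for (int jj = 0; jj < N; jj += tile)
            for (int kk = 0; kk < M; kk += tile)
                /* Iterate inside a block, clamping at the matrix edges */
                for (int i = ii; i < min_int(L, ii + tile); ++i)
                    for (int j = jj; j < min_int(N, jj + tile); ++j)
                        for (int k = kk; k < min_int(M, kk + tile); ++k)
                            a[i * N + j] += b[i * M + k] * c[k * N + j];
}
```

Using non-square sizes in a test, for example L=3, M=2, N=4 with a tile of 2, exercises both the edge clamping and the row strides.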


[Figure: Floating point performance for M×M in M2]

[Figure: Floating point performance comparison between OpenMP, accULL, PGI and hiCUDA in M1]

[Figure: Comparison between OpenMP-gcc implementation and Frangollo+OpenCL in M3 (SM system, 40 cores)]

Needleman-Wunsch

[Figure: Performance comparisons of NW in M1b]
accULL performs worse than native versions

Needleman-Wunsch

[Figure: Performance comparisons of NW in M3 (SM, 40 cores)]
The OpenMP versions outperform their OpenCL counterparts

HotSpot

[Figure: Performance comparison of different implementations, showing efficiency over native CUDA code in M1]
In this case, accULL performs similarly to hiCUDA

HotSpot

[Figure: Speed-up comparison with native CUDA code in M1b (Fermi)]

HotSpot

[Figure: Efficiency w.r.t. Intel-OpenMP in M3 (SM, 40 cores)]

SRAD

[Figure: Speedup over the OpenMP implementation in M1b]

SRAD

[Figure: Speedup over the OpenMP implementation in M3]

Conclusions I

accULL
    First OpenACC implementation with support for both CUDA and OpenCL
    It supports most of the standard
    We validated accULL on GPUs and CPUs with codes from widely available benchmarks
    It meets the requirements of a non-expert developer

Conclusions II

accULL
    YaCF can be used as a fast-prototyping tool to explore optimizations
    Frangollo can be detached from YaCF and combined with a production-ready compiler
    Some issues can be tackled within Frangollo independently from the compiler:
        Memory allocation
        Kernel scheduling
        Data splitting
        Overlapping of computation and communications
        Parallel reduction implementation
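
Of the issues listed, parallel reduction is the easiest to picture in code. The sketch below emulates the usual pairwise tree pattern serially in C; it is an illustration of the technique, not Frangollo's implementation, and the name tree_reduce_sum is hypothetical. On a GPU, the iterations of the inner loop would run as concurrent threads with a barrier between strides.

```c
/* Sum n floats with a pairwise tree reduction.
   The array is overwritten with partial sums; the total ends up in data[0].
   Returns 0 for an empty array. */
float tree_reduce_sum(float *data, int n)
{
    if (n <= 0)
        return 0.0f;

    /* After the pass with a given stride, data[i] holds the sum of
       2 * stride consecutive elements for every i that is a multiple
       of 2 * stride. */
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            data[i] += data[i + stride];

    return data[0];
}
```

The number of passes grows with the logarithm of n, which is why the pattern suits hardware with many threads.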

Future work

There are plenty of opportunities to improve performance
    To implement 2D arrays as cudaMatrix or OCLImages to improve non-contiguous memory access
    To complete the implementation of the asynchronous calls for better performance
    Multi-GPU support
    To explore different possibilities of integration with MPI
    Integration of Frangollo with a production-ready compiler
    New backend for FPGAs
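
Multi-GPU support would need some way of dividing an array among devices. One simple possibility, shown here as an assumption rather than a description of the planned design, is an even block split in which the first few devices take one extra element. The names chunk and split_range are hypothetical.

```c
#include <stddef.h>

/* A contiguous slice of an array: where it starts and how long it is. */
typedef struct {
    size_t offset;
    size_t count;
} chunk;

/* Slice number idx (0-based) when n elements are divided among parts
   devices as evenly as possible. Returns an empty chunk if parts is 0
   or idx is out of range. */
chunk split_range(size_t n, size_t parts, size_t idx)
{
    chunk c = { 0, 0 };
    if (parts == 0 || idx >= parts)
        return c;

    size_t base = n / parts;    /* every device gets at least this many */
    size_t extra = n % parts;   /* the first 'extra' devices get one more */

    c.count = base + (idx < extra ? 1 : 0);
    c.offset = idx * base + (idx < extra ? idx : extra);
    return c;
}
```

For 10 elements on 3 devices this yields slices of 4, 3 and 3 elements starting at offsets 0, 4 and 7.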

Thank you for your attention!

accULL: An User-directed Approach to Heterogeneous Programming

http://accull.wordpress.com/

This work has been partially supported by the EU (FEDER), the Spanish MEC (contracts TIN2008-06570-C04-03 and TIN2011-24598), HPC-EUROPA2 and the Canary Islands Government, ACIISI

F. de Sande
fsande@ull.es
