accULL (HAC Leganés)
1. accULL: An User-directed Approach to Heterogeneous Programming
Ruymán Reyes, Iván López-Rodríguez, Juan J. Fumero, Francisco de Sande
Dept. E.I.O. y Computación, Univ. de La Laguna, 38271 La Laguna, Spain
International Workshop on Heterogeneous Architectures and Computing
Leganés, July 13 2012
2. Outline
1 Heterogeneous Architectures
2 accULL: An Early OpenACC Implementation
3 Results
4 Conclusions and Future Work
4. Introduction
The arrival of GPUs: impressive results
5. GPUs
Successfully used for general-purpose computing (GPGPU)
7. Heterogeneous Architectures
A GPU is not a CPU
GPUs are inherently SIMD processors
CPUs and GPUs tackle the processing of tasks differently
CPUs excel at serial processing
GPUs are better at handling applications that require high floating-point throughput at lower power consumption
8. Parallel Languages: MPI (distributed memory, DM) and OpenMP (shared memory, SM)
They are not suitable for programming GPUs
New programming models are required...
9. GPGPU Programming
Current software stack (figure)
10. CUDA from NVIDIA
Pros: Performance, easier than OpenCL
Con: Only for NVIDIA hardware
CUDA Code Example
/* One thread computes one element of a; blocks are 32 threads wide in x */
__global__ void mmkernel(float *a, float *b, float *c, int n,
                         int m, int p) {
    int i = blockIdx.x * 32 + threadIdx.x;
    int j = blockIdx.y;
    float sum = 0.0f;
    /* Dot product of row i of b with column j of c */
    for (int k = 0; k < p; ++k) sum += b[i + n*k] * c[k + p*j];
    a[i + n*j] = sum;
}
11. GPGPU Programming
OpenCL: Open Computing Language
A framework developed by the Khronos Group
A standard
OpenCL programs execute across heterogeneous platforms:
CPUs + GPUs + other processors
Pros: can be used with any device, it is a standard
Cons: more complex than CUDA, immature
13. GPGPU Programming
Common Problems
1 The programmer needs to know low-level details of the architecture
2 Source code needs to be rewritten:
One version for CPU
A different version for GPU
3 Good performance requires great effort in parameter tuning
4 CUDA and OpenCL are new and complex for non-experts
14. GPGPU Programming
Our Claim: New models and tools are needed if we want to spread the use of GPUs in HPC
Is there anything new on the horizon?
hiCUDA
PGI accelerator model
CAPS HMPP
OpenACC
15. GPGPU Programming
hiCUDA
Translates each directive into a CUDA call
It is able to use the GPU Shared Memory
Only works with NVIDIA devices
The programmer still needs to know hardware details
hiCUDA Code Example:
...
#pragma hicuda global alloc c[*][*] copyin
#pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
#pragma hicuda loop_partition over_tblock over_thread
for (i = 0; i < N; i++) {
    #pragma hicuda loop_partition over_tblock over_thread
    for (j = 0; j < N; j++) {
        double sum = 0.0;
        ...
16. GPGPU Programming
PGI accelerator model
It is a higher level (directive-based) approach
Fortran and C are supported
Precursor to OpenACC
PGI Accelerator Model Code Example:
#pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
{
    #pragma acc region
    {
        #pragma acc loop independent
        for (j = 0; j < n; j++)
        {
            #pragma acc loop independent
            for (i = 0; i < l; i++) {
                double sum = 0.0;
                for (k = 0; k < m; k++) {
                    sum += b[i+k*l] * c[k+j*m];
                }
                a[i+j*l] = sum;
            }
        }
    }
}
17. GPGPU Programming
OpenACC: introduced in November 2011 at SuperComputing'2011
A directive-based language
Aims to be a standard
Supported by: Cray, NVIDIA, PGI and CAPS
A single source code for CPU/GPU
Platform independent
Easier for beginners
18. GPGPU Programming
OpenACC Code Example:
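A minimal sketch of an OpenACC-annotated matrix product in C, in the spirit of the CUDA and PGI examples on the earlier slides. The function name `matmul_acc`, its parameters and the row-major layout are assumptions made for illustration and are not taken from the talk. The directives use OpenACC 1.0 syntax (`kernels`, `loop independent`, and data clauses with `[start:length]` array sections).

```c
#include <assert.h>

/* Hypothetical helper (not from the slides): computes a = b * c, where
 * a is n x m, b is n x p and c is p x m, all stored row-major. */
void matmul_acc(float *a, const float *b, const float *c,
                int n, int m, int p)
{
    assert(n > 0 && m > 0 && p > 0);

    /* The data clauses name the array sections to move: b and c are
     * copied to the accelerator, a is copied back to the host. */
    #pragma acc kernels copyin(b[0:n*p], c[0:p*m]) copyout(a[0:n*m])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; i++) {
            #pragma acc loop independent
            for (int j = 0; j < m; j++) {
                float sum = 0.0f;
                /* Dot product of row i of b with column j of c */
                for (int k = 0; k < p; k++)
                    sum += b[i * p + k] * c[k * m + j];
                a[i * m + j] = sum;
            }
        }
    }
}
```

A compiler without OpenACC support typically ignores the unknown pragmas and runs the loops sequentially on the CPU, so the same source serves both targets. Calling `matmul_acc(a, b, c, 2, 2, 3)` on a 2 x 3 and a 3 x 2 matrix fills the 2 x 2 result `a`.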
20. accULL: Our OpenACC implementation
accULL is a framework developed to support OpenACC programs
21. accULL: Our OpenACC implementation
accULL = YaCF + Frangollo
It is a two-layer implementation:
Compiler + RunTime Library
22. YaCF: the compiler
YaCF (Yet Another Compiler Framework) is the compiler framework we have developed
Some features:
It is a source-to-source (StS) compiler
Written in Python from scratch with an OO approach
Receives C99 as input
It is able to generate CUDA/OpenCL kernels from annotated code
A driver for compiling OpenACC directives has been added
YaCF translates the directives into Frangollo calls
A public-domain development
23. Frangollo: the RunTime
Frangollo
It is a RunTime that supports execution over heterogeneous platforms
1 Encapsulates the hardware issues
2 Is able to run on NVIDIA devices using CUDA
3 Is able to manage a wider range of devices using OpenCL
24. Frangollo: the RunTime
Compilation flow
26. Frangollo: the RunTime
Its Responsibilities
1 Manages the memory
2 Initializes the devices
3 Launches the kernels
Makes programmers' life easier!
27. Frangollo: Memory Management
A program workflow
28. Frangollo: Structure
Interface layer: A door to Frangollo
Some functions in the C interface:
registerVar
launchKernel
getNumDevices
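To make the role of such an interface layer concrete, here is a toy, host-only model in C. It only borrows the three function names listed above. Every signature, the `toy_` prefix, the handle scheme and the use of `malloc` as a stand-in for device memory are assumptions for illustration, and none of it is Frangollo's actual API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: names echo the slide, signatures are assumptions. */
#define TOY_MAX_VARS 16

typedef struct {
    void  *host;    /* buffer owned by the application */
    void  *device;  /* stand-in for device memory (plain malloc here) */
    size_t bytes;
} toy_var;

static toy_var toy_vars[TOY_MAX_VARS];
static int toy_num_vars = 0;

/* A real runtime would query CUDA or OpenCL; the toy has one fake device. */
int toy_getNumDevices(void) { return 1; }

/* Record a host buffer and mirror it in "device" memory.
 * Returns a handle >= 0, or -1 on failure. */
int toy_registerVar(void *host, size_t bytes)
{
    assert(host != NULL && bytes > 0);
    if (toy_num_vars == TOY_MAX_VARS) return -1;
    void *dev = malloc(bytes);
    if (dev == NULL) return -1;
    memcpy(dev, host, bytes);               /* host -> device transfer */
    toy_vars[toy_num_vars] = (toy_var){ host, dev, bytes };
    return toy_num_vars++;
}

typedef void (*toy_kernel)(void *device_data, size_t bytes);

/* Run the kernel on the device copy, then copy the result back. */
int toy_launchKernel(toy_kernel k, int handle)
{
    if (handle < 0 || handle >= toy_num_vars) return -1;
    toy_var *v = &toy_vars[handle];
    k(v->device, v->bytes);
    memcpy(v->host, v->device, v->bytes);   /* device -> host transfer */
    return 0;
}

/* Free every device mirror; all handles become invalid afterwards. */
void toy_releaseAll(void)
{
    for (int i = 0; i < toy_num_vars; i++) free(toy_vars[i].device);
    toy_num_vars = 0;
}

/* Example kernel: double every float in the buffer. */
void toy_scale2(void *data, size_t bytes)
{
    float *x = data;
    for (size_t i = 0; i < bytes / sizeof(float); i++) x[i] *= 2.0f;
}
```

In this model, compiler-generated code would register each array named in a data clause once, launch kernels against the returned handles, and never touch device pointers directly. That separation is the point of the sketch: the application sees handles, and the runtime owns the transfers.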
30. Frangollo: Structure
Device layer
Encapsulates all target language related functions
New platforms could be added in the future
32. Platforms
M1: A Desktop computer
Intel Core i7 930 processor (2.80 GHz)
1 MB of L2 cache, 8 MB of L3 cache, shared by the four cores
4 GB RAM
2 GPU devices attached:
Tesla C1060 with 3 GB memory (M1a)
Tesla C2050 (Fermi) with 4 GB memory (M1b)
Accelerator platform is CUDA 4.0
M1a/M1b mimic the scenario of an average OpenACC developer
She can purchase a GPU card and plug it into her desktop computer
It is a relatively cheap platform
33. Platforms
M2: A cluster node
M2: 2 quad-core Intel Xeon E5410 (2.25 GHz) processors
24 GB memory
Attached: a Fermi C2050 card with 448 CUDA cores and 4 GB memory
Accelerator platform: CUDA 4.0
M2 is a node of a common multinode cluster
Today's clusters combine multicore processors and GPU devices, so we can take advantage of OpenACC
This kind of compute node has higher acquisition and maintenance costs than M1
34. Platforms
M3: A second cluster
M3 is a shared memory system
4 Intel Xeon E7 4850 CPUs
2.50 MB L2 cache and 24 MB L3 cache (for all its 10 cores)
6 GB of memory per core
Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU
M3 showcases an alternative use of OpenCL
There are implementations of OpenCL targeting shared memory systems
Using CPU-targeted OpenCL platforms along with OpenACC represents an interesting alternative to OpenMP programming
35. Some of our Experiments
Blocked Matrix Multiplication (M×M)
Rodinia Benchmark
The Rodinia Benchmark suite comprises compute-heavy applications
It covers a wide range of applications
OpenMP, CUDA and OpenCL versions are available for most of the codes in the suite
From them, we have selected:
Needleman-Wunsch (NW)
HotSpot (HS)
Speckle Reducing Anisotropic Diffusion (SRAD)
36. Matrix Multiplication
Sketch of M×M in OpenACC
/* Row-major storage: a is L x N, b is L x M, c is M x N */
#pragma acc kernels name("mxm") copy(a[L*N]) \
                    copyin(b[L*M], c[M*N] ...)
{
    /* Zero the result matrix */
    #pragma acc loop private(i, j) collapse(2)
    for (i = 0; i < L; i++)
        for (j = 0; j < N; j++)
            a[i*N + j] = 0.0;
    /* Iterate over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterate inside a block */
                #pragma acc loop collapse(2) private(i, j, k)
                for (j = jj; j < min(N, jj+tile_size); j++)
                    for (i = ii; i < min(L, ii+tile_size); i++)
                        for (k = kk; k < min(M, kk+tile_size); k++)
                            a[i*N + j] += (b[i*M + k] * c[k*N + j]);
            }
}
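The tiling scheme in the sketch can be checked on the host without any accelerator. Below is a plain C version of the same loop structure plus a reference triple loop to compare against. The function names, the `min_int` helper and the assumption of row-major storage (a is L x N, b is L x M, c is M x N) are illustrative choices, not part of accULL.

```c
#include <assert.h>

static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled product a = b * c on the host (no directives).
 * Row-major storage: a is L x N, b is L x M, c is M x N. */
void blocked_matmul(double *a, const double *b, const double *c,
                    int L, int M, int N, int tile_size)
{
    assert(tile_size > 0);

    /* Zero the result matrix */
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++)
            a[i * N + j] = 0.0;

    /* Iterate over blocks; min_int() clips the last, partial tile */
    for (int ii = 0; ii < L; ii += tile_size)
        for (int jj = 0; jj < N; jj += tile_size)
            for (int kk = 0; kk < M; kk += tile_size)
                /* Iterate inside a block */
                for (int i = ii; i < min_int(L, ii + tile_size); i++)
                    for (int j = jj; j < min_int(N, jj + tile_size); j++)
                        for (int k = kk; k < min_int(M, kk + tile_size); k++)
                            a[i * N + j] += b[i * M + k] * c[k * N + j];
}

/* Reference triple loop used to check the tiled version. */
void naive_matmul(double *a, const double *b, const double *c,
                  int L, int M, int N)
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < M; k++)
                sum += b[i * M + k] * c[k * N + j];
            a[i * N + j] = sum;
        }
}
```

Comparing the two on a non-square case whose dimensions are not multiples of `tile_size` (for example L = 3, M = 4, N = 2 with tiles of 2) exercises both the index arithmetic and the clipping of partial tiles.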
37. Matrix Multiplication
Floating point performance for M×M in M2
38. Matrix Multiplication
Floating point performance comparison between OpenMP,
accULL, PGI and hiCUDA in M1
39. Matrix Multiplication
Comparison between the OpenMP-gcc implementation and Frangollo+OpenCL in M3 (SM system, 40 cores)
40. Needleman-Wunsch
Performance comparisons of NW in M1b
accULL performs worse than native versions
41. Needleman-Wunsch
Performance comparisons of NW in M3 (SM, 40 cores)
The OpenMP versions outperform their OpenCL counterparts
42. HotSpot
Performance comparison of different implementations
showing efficiency over native CUDA code in M1
In this case, accULL performs similarly to hiCUDA
43. HotSpot
Speed-Up comparison with native CUDA code in M1b (Fermi)
44. HotSpot
Efficiency w.r.t. Intel-OpenMP in M3 (SM, 40 cores)
45. SRAD
Speedup over the OpenMP implementation in M1b
46. SRAD
Speedup over the OpenMP implementation in M3
51. Conclusions I
accULL
First OpenACC implementation with support for both CUDA and OpenCL
It supports most of the standard
We validate accULL using codes from widely available benchmarks using GPUs and CPUs
It meets the requirements of a non-expert developer
59. Conclusions II
accULL
YaCF can be used as a fast-prototyping tool to explore optimizations
Frangollo can be detached from YaCF and combined with a production-ready compiler
Some issues can be tackled within Frangollo independently from the compiler:
Memory allocation
Kernel scheduling
Data splitting
Overlapping of computation and communications
Parallel reduction implementation
65. Future work
There are plenty of opportunities to improve performance
To implement 2D arrays as cudaMatrix or OCLImages to improve non-contiguous memory access
To complete the implementation of the asynchronous calls for better performance
Multi-GPU support
To explore different possibilities of integration with MPI
Integration of Frangollo with a production-ready compiler
New backend for FPGAs
66. Thank you for your attention!
accULL: An User-directed Approach to Heterogeneous Programming
http://accull.wordpress.com/
This work has been partially supported by the EU (FEDER), the Spanish MEC (contracts TIN2008-06570-C04-03 and TIN2011-24598), HPC-EUROPA2 and the Canary Islands Government, ACIISI
F. de Sande
fsande@ull.es