1. A Source-to-Source Approach to HPC Challenges
LLNL-PRES-668361
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
2. Agenda
▪ Background
• HPC challenges
• Source-to-source compilers
▪ Source-to-source compilers to address HPC challenges
• Programming models for heterogeneous computing
• Performance via autotuning
• Resilience via source-level redundancy
▪ Conclusion
▪ Future interests
3. HPC Challenges
▪ HPC: High Performance Computing
• DOE applications: climate modeling, combustion, fusion energy, biology, material science, etc.
• Network-connected (clustered) computation nodes
• Increasing node complexity
▪ Challenges
• Parallelism, performance, productivity, portability
• Heterogeneity, power efficiency, resilience
J. A. Ang, et al. Abstract machine models and proxy architectures for exascale (10^18 FLOPS) computing. Co-HPC '14
4. Lawrence Livermore National Laboratory 4
Compilers: traditional vs. source-to-
source
Source
code
Frontend
(parsing)
Middle End
(analysis&
transformation&
optimizations)
Backend
(code
generation)
Machine
code
Source
code
Frontend
(parsing)
Middle End
(analysis&
transformation&
optimizations)
Backend
(unparsing)
Source
code
5. Why Source-to-Source?
Traditional compilers (GCC, ICC, LLVM, ...) vs. source-to-source compilers (ROSE, Cetus, ...):
• Goal — Traditional: use compiler technology to generate fast and efficient executables for a target platform. Source-to-source: make compiler technology accessible by enabling users to build customized source code analysis and transformation compilers/tools.
• Users — Traditional: programmers. Source-to-source: researchers and developers in the compiler, tools, and application communities.
• Input — Both: programs in multiple languages.
• Internal representation — Traditional: multiple levels of intermediate representation, losing source code structure and information. Source-to-source: abstract syntax tree, keeping all source-level structure and information (comments, preprocessing info, etc.), extensible.
• Analyses & optimizations — Traditional: internal use only, a black box to users. Source-to-source: individually exposed to users, composable and customizable.
• Output — Traditional: machine executables, hard to read. Source-to-source: human-readable, compilable source code.
• Portability — Traditional: multiple code generators for multiple platforms. Source-to-source: a single code generator for all platforms, leveraging traditional backend compilers.
6. ROSE Compiler Infrastructure
(diagram) C/C++/Fortran and OpenMP/UPC source code → EDG front-end / Open Fortran Parser → IR (AST) → Unparser → analyzed/transformed source → backend compiler → executable. ROSE-based source-to-source tools build on this infrastructure and provide analyses such as control flow, data dependence, and system dependence graphs.
2009 R&D 100 Award winner. www.roseCompiler.org
7. ROSE IR = AST + symbol tables + interface
Example input (simplified AST shown without symbol tables):
int main()
{
  int a[10];
  for (int i=0; i<10; i++)
    a[i] = i*i;
  return 0;
}
ROSE AST (abstract syntax tree):
• High level: preserves all source-level details
• Easy to use: rich set of interface functions for analysis and transformation
• Extensible: user-defined persistent attributes
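For illustration, a minimal ROSE-based "identity translator" built on these interfaces (a sketch based on ROSE's documented frontend/backend entry points; details can vary across ROSE versions):

// identityTranslator.cpp: parse input, verify the AST, unparse it back to source
#include "rose.h"

int main(int argc, char* argv[])
{
  SgProject* project = frontend(argc, argv); // build the AST from the source files
  AstTests::runAllTests(project);            // sanity-check AST consistency
  return backend(project);                   // unparse to source and invoke the backend compiler
}

Custom tools typically insert analysis or transformation passes over the AST between frontend() and backend().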
8. Using ROSE to help address HPC challenges
▪ Parallel programming model research
• MPI, OpenMP, OpenACC, UPC, Co-Array Fortran, etc.
• Domain-specific languages, etc.
▪ Program optimization/translation tools
• Automatic parallelization to migrate legacy codes
• Outlining and parameterized code generation to support autotuning
• Resilience via code redundancy
• Application skeleton extraction for co-design
• Reverse computation for discrete event simulation
• Bug seeding for testing
▪ Program analysis tools
• Programming style and coding standard enforcement
• Error checking for serial, MPI, and OpenMP programs
• Worst-case execution time analysis
9. Case 1: high-level programming models for heterogeneous computing
▪ Problem
• Heterogeneity challenge of exascale
• Naïve approach: CPU code + accelerator code + data transfers in between
▪ Approach
• Develop high-level, directive-based programming models
– Automatically generate low-level accelerator code
• Prototype and extend the OpenMP 4.0 accelerator model
– Result: HOMP (Heterogeneous OpenMP) compiler
J. A. Ang, et al. Abstract machine models and proxy architectures for exascale (10^18 FLOPS) computing. Co-HPC '14
10. Programming models bridge algorithms and machines and are implemented through components of the software stack
Measures of success:
• Expressiveness
• Performance
• Programmability
• Portability
• Efficiency
• ...
(diagram) An application expresses an algorithm against a programming model's abstract machine; the software stack (language, compiler, library, ...) compiles/links it into an executable that runs on the real machine.
11. Parallel programming models are built on top of sequential ones and use a combination of language/compiler/library support
Abstract machines (overly simplified): a single CPU with memory (sequential); multiple CPUs sharing memory (shared memory); multiple CPU+memory nodes connected by an interconnect (distributed memory).
Software stacks:
• Sequential: general-purpose languages (GPL) such as C/C++/Fortran; sequential compiler; optional sequential libraries
• Shared memory (e.g. OpenMP): GPL + directives; sequential compiler + OpenMP support; OpenMP runtime library
• Distributed memory (e.g. MPI): GPL + calls to MPI libraries; sequential compiler; MPI library
12. OpenMP Code
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
int main(void)
{
#pragma omp parallel
  printf("Hello, world!\n");
  return 0;
}
...
// PI calculation
#pragma omp parallel for reduction(+:sum) private(x)
for (i=1; i<=num_steps; i++)
{
  x = (i-0.5)*step;
  sum = sum + 4.0/(1.0+x*x);
}
pi = step*sum;
...
(figure) Programmer → compiler → runtime library → OS threads
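A typical build-and-run sequence for code like this (assuming GCC; the flag name varies by compiler):
  gcc -fopenmp hello.c -o hello
  OMP_NUM_THREADS=4 ./hello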
13. OpenMP 4.0's accelerator model
▪ Device: a logical execution engine
• Host device: where the OpenMP program begins; exactly one
• Target devices: one or more accelerators
▪ Memory model
• Host data environment: one
• Device data environments: one or more
• Shared host and device memory (shared storage) is allowed
▪ Execution model: host-centric
• Host device "offloads" code regions and data to target devices
• Target devices: fork-join model
• Host waits until devices finish
• Host executes device regions if no accelerators are available/supported
(figure) OpenMP 4.0 abstract machine: a host device (processors + host memory) connected to target devices (processors + device memory)
14. Computation and data offloading
▪ Directive: #pragma omp target device(id) map() if()
• target: create a device data environment and offload the enclosed computation to run (initially sequentially) on that device
• device(int_exp): specify a target device
• map(to|from|tofrom|alloc: var_list): data mapping between the current task's data environment and a device data environment
▪ Directive: #pragma omp target data device(id) map() if()
• Create a device data environment
(figure) A CPU thread encounters omp target; accelerator threads execute the enclosed omp parallel region; control then returns to the CPU thread. A small example follows.
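For example, the AXPY kernel from the speaker notes offloads the loop and maps its data (REAL stands for float or double):

void axpy_ompacc(REAL* x, REAL* y, int n, REAL a)
{
  int i;
  /* offload to gpu0: y is copied both ways; x, a, and n only to the device */
#pragma omp target device (gpu0) map(tofrom: y[0:n]) map(to: x[0:n],a,n)
#pragma omp parallel for
  for (i = 0; i < n; ++i)
    y[i] += a * x[i];
}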
15. Example code: Jacobi
while ((k<=mits) && (error>tol))
{
  // a loop copying u[][] to uold[][] is omitted here
  ...
#pragma omp parallel for private(resid,j,i) reduction(+:error)
  for (i=1; i<(n-1); i++)
    for (j=1; j<(m-1); j++)
    {
      resid = (ax*(uold[i-1][j] + uold[i+1][j])
             + ay*(uold[i][j-1] + uold[i][j+1]) + b*uold[i][j] - f[i][j])/b;
      u[i][j] = uold[i][j] - omega * resid;
      error = error + resid*resid;
    } // the rest of the code is omitted ...
}
Accelerator directives to be added (shown in place on the next slides):
#pragma omp target data device(gpu0) map(to: n, m, omega, ax, ay, b, f[0:n][0:m]) map(tofrom: u[0:n][0:m]) map(alloc: uold[0:n][0:m])
#pragma omp target device(gpu0) map(to: n, m, omega, ax, ay, b, f[0:n][0:m], uold[0:n][0:m]) map(tofrom: u[0:n][0:m])
16. Building the accelerator support
▪ Language: OpenMP 4.0 directives and clauses
• target, device, map, collapse, target data, ...
▪ Compiler:
• CUDA kernel generation: outliner
• CPU code generation: kernel launching
• Loop mapping (1-D vs. 2-D), loop transformations (normalize, collapse)
• Array linearization
▪ Runtime:
• Device probing, configuring execution settings
• Loop scheduler: static-even, round-robin
• Memory management: data allocation, copy, deallocation, reuse
• Reduction: contiguous reduction pattern (folding)
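As a rough sketch of what the outliner emits for the Jacobi loop nest (the kernel name OUT__1__10117__ comes from the CPU-side slide below; XOMP_accelerator_loop_default is a stand-in for the runtime loop scheduler, and details of the real generated code differ):

/* Sketch: device kernel generated for "#pragma omp parallel for" in a target region */
__global__ void OUT__1__10117__(int n, int m, float omega, float ax, float ay, float b,
                                float* _dev_per_block_error,
                                float* _dev_u, float* _dev_f, float* _dev_uold)
{
  int _dev_lower, _dev_upper;
  /* runtime loop scheduler: assign this CUDA thread a chunk of the iteration space */
  XOMP_accelerator_loop_default(1, n - 2, 1, &_dev_lower, &_dev_upper);
  float _p_error = 0.0f;  /* per-thread partial sum for the reduction */
  for (int i = _dev_lower; i <= _dev_upper; i++)
    for (int j = 1; j < m - 1; j++) {
      /* arrays are linearized: u[i][j] becomes _dev_u[i * m + j] */
      float resid = (ax * (_dev_uold[(i-1)*m + j] + _dev_uold[(i+1)*m + j])
                   + ay * (_dev_uold[i*m + (j-1)] + _dev_uold[i*m + (j+1)])
                   + b * _dev_uold[i*m + j] - _dev_f[i*m + j]) / b;
      _dev_u[i*m + j] = _dev_uold[i*m + j] - omega * resid;
      _p_error += resid * resid;
    }
  /* per-block reduction of _p_error into _dev_per_block_error[blockIdx.x] omitted;
     the host finishes the reduction across blocks */
}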
17. Example code: Jacobi using OpenMP accelerator directives
/* target data wraps the while loop so mapped arrays stay on the device across iterations */
#pragma omp target data device(gpu0) map(to: n, m, omega, ax, ay, b, f[0:n][0:m]) map(tofrom: u[0:n][0:m]) map(alloc: uold[0:n][0:m])
while ((k<=mits) && (error>tol))
{
  // a loop copying u[][] to uold[][] is omitted here
  ...
#pragma omp target device(gpu0) map(to: n, m, omega, ax, ay, b, f[0:n][0:m], uold[0:n][0:m]) map(tofrom: u[0:n][0:m])
#pragma omp parallel for private(resid,j,i) reduction(+:error)
  for (i=1; i<(n-1); i++)
    for (j=1; j<(m-1); j++)
    {
      resid = (ax*(uold[i-1][j] + uold[i+1][j])
             + ay*(uold[i][j-1] + uold[i][j+1]) + b*uold[i][j] - f[i][j])/b;
      u[i][j] = uold[i][j] - omega * resid;
      error = error + resid*resid;
    } // the rest of the code is omitted ...
}
19. CPU code: data management, execution configuration, kernel launching, reduction
xomp_deviceDataEnvironmentEnter(); /* Initialize a new data environment, push it onto a stack */
float *_dev_u = (float*) xomp_deviceDataEnvironmentGetInheritedVariable ((void*)u, _dev_u_size);
/* If not inheritable, allocate and register the mapped variable */
if (_dev_u == NULL)
{ _dev_u = ((float *)(xomp_deviceMalloc(_dev_u_size)));
  /* Register the mapped variable's original address, device address, size, and a copy-back flag */
  xomp_deviceDataEnvironmentAddVariable ((void*)u, _dev_u_size, (void*) _dev_u, true);
}
/* Execution configuration: threads per block and total number of blocks */
int _threads_per_block_ = xomp_get_maxThreadsPerBlock();
int _num_blocks_ = xomp_get_max1DBlock((n - 1) - 1);
/* Launch the CUDA kernel ... */
OUT__1__10117__<<<_num_blocks_,_threads_per_block_,(_threads_per_block_ * sizeof(float ))>>>
  (n,m,omega,ax,ay,b,_dev_per_block_error,_dev_u,_dev_f,_dev_uold);
error = xomp_beyond_block_reduction_float(_dev_per_block_error,_num_blocks_,6);
/* Copy back (optionally) and deallocate variables mapped within this environment, pop the stack */
xomp_deviceDataEnvironmentExit();
20. Preliminary results: AXPY (Y = a*X + Y)
(chart) AXPY execution time (s) vs. vector size (float, 5,000 to 500,000,000) for Sequential, OpenMP FOR (16 threads), HOMP ACC, PGI OpenACC, and HMPP OpenACC.
Hardware configuration:
• 4 quad-core Intel Xeon processors (16 cores) at 2.27 GHz with 32 GB DRAM
• NVIDIA Tesla K20c GPU (Kepler architecture)
Software configuration:
• HOMP (GCC 4.4.7 and CUDA 5.0)
• PGI OpenACC compiler version 13.4
• HMPP OpenACC compiler version 3.3.3
21. Matrix multiplication
(chart) Matrix multiplication execution time (s) vs. matrix size (float, 128x128 to 4096x4096) for Sequential, OpenMP (16 threads), PGI, HMPP, HOMP, HMPP Collapse, and HOMP Collapse.
*HOMP-collapse slightly outperforms HMPP-collapse when linearization with round-robin scheduling is used.
22. Jacobi
(chart) Jacobi execution time (s) vs. matrix size (float, 128x128 to 2048x2048) for Sequential, HMPP, PGI, HOMP, OpenMP, HMPP Collapse, and HOMP Collapse.
23. HOMP: multiple versions of Jacobi
(chart) Jacobi execution time (s) vs. matrix size (float, 128x128 to 2048x2048) for incrementally optimized versions:
• first version
• target-data (data reuse across iterations)
• loop collapse using linearization with static-even scheduling
• loop collapse using 2-D mapping (16x16 block)
• loop collapse using 2-D mapping (8x32 block)
• loop collapse using linearization with round-robin scheduling
24. Case 1: conclusion
▪ Many fundamental source-to-source translations
• CUDA kernel generation, loop transformation, data management, reduction, loop scheduling, etc.
• Reusable across CPUs and accelerators
▪ Quick & competitive results:
• First implementation in the world in 2013
– Interactions with the IBM, TI, and GCC compiler teams
• Early feedback to the OpenMP committee
– Multiple device support
– Per-device default loop scheduling policy
– Productivity: combined directives
– New clause: no-middle-sync
– Global barrier: #pragma omp barrier [league|team]
• Extensions to support multiple GPUs in 2015
25. Case 2: performance via whole-program autotuning
▪ Problem: increasingly difficult to conduct performance optimization for large-scale applications
• Hardware: multicore, GPU, FPGA, Cell, Blue Gene
• Empirical optimization (autotuning):
– so far limited to small kernels and domain-specific libraries
▪ Our objectives:
• Research into autotuning for optimizing large-scale applications
– sequential and parallel
• Develop tools to automate the life cycle of autotuning
– Outlining, parameterized transformations (loop tiling, unrolling); see the sketch below
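To make "parameterized transformation" concrete, here is a minimal illustration of the shape of a generated variant (not the ROSE implementation itself); TILE and UNROLL are the hypothetical knobs a search engine would vary:

/* Hypothetical generated variant: tiling + unrolling of a vector-scale loop.
   The body below is written for UNROLL = 4. */
#define TILE   256   /* tuning parameter: tile size */
#define UNROLL 4     /* tuning parameter: unroll factor */

void scale(double *a, double s, int n)
{
  for (int t = 0; t < n; t += TILE) {             /* loop tiling */
    int end = (t + TILE < n) ? t + TILE : n;
    int i;
    for (i = t; i + UNROLL <= end; i += UNROLL) { /* unrolled body */
      a[i]   *= s;
      a[i+1] *= s;
      a[i+2] *= s;
      a[i+3] *= s;
    }
    for (; i < end; i++)                          /* remainder loop */
      a[i] *= s;
  }
}

The autotuner compiles and times one such variant per (TILE, UNROLL) point and keeps the best.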
26. An end-to-end whole-program autotuning system
(diagram) A performance tool measures the application and, through a tool interface, feeds performance information to code triage, which selects focus kernels. Kernel extraction (outlining) produces the kernels; parameterized translation generates variants over a search space; a search engine picks the next variant to evaluate. The kernel extraction and parameterized translation components are ROSE-based.
27. Kernel extraction: outlining
▪ Outlining:
• Form a function (the outlined function) from a code segment (the outlining target)
• Replace the code segment with a call to that function
▪ A critical step for the success of an end-to-end autotuning system
• Feasibility:
– whole-program tuning → kernel tuning
• Relevance:
– Ensure semantic correctness
– Preserve performance characteristics
28. Example: Jacobi outlined using a classic algorithm
Original program:
int n,m;
double u[MSIZE][MSIZE], f[MSIZE][MSIZE], uold[MSIZE][MSIZE];
void main()
{
  int i,j;
  double omega,error,resid,ax,ay,b;
  // initialization code omitted here
  for (i=1; i<(n-1); i++)
    for (j=1; j<(m-1); j++) {
      resid = (ax * (uold[i-1][j] + uold[i+1][j]) + ay * (uold[i][j-1]
             + uold[i][j+1]) + b * uold[i][j] - f[i][j])/b;
      u[i][j] = uold[i][j] - omega * resid;
      error = error + resid*resid;
    }
  error = sqrt(error)/(n*m);
  // verification code omitted here ...
}

After classic outlining:
// .... some code is omitted
void OUT__1__4027__(int *ip__, int *jp__, double omega, double *errorp__, double *residp__, double ax, double ay, double b)
{
  for (*ip__=1; *ip__<(n-1); (*ip__)++)
    for (*jp__=1; *jp__<(m-1); (*jp__)++)
    {
      *residp__ = (ax * (uold[*ip__-1][*jp__] + uold[*ip__+1][*jp__])
                 + ay * (uold[*ip__][*jp__-1] + uold[*ip__][*jp__+1])
                 + b * uold[*ip__][*jp__] - f[*ip__][*jp__])/b;
      u[*ip__][*jp__] = uold[*ip__][*jp__] - omega * (*residp__);
      *errorp__ = *errorp__ + (*residp__) * (*residp__);
    }
}
void main()
{
  int i,j;
  double omega,error,resid,ax,ay,b;
  //...
  OUT__1__4027__(&i,&j,omega,&error,&resid,ax,ay,b);
  error = sqrt(error)/(n*m);
  //...
}

• Classic algorithm: all variables become parameters; written variables are passed by address and used via pointer dereferencing
• Problem: 8 parameters and pointer dereferences for written variables in C
• Disturbs performance
• Impedes compiler analysis
29. Our effective outlining algorithm
▪ Rely on source-level code analysis
• Scope information: global, local, or in-between
• Liveness analysis: whether a variable's value at a given point will be used in the future (transferred to a future use)
• Side effects: variables being read or written
▪ Decide two things about parameters
• The minimum number of parameters to be passed
• The minimum set of pass-by-reference parameters
Parameters = (AllVars − InnerVars − GlobalVars − NamespaceVars) ∩ (LiveInVars ∪ LiveOutVars)
PassByRefParameters = Parameters ∩ ((ModifiedVars ∩ LiveOutVars) ∪ ArrayVars)
30. Eliminating pointer dereferences
▪ Pass-by-reference variables are handled by classic algorithms via pointer dereferences
▪ We use a novel method: variable cloning
• Check whether such a variable is used by address:
– C: &x; C++: T& y = x; or foo(x) when foo(T&)
• Use a clone variable if x is NOT used by address
CloneCandidates = PassByRefParameters ∩ PointerDereferencedVars
CloneVars = (CloneCandidates − UseByAddressVars) ∩ AssignableVars
CloneVarsToInit = CloneVars ∩ LiveInVars
CloneVarsToSave = CloneVars ∩ LiveOutVars
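In code form, the transformation for a pass-by-reference variable that is both live-in and live-out looks like this (a minimal sketch mirroring the error variable on the next slide):

void kernel(double *errorp__)   /* pass-by-reference parameter */
{
  double error = *errorp__;     /* clone initialized from the live-in value */
  /* ... compute with the plain local 'error' instead of '*errorp__' ... */
  error = error + 1.0;          /* placeholder for the real computation */
  *errorp__ = error;            /* live-out value saved back through the pointer */
}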
31. Classic vs. effective outlining
Classic algorithm:
void OUT__1__4027__(int *ip__, int *jp__, double omega, double *errorp__, double *residp__, double ax, double ay, double b)
{
  for (*ip__=1; *ip__<(n-1); (*ip__)++)
    for (*jp__=1; *jp__<(m-1); (*jp__)++)
    {
      *residp__ = (ax * (uold[*ip__-1][*jp__] + uold[*ip__+1][*jp__])
                 + ay * (uold[*ip__][*jp__-1] + uold[*ip__][*jp__+1])
                 + b * uold[*ip__][*jp__] - f[*ip__][*jp__])/b;
      u[*ip__][*jp__] = uold[*ip__][*jp__] - omega * (*residp__);
      *errorp__ = *errorp__ + (*residp__) * (*residp__);
    }
}

Our effective algorithm:
void OUT__1__5058__(double omega, double *errorp__, double ax, double ay, double b)
{
  int i, j;          /* neither live-in nor live-out */
  double resid;      /* neither live-in nor live-out */
  double error;      /* clone for a live-in and live-out parameter */
  error = *errorp__; /* initialize the clone */
  for (i = 1; i < (n - 1); i++)
    for (j = 1; j < (m - 1); j++) {
      resid = (ax * (uold[i - 1][j] + uold[i + 1][j]) +
               ay * (uold[i][j - 1] + uold[i][j + 1]) +
               b * uold[i][j] - f[i][j]) / b;
      u[i][j] = uold[i][j] - omega * resid;
      error = error + resid * resid;
    }
  *errorp__ = error; /* save the value of the clone */
}
33. Whole-program autotuning example
The outlined SMG2000 kernel (simplified version):
for (si = 0; si < stencil_size; si++)
  for (k = 0; k < hypre__mz; k++)
    for (j = 0; j < hypre__my; j++)
      for (i = 0; i < hypre__mx; i++)
        rp[ri + i + j * hypre__sy3 + k * hypre__sz3] -=
          Ap_0[i + j * hypre__sy1 + k * hypre__sz1 + A->data_indices[m][si]] *
          xp_0[i + j * hypre__sy2 + k * hypre__sz2 + dxp_s[si]];
▪ SMG2000 (semicoarsening multigrid solver)
• 28k lines of C code, stencil computation
• one kernel accounts for ~45% of execution time on a 120x120x129 data set
▪ HPCToolkit (Rice) for measurement; the GCO search engine (UTK)
35. Case 2: conclusion
▪ Performance optimization via whole-program tuning
• Key components: outliner, parameterized transformations (skipped in this talk)
• ROSE outliner: improved algorithm based on source code analysis
– Preserves performance characteristics
– Facilitates further processing of the kernels by other tools
▪ Difficult to automate
• Code triage: choosing the right scope of a kernel; specification
• Search space: optimizations/configurations for each kernel
36. Case 3: resilience via source-level redundancy
▪ Exascale systems are expected to use a large number of components running at low power
• Exa = 10^18; at 2 GHz (~10^9 operations/s) per core → on the order of 10^9 cores
• Increased sensitivity to external events
• Transient faults caused by external events
– Faults that don't stop execution but silently corrupt results
– Compare with permanent faults
▪ Streamlined, simple processors
• Cannot afford pure hardware-based resilience
• Solution: software-implemented hardware fault tolerance
37. Approach: compiler-based transformations to add resilience
▪ Source code annotations (pragmas)
• What to protect
• What to do when things go wrong
▪ Source-to-source translator (ROSE::FTTransform)
• Fault detection: triple-modular redundancy (TMR)
• Implements fault-handling policies
• Vendor compiler: produces the binary executable
• Intel PIN: fault injection
▪ Challenges
• Overhead
• Vendor compiler optimizations
(diagram) Annotated source code → ROSE::FTTransform → transformed source code → vendor compiler → binary executable → fault injection (Intel PIN)
39. Source code redundancy: reduce overhead
(code figure) The statement to be protected is duplicated; relying on a baseline of double-modular redundancy (DMR) helps reduce overhead.
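The deck's code figure is not preserved; a minimal sketch of what DMR-style protection of one statement can look like (ft_fault_handler and the statement itself are hypothetical; the actual ROSE::FTTransform output differs):

extern void ft_fault_handler(void);  /* hypothetical handler implementing the fault policy */

void protected_update(double a, double x, double b, double *result)
{
  double y  = a * x + b;   /* statement to be protected */
  double y2 = a * x + b;   /* redundant copy (DMR) */
  if (y != y2)
    ft_fault_handler();    /* results disagree: a transient fault was detected */
  *result = y;
}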
40. Source code redundancy: bypass compiler optimizations
(code figure) An extra pointer is used to help preserve the source-level redundancy of the protected statement, so the optimizer cannot remove the duplicate computation.
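Again a sketch only (the figure is not preserved, and the actual ROSE::FTTransform scheme differs in detail): routing the duplicate's input through a volatile-qualified pointer keeps the optimizer from proving the two computations identical and deleting one.

extern void ft_fault_handler(void);   /* hypothetical fault-handling policy */

void protected_update(double a, double x, double b, double *result)
{
  double * volatile px = &x;  /* extra pointer: the optimizer must re-read *px,
                                 so the two computations cannot be merged */
  double y  = a * x + b;      /* statement to be protected */
  double y2 = a * (*px) + b;  /* duplicate computed through the pointer */
  if (y != y2)
    ft_fault_handler();
  *result = y;
}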
41. Results: necessity and effectiveness of optimizer-proof code redundancy
Transformation method      | PAPI_FP_INS (O1) | PAPI_FP_INS (O2) | PAPI_FP_INS (O3)
Our method using pointers  | Doubled          | Doubled          | Doubled
Naïve duplication          | No change        | No change        | No change
▪ Hardware counters check whether the redundant computation survives compiler optimizations
• PAPI (PAPI_FP_INS) collects the number of floating-point instructions for the protected kernel
• GCC 4.3.4, O1 to O3
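For reference, counting floating-point instructions around a kernel with PAPI's classic high-level API looks roughly like this (a sketch; protected_kernel is a placeholder, and newer PAPI versions replace these calls):

#include <papi.h>

int events[1] = { PAPI_FP_INS };
long long counts[1];

PAPI_start_counters(events, 1);   /* start counting FP instructions */
protected_kernel();               /* the transformed kernel under test */
PAPI_stop_counters(counts, 1);    /* counts[0] now holds the FP instruction count */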
42. Results: overhead
▪ Jacobi kernel:
• Overhead: 0% to 30%
• Minimum overhead at:
– stride = 8
– array size 16K x 16K
– double precision
• The more memory latency the original code has, the lower the overhead of the added redundancy
▪ Livermore loops:
– Kernel 1 (hydro fragment): 20%
– Kernel 4 (banded linear equations): 40%
– Kernel 5 (tri-diagonal elimination): 26%
– Kernel 11 (first sum): 2%
43. Conclusion
▪ HPC challenges
• Traditional: performance, productivity, portability
• Newer: heterogeneity, resilience, power efficiency
• Made much worse by the increasing complexity of computation-node architectures
▪ Source-to-source compilation techniques
• Complement traditional compilers aimed at generating machine code
– Source code analysis, transformation, optimization, generation, ...
• High-level programming models for GPUs
• Performance tuning: code extraction and parameterized code translation/generation
• Resilience via source-level redundancy
44. Future directions
▪ Accelerator optimizations
• Newer accelerator features: direct GPU-to-GPU communication, peer-to-peer execution model, unified virtual memory
• CPU threads vs. accelerator threads vs. vectorization
▪ Knowledge-driven compilation and runtime adaptation
• Explicitly represent optimization knowledge and automatically apply it in HPC
▪ A uniform directive-based programming model
• A single X instead of MPI+X to cover both on-node and across-node parallelism
▪ Thread-level resilience
• Most resilience work focuses on MPI, not threading
45. Acknowledgements
▪ Sponsors
• LLNL (LDRD, ASC), DOE (SC), DOD
▪ Colleagues
• Dan Quinlan (ROSE project lead)
• Pei-Hung Lin (postdoc)
▪ Collaborators
• Mary Hall (University of Utah)
• Qing Yi (University of Colorado at Colorado Springs)
• Robert Lucas (University of Southern California, ISI)
• Sally A. McKee (Chalmers University of Technology)
• Stephen Guzik (Colorado State University)
• Xipeng Shen (North Carolina State University)
• Yonghong Yan (Oakland University)
• ...
Editor's notes
Parallel scientific computing in climate modeling, weather prediction, energy, material science, etc.
Traditional compilers: GCC, ICC, Open64
Aimed at generating machine executables
Lose high-level source code information
Black box to application developers
Multiple backends for multiple platforms
A source-to-source compiler infrastructure
Generates analyzed and/or transformed source code
Maintains high-level source structure and information (comments, preprocessing info, etc.)
Friendly to application developers: easy to understand
Portable across multiple platforms by leveraging backend compilers
ROSE: making compiler techniques more accessible
Source-to-source compiler infrastructure
High-level intermediate representation (abstract syntax tree, AST) with rich traversal/transformation interfaces
Supports C/C++, Fortran, OpenMP, UPC
Encapsulates sophisticated compiler analyses/optimizations into easy-to-use function calls
Intended users: people building customized program analysis and translation tools
The ROSE IR consists of the AST, a set of associated symbol tables, and a rich set of interface functions.
As a source-to-source compiler framework, the ROSE AST is a high-level representation designed to preserve source code details such as the token stream, source comments, C preprocessor info, and C++ templates.
Rich interfaces for: AST traversal, query, creation, copying, symbol lookup, generic analyses, transformations.
It is also extensible: users can attach arbitrary information to the AST via a mechanism called persistent attributes.
As this slide shows, a program's source code has a very straightforward and faithful tree representation in ROSE. As a result, ROSE is ideal for quickly building customized analysis and translation tools to support source code analysis and restructuring.
Discover/design/develop building blocks in the software stack
Language level
target, device(), map(), ...
num_devices, device_type(), no-middle-sync, ...
Compiler level
Pragma parsing
Parser building blocks
AST extensions
CUDA-specific IR nodes
AST translation: SageBuilder and SageInterface functions
Loop normalization, multi-dimensional array linearization
Outliner: CUDA kernel generation
Runtime level
Device feature probing: threads per block, block number, ...
Memory management: allocation, copy, data tracking/reuse, etc.
Loop scheduling: within and across thread blocks
Reduction: a first two-level algorithm
Use the latest official release for key concepts: a one-slide primer
Language features
Target regions
Parallel regions and teams
Data mapping
Loop and distribute directives
C bindings: interoperate with C/C++ and Fortran
Conditional bodies: support both CUDA and OpenCL (ongoing work)
*C. Liao, D. J. Quinlan, T. Panas, and B. R. de Supinski, "A ROSE-based OpenMP 3.0 research compiler supporting multiple runtime libraries," in Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More (IWOMP'10), Springer, 2010, pp. 15-28.
Three points:
Start with the traditional parallel for
Add the first target directive with map clauses
To avoid repetitive data allocation and copying, add target data at the while-loop level
Four points:
Outliner extended to generate CUDA kernels
Invoke the runtime scheduler
Array linearization
Inner-block reduction
For the call site: data preparation, configuration, launch, cleanup
Runtime support for tracking the data environment: a stack of data structures storing the data allocated at each level
Runtime data allocation and registration
Execution configuration
CPU-side reduction
Points:
* Software/hardware configuration
* PGI OpenACC, HMPP OpenACC
An accelerator is not always the better choice
void axpy_ompacc(REAL* x, REAL* y, int n, REAL a) {
  int i;
#pragma omp target device (gpu0) map(tofrom: y[0:n]) map(to: x[0:n],a,n)
#pragma omp parallel for
  for (i = 0; i < n; ++i)
    y[i] += a * x[i];
}
The PGI collapse version does not give any performance improvement
i-j-k order, scalar replacement
PGI collapse does not work with reduction
The Heterogeneous OpenMP (HOMP) programming model developed using the building blocks of this project has delivered competitive performance for the important stencil computation kernel, Jacobi. This figure shows the improvements in execution time from incrementally using more advanced compiler and runtime building blocks. The original version uses a minimum set of building blocks. The target-data version adds data reuse. The remaining versions use loop collapse, 2-D mapping, and different scheduling policies.
Current: device(id), portability concern
Desired: device_type(), num_devices(), dist_data()
Avoid cumbersome explicit teams
May need a per-device default scheduling policy
omp target parallel: avoid the initial implicit sequential task
Our work is motivated by a grand-challenge problem today:
With radically different and diverse platforms, such as multicore, GPU, FPGA, etc., building conventional compiler optimizations for each of them is no longer an effective solution.
On the other hand, empirical tuning seems to be a very promising approach since it usually does not rely on accurate machine models.
Unfortunately, most existing autotuning work only handles small kernels and domain-specific libraries.
Thus, the goal of this work is to study what it takes to extend autotuning to handle general large-scale applications.
In particular, we want to build automated tools to support empirical tuning of whole applications.
------------------
%It mostly chooses the best optimization based on actual measured performance, regardless of hardware details.
%Building model-driven static compiler optimizations for them is no longer an effective solution. On the other hand,
It is difficult to accurately model each complex platform.
Static compiler optimizations: limited success
Software: C/C++/Fortran, MPI, OpenMP, UPC ...
source-to-source
A classic algorithm will generate an outlined kernel like this:
Written variables: pass their addresses, use them via pointer dereferences
There are at least two serious problems with the classic outlining algorithm.
One is that it may change the performance characteristics of the transformed application: pointer dereferencing is often more expensive than using straightforward non-pointer-typed variables.
The other is that it significantly impedes static analysis due to the well-known limitations of accurate pointer analysis.
Consequently, kernels with excessive pointer references are hard to further analyze and transform to generate multiple variants.
From our experience, none of the SciDAC PERI tools could handle the kernel mentioned above, so it is a serious problem.
------------------
Hard to optimize; difficult for other tools to handle (parameterized translation).
%This slide shows the Jacobi program after its loop computation kernel is outlined using a classic algorithm.
For the parameters, we use variable scope, liveness, and side-effect analysis to minimize the number of parameters to be passed.
Further, we decide how to pass each parameter, either by value or by reference, based on the formula here.
Modified and live-out variables: their modified values must be passed back; arrays are added due to language limitations and class/structure variables for efficiency.
Parameters = ((AllVars − InnerVars − GlobalVars − NamespaceVars − ClassVars) ∩ (LiveInVars ∪ LiveOutVars)) ∪ ClassPointers
PassByRefParameters = Parameters ∩ ((ModifiedVars ∩ LiveOutVars) ∪ ArrayVars ∪ ClassVars)
Once the outliner ensures that a target can be correctly outlined.
In conclusion, manual code specification/translation may be needed.
Second-chance: retry until reaching the predefined threshold, then use the next policy.
Final-wish: only one chance to handle the fault using the next policy; a statement can be injected before the policy is called.
* Jacob Lidman, Daniel J. Quinlan, Chunhua Liao, and Sally A. McKee, "ROSE::FTTransform — A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research," Fault Tolerance for HPC at Extreme Scale (FTXS 2012), Boston, June 25-28, 2012.
This overhead is obviously heavily influenced by the memory latencies of the original code: the more latency, the better hidden the cost of the duplicated computation. The minimum overhead is observed when the Jacobi iteration uses a non-consecutive stride, a large data set, and a large element size.