A Source-To-Source Approach to HPC Challenges
LLNL-PRES-668361
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Lawrence Livermore National Laboratory 2
Agenda
 Background
• HPC challenges
• Source-to-source compilers
 Source-to-source compilers to address HPC
challenges
• Programming models for heterogeneous computing
• Performance via autotuning
• Resilience via source level redundancy
 Conclusion
 Future Interests
Lawrence Livermore National Laboratory 3
HPC Challenges
 HPC: High Performance
Computing
• DOE applications: climate
modeling, combustion, fusion
energy, biology, material
science, etc.
• Network connected (clustered)
computation nodes
• Increasing node complexity
 Challenges
• Parallelism, performance,
productivity, portability
• Heterogeneity, power
efficiency, resilience
J. A. Ang, et al. Abstract machine models and proxy architectures for exascale (10^18 FLOPS) computing. Co-HPC '14
Lawrence Livermore National Laboratory 4
Compilers: traditional vs. source-to-source
Traditional compiler: source code → frontend (parsing) → middle end (analysis & transformation & optimizations) → backend (code generation) → machine code
Source-to-source compiler: source code → frontend (parsing) → middle end (analysis & transformation & optimizations) → backend (unparsing) → source code
Lawrence Livermore National Laboratory 5
Why Source-to-Source?
Traditional compilers (GCC, ICC, LLVM, ...) vs. source-to-source compilers (ROSE, Cetus, ...):
• Goal: traditional compilers use compiler technology to generate fast & efficient executables for a target platform; source-to-source compilers make compiler technology accessible by enabling users to build customized source code analysis and transformation compilers/tools.
• Users: programmers vs. researchers and developers in the compiler, tools, and application communities.
• Input: programs in multiple languages for both.
• Internal representation: multiple levels of intermediate representation, losing source code structure and information, vs. an abstract syntax tree that keeps all source-level structure and information (comments, preprocessing info, etc.) and is extensible.
• Analyses & optimizations: internal use only, a black box to users, vs. individually exposed to users, composable and customizable.
• Output: machine executables, hard to read, vs. human-readable, compilable source code.
• Portability: multiple code generators for multiple platforms vs. a single code generator for all platforms, leveraging traditional backend compilers.
Lawrence Livermore National Laboratory 6
ROSE Compiler Infrastructure
Pipeline: C/C++/Fortran and OpenMP/UPC source code → EDG front-end / Open Fortran Parser → IR (AST) → unparser → analyzed/transformed source → backend compiler → executable
Analyses available to ROSE-based source-to-source tools: control flow, data dependence, system dependence
2009 award winner. www.roseCompiler.org
Lawrence Livermore National Laboratory 7
ROSE IR = AST + symbol tables +
interface
Simplified AST
without symbol tables
int main()
{
int a[10];
for(int i=0;i<10;i++)
a[i]=i*i;
return 0;
}
ROSE AST (abstract syntax tree)
• High level: preserve all source level details
• Easy to use: rich set of interface functions for
analysis and transformation
• Extensible: user defined persistent attributes
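To make the interface concrete, a minimal ROSE-based tool might look like the sketch below (illustrative, not from the slides): it parses the input into the AST, visits nodes with AstSimpleProcessing, and unparses the result back to compilable source.

// Hedged sketch of a minimal ROSE-based source-to-source tool.
#include "rose.h"
#include <cstdio>

// Count for-loops by visiting every AST node in preorder.
class ForLoopCounter : public AstSimpleProcessing {
public:
  int count;
  ForLoopCounter() : count(0) {}
  virtual void visit(SgNode* node) {
    if (isSgForStatement(node) != NULL)
      count++;
  }
};

int main(int argc, char* argv[]) {
  SgProject* project = frontend(argc, argv);      // parse input files into the ROSE AST
  ForLoopCounter counter;
  counter.traverseInputFiles(project, preorder);  // walk the AST of the input files
  printf("for-loops found: %d\n", counter.count);
  return backend(project);                        // unparse and invoke the backend compiler
}

Run against a source file, this prints the loop count and regenerates the (here unchanged) code through the unparser; a transformation tool would edit the AST before calling backend().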
Lawrence Livermore National Laboratory 8
Using ROSE to help address HPC
challenges
 Parallel programming model research
• MPI, OpenMP, OpenACC, UPC, Co-Array Fortran, etc.
• Domain-specific languages, etc.
 Program optimization/translation tools
• Automatic parallelization to migrate legacy codes
• Outlining and parameterized code generation to support autotuning
• Resilience via code redundancy
• Application skeleton extraction for Co-Design
• Reverse computation for discrete event simulation
• Bug seeding for testing
 Program analysis tools
• Programming styles and coding standard enforcement
• Error checking for serial, MPI and OpenMP programs
• Worst-case execution time analysis
Lawrence Livermore National Laboratory 9
Case 1: high-level programming models
for heterogeneous computing
 Problem
• Heterogeneity challenge of exascale
• Naïve approach: CPU code + accelerator code + data transfers in between
 Approach
• Develop high-level, directive based
programming models
— Automatically generate low level
accelerator code
• Prototype and extend the OpenMP
4.0 accelerator model
— Result: HOMP (Heterogeneous
OpenMP) compiler
J. A. Ang, et al. Abstract machine models and proxy architectures for exascale (10^18 FLOPS) computing. Co-HPC '14
Lawrence Livermore National Laboratory 10
Programming models bridge algorithms and machines and are implemented through components of the software stack
Measures of success:
• Expressiveness
• Performance
• Programmability
• Portability
• Efficiency
•…
Diagram: an application/algorithm is expressed in a programming model targeting an abstract machine; the software stack (language, compiler, library) compiles/links it into an executable that runs on a real machine.
Lawrence Livermore National Laboratory 11
Parallel programming models are built on top of
sequential ones and use a combination of
language/compiler/library support
Abstract machine (overly simplified):
• Sequential: one CPU + memory
• Shared memory: multiple CPUs sharing one memory
• Distributed memory: CPU + memory nodes connected by an interconnect
Programming model:
• Sequential
• Parallel: shared memory (e.g. OpenMP), distributed memory (e.g. MPI), …
Software stack:
• Sequential: general-purpose languages (GPL) C/C++/Fortran, a sequential compiler, optional sequential libraries
• Shared memory (OpenMP): GPL + directives, a sequential compiler with OpenMP support, an OpenMP runtime library
• Distributed memory (MPI): GPL + calls to MPI libraries, a sequential compiler, an MPI library
Lawrence Livermore National Laboratory 12
OpenMP Code
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
int main(void)
{
#pragma omp parallel
printf("Hello,world! \n");
return 0;
}
…
//PI calculation
#pragma omp parallel for reduction (+:sum) private (x)
for (i=1;i<=num_steps;i++)
{
x=(i-0.5)*step;
sum=sum+ 4.0/(1.0+x*x);
}
pi=step*sum;
..
Layers involved: programmer → compiler → runtime library → OS threads
Lawrence Livermore National Laboratory 13
OpenMP 4.0’s accelerator model
 Device: a logical execution engine
• Host device: where OpenMP program
begins, one only
• Target devices: 1 or more accelerators
 Memory model
• Host data environment: one
• Device data environment: one or more
• Allow shared host and device memory
 Execution model: Host-centric
• Host device : “offloads” code regions
and data to target devices
• Target devices: fork-join model
• Host waits until devices finish
• Host executes device regions if no
accelerators are available /supported
Figure (OpenMP 4.0 abstract machine): a host device and one or more target devices, each with its own processors and memory (host memory, device memories), optionally backed by shared storage.
Lawrence Livermore National Laboratory 14
Computation and data offloading
 Directive: #pragma omp target device(id) map() if()
• target: create a device data environment and offload computation to be run sequentially on the same device
• device (int_exp): specify a target device
• map(to|from|tofrom|alloc:var_list) : data mapping between the current
task’s data environment and a device data environment
 Directive: #pragma omp target data device(id) map() if()
• Create a device data environment
Figure: a CPU thread encounters omp target and offloads; accelerator threads execute the omp parallel region (fork-join); execution then returns to the CPU thread.
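A minimal sketch of how the two directives combine (illustrative, not from the slides; assumes device id 0 and OpenMP 4.0 default semantics), here offloading AXPY:

// Hedged sketch: offloading AXPY (y = a*x + y) with the directives above.
void axpy(float a, float* x, float* y, int n) {
  // Create a device data environment: x is copied to the device,
  // y is copied to the device and back when the region ends.
  #pragma omp target data device(0) map(to: x[0:n]) map(tofrom: y[0:n])
  {
    // Offload the loop; the device team executes it fork-join style.
    #pragma omp target device(0) map(to: a, n)
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
      y[i] = a * x[i] + y[i];
  }
}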
Lawrence Livermore National Laboratory 15
Example code: Jacobi
while ((k<=mits)&&(error>tol))
{
// a loop copying u[][] to uold[][] is omitted here
…
#pragma omp parallel for private(resid,j,i) reduction(+:error)
for (i=1;i<(n-1);i++)
for (j=1;j<(m-1);j++)
{
resid = (ax*(uold[i-1][j] + uold[i+1][j])
+ ay*(uold[i][j-1] + uold[i][j+1])+ b * uold[i][j] - f[i][j])/b;
u[i][j] = uold[i][j] - omega * resid;
error = error + resid*resid ;
} // the rest code omitted ...
}
#pragma omp target data device (gpu0) map(to:n, m, omega, ax, ay, b, \
    f[0:n][0:m]) map(tofrom:u[0:n][0:m]) map(alloc:uold[0:n][0:m])
#pragma omp target device(gpu0) map(to:n, m, omega, ax, ay, b, f[0:n][0:m], \
    uold[0:n][0:m]) map(tofrom:u[0:n][0:m])
Lawrence Livermore National Laboratory 16
Building the accelerator support
 Language: OpenMP 4.0 directives and clauses
• target, device, map, collapse, target data, …
 Compiler:
• CUDA kernel generation: outliner
• CPU code generation: kernel launching
• Loop mapping (1-D vs. 2-D), loop transformation (normalize, collapse)
• Array linearization
 Runtime:
• Device probing, configure execution settings
• Loop scheduler: static even, round-robin
• Memory management: data allocation, copy, deallocation, reuse
• Reduction: contiguous reduction pattern (folding)
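As a rough illustration of the "static even" loop scheduler listed above (the real XOMP_accelerator_loop_default implementation may differ), each device thread can derive its contiguous chunk of the iteration space as follows:

// Hedged sketch of static-even scheduling: thread tid of nthreads gets one
// contiguous chunk of the step-1 iteration space [lb, ub].
void static_even_chunk(long lb, long ub, long tid, long nthreads,
                       long* lower, long* upper) {
  long n = ub - lb + 1;                 // total iterations
  long chunk = n / nthreads;
  long rem = n % nthreads;              // first 'rem' threads get one extra iteration
  *lower = lb + tid * chunk + (tid < rem ? tid : rem);
  *upper = *lower + chunk - 1 + (tid < rem ? 1 : 0);
}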
Lawrence Livermore National Laboratory 17
Example code: Jacobi using
OpenMP accelerator directives
while ((k<=mits)&&(error>tol))
{
// a loop copying u[][] to uold[][] is omitted here
…
#pragma omp parallel for private(resid,j,i) reduction(+:error)
for (i=1;i<(n-1);i++)
for (j=1;j<(m-1);j++)
{
resid = (ax*(uold[i-1][j] + uold[i+1][j])
+ ay*(uold[i][j-1] + uold[i][j+1])+ b * uold[i][j] - f[i][j])/b;
u[i][j] = uold[i][j] - omega * resid;
error = error + resid*resid ;
} // the rest code omitted ...
}
#pragma omp target data device (gpu0) map(to:n, m, omega, ax, ay, b, \
    f[0:n][0:m]) map(tofrom:u[0:n][0:m]) map(alloc:uold[0:n][0:m])
#pragma omp target device(gpu0) map(to:n, m, omega, ax, ay, b, f[0:n][0:m], \
    uold[0:n][0:m]) map(tofrom:u[0:n][0:m])
Lawrence Livermore National Laboratory 18
GPU kernel generation: outliner,
scheduler, array linearization, reduction
__global__ void OUT__1__10117__(int n, int m, float omega, float ax, float ay, float b,
                                float *_dev_per_block_error, float *_dev_u, float *_dev_f, float *_dev_uold)
{ /* local variables for loop, reduction, etc. */
  int _p_j; float _p_error; _p_error = 0; float _p_resid;
  int _dev_i, _dev_lower, _dev_upper;
  /* Static even scheduling: obtain loop bounds for current thread of current block */
  XOMP_accelerator_loop_default (1, n-2, 1, &_dev_lower, &_dev_upper);
  for (_dev_i = _dev_lower; _dev_i <= _dev_upper; _dev_i ++) {
    for (_p_j = 1; _p_j < (m - 1); _p_j++) { /* replace with device variables, linearize 2-D array accesses */
      _p_resid = (((((ax * (_dev_uold[(_dev_i - 1) * 512 + _p_j] + _dev_uold[(_dev_i + 1) * 512 + _p_j]))
                + (ay * (_dev_uold[_dev_i * 512 + (_p_j - 1)] + _dev_uold[_dev_i * 512 + (_p_j + 1)])))
                + (b * _dev_uold[_dev_i * 512 + _p_j])) - _dev_f[_dev_i * 512 + _p_j]) / b);
      _dev_u[_dev_i * 512 + _p_j] = (_dev_uold[_dev_i * 512 + _p_j] - (omega * _p_resid));
      _p_error = (_p_error + (_p_resid * _p_resid));
  } } /* thread block level reduction for float type */
  xomp_inner_block_reduction_float(_p_error,_dev_per_block_error,6);
}
Lawrence Livermore National Laboratory 19
CPU Code: data management, exe
configuration, launching, reduction
xomp_deviceDataEnvironmentEnter(); /* Initialize a new data environment, push it to a stack */
float *_dev_u = (float*) xomp_deviceDataEnvironmentGetInheritedVariable ((void*)u, _dev_u_size);
/* If not inheritable, allocate and register the mapped variable */
if (_dev_u == NULL)
{ _dev_u = ((float *)(xomp_deviceMalloc(_dev_u_size)));
  /* Register this mapped variable's original address, device address, size, and a copy-back flag */
  xomp_deviceDataEnvironmentAddVariable ((void*)u, _dev_u_size, (void*) _dev_u, true);
}
/* Execution configuration: threads per block and total number of blocks */
int _threads_per_block_ = xomp_get_maxThreadsPerBlock();
int _num_blocks_ = xomp_get_max1DBlock((n - 1) - 1);
/* Launch the CUDA kernel ... */
OUT__1__10117__<<<_num_blocks_,_threads_per_block_,(_threads_per_block_ * sizeof(float ))>>>
  (n,m,omega,ax,ay,b,_dev_per_block_error,_dev_u,_dev_f,_dev_uold);
error = xomp_beyond_block_reduction_float(_dev_per_block_error,_num_blocks_,6);
/* Copy back (optionally) and deallocate variables mapped within this environment, pop stack */
xomp_deviceDataEnvironmentExit();
Lawrence Livermore National Laboratory 20
Preliminary results: AXPY (Y=a*X)
Hardware configuration:
• 4 quad-core Intel Xeon processors (16 cores)
2.27GHz with 32GB DRAM.
• NVIDIA Tesla K20c GPU (Kepler architecture)
Chart: AXPY execution time (s, 0 to 3) vs. vector size (float; 5,000 to 500,000,000) for Sequential, OpenMP FOR (16 threads), HOMP ACC, PGI OpenACC, and HMPP OpenACC.
Software configuration:
• HOMP (GCC 4.4.7 and the CUDA 5.0)
• PGI OpenACC compiler version 13.4
• HMPP OpenACC compiler version 3.3.3
Lawrence Livermore National Laboratory 21
Matrix multiplication
Chart: matrix multiplication execution time (s, 0 to 50) vs. matrix size (float; 128x128 to 4096x4096) for Sequential, OpenMP (16 threads), PGI, HMPP, HOMP, HMPP Collapse, and HOMP Collapse.
*HOMP-collapse slightly outperforms HMPP-collapse when linearization with round-robin scheduling is used.
Lawrence Livermore National Laboratory 22
Jacobi
Chart: Jacobi execution time (s, 0 to 100) vs. matrix size (float; 128x128 to 2048x2048) for Sequential, HMPP, PGI, HOMP, OpenMP, HMPP Collapse, and HOMP Collapse.
Lawrence Livermore National Laboratory 23
HOMP: multiple versions of Jacobi
Chart: Jacobi execution time (s, 0 to 100) vs. matrix size (float; 128x128 to 2048x2048) for the first version, target-data, loop collapse using linearization with static-even scheduling, loop collapse using 2-D mapping (16x16 block), loop collapse using 2-D mapping (8x32 block), and loop collapse using linearization with round-robin scheduling.
Lawrence Livermore National Laboratory 24
Case 1: conclusion
 Many fundamental source-to-source translations
• CUDA kernel generation, loop transformation, data
management, reduction, loop scheduling, etc.
• Reusable across CPU and accelerator
 Quick & competitive results:
• 1st implementation in the world in 2013
— Interactions with IBM, TI and GCC compiler teams
• Early Feedback to OpenMP committee
— Multiple device support
— Per device default loop scheduling policy
— Productivity: combined directives
— New clause: no-middle-sync
— Global barrier: #pragma omp barrier [league|team]
• Extensions to support multiple GPUs in 2015
Lawrence Livermore National Laboratory 25
Case 2: Performance via whole
program autotuning
 Problem: increasingly difficult to conduct
performance optimization for large scale
applications
• Hardware: multicore, GPU, FPGA, Cell, Blue Gene
• Empirical optimization (autotuning):
— small kernels, domain-specific libraries only
 Our objectives:
• Research into autotuning for optimizing large scale
applications
— sequential and parallel
• Develop tools to automate the life-cycle of autotuning
— Outlining, parameterized transformation (loop tiling, unrolling)
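As an illustration of such a parameterized transformation (a hand-written sketch, not actual tool output), tile size T and unroll factor U become the tuning knobs exposed to the search engine:

// Hedged sketch of a parameterized kernel variant: tiling by T, unrolling by U.
// In a generated variant, U would be a compile-time constant so the innermost
// loop is fully unrolled; here it is a runtime parameter for illustration.
void scale_tiled(double* a, int n, double s, int T, int U) {
  for (int ii = 0; ii < n; ii += T) {                 // loop tiling
    int tile_end = (ii + T < n) ? ii + T : n;
    int i = ii;
    for (; i + U <= tile_end; i += U)                 // body repeated U times per step
      for (int u = 0; u < U; u++)
        a[i + u] *= s;
    for (; i < tile_end; i++)                         // remainder iterations
      a[i] *= s;
  }
}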
Lawrence Livermore National Laboratory 26
An end-to-end whole program
autotuning system
Diagram: a performance tool profiles the application and, through a tool interface, supplies performance info to code triage, which identifies focus kernels; kernel extraction produces the kernels; parameterized translation generates kernel variants over a search space explored by a search engine. Kernel extraction and parameterized translation are among the ROSE-based components.
Lawrence Livermore National Laboratory 27
Kernel extraction: outlining
 Outlining:
• Form a function from a code segment (the outlining target)
• Replace the code segment with a call to the new function (the outlined function)
 A critical step to the success of an end-to-end
autotuning system
• Feasibility:
— whole-program tuning -> kernel tuning
• Relevance
— Ensure semantic correctness
— Preserve performance characteristics
Lawrence Livermore National Laboratory 28
Example: Jacobi using a classic
algorithm
// .... some code is omitted
void OUT__1__4027__(int *ip__, int *jp__, double omega,
double *errorp__, double *residp__, double ax, double ay,
double b)
{
for (*ip__=1;*ip__<(n-1);(*ip__)++)
for (*jp__=1;*jp__<(m-1);(*jp__)++)
{
*residp__ = (ax * (uold[*ip__-1][*jp__] +
uold[*ip__+1][*jp__]) + ay * (uold[*ip__][*jp__-1] +
uold[*ip__][*jp__+1]) + b * uold[*ip__][*jp__] -
f[*ip__][*jp__])/b;
u[*ip__][*jp__] = uold[*ip__][*jp__] - omega * (*residp__);
*errorp__ = *errorp__ + (*residp__) * (*residp__);
}
}
void main()
{
int i,j;
double omega,error,resid,ax,ay,b;
//...
OUT__1__4027__(&i,&j,omega,&error,&resid,ax,ay,b);
error = sqrt(error)/(n*m);
//...
}
int n,m;
double
u[MSIZE][MSIZE],f[MSIZE][MSIZE],uold[MSIZE][MSIZE];
void main()
{
int i,j;
double omega,error,resid,ax,ay,b;
// initialization code omitted here
// ...
for (i=1;i<(n-1);i++)
for (j=1;j<(m-1);j++) {
resid = (ax * (uold[i-1][j] + uold[i+1][j])+ ay * (uold[i][j-1]
+ uold[i][j+1])+ b * uold[i][j] - f[i][j])/b;
u[i][j] = uold[i][j] - omega * resid;
error = error + resid*resid;
}
error = sqrt(error)/(n*m);
// verification code omitted here ...
}
• Classic algorithm: all variables become parameters; written variables are accessed via pointer dereferences
• Problem: 8 parameters & pointer dereferences for the written variables in C
• Disturbs performance
• Impedes compiler analysis
Lawrence Livermore National Laboratory 29
Our effective outlining algorithm
 Rely on source level code analysis
• Scope information: global, local, or in-between
• Liveness analysis: whether a variable’s value at a given point will be used in the future (i.e., transferred to a future use)
• Side-effect: variables being read or written
 Decide two things about parameters
• Minimum number of parameters to be passed
• Minimum pass-by-reference parameters
Parameters = (AllVars − InnerVars − GlobalVars − NamespaceVars) ∩ (LiveInVars ∪ LiveOutVars)
PassByRefParameters = Parameters ∩ ((ModifiedVars ∩ LiveOutVars) ∪ ArrayVars)
Lawrence Livermore National Laboratory 30
Eliminating pointer dereferences
 Classic algorithms handle pass-by-reference variables via pointer dereferences
 We use a novel method: variable cloning
• Check if such a variable is used by address:
— C: &x; C++: T & y=x; or foo(x) when foo(T&)
• Use a clone variable if x is NOT used by address
CloneCandidates = PassByRefParameters ∩ PointerDereferencedVars
CloneVars = (CloneCandidates − UseByAddressVars) ∩ AssignableVars
CloneVarsToInit = CloneVars ∩ LiveInVars
CloneVarsToSave = CloneVars ∩ LiveOutVars
Lawrence Livermore National Laboratory 31
Classic vs. effective outlining
Classic algorithm Our effective algorithm
void OUT__1__4027__(int *ip__, int *jp__, double
omega, double *errorp__, double *residp__,
double ax, double ay, double b)
{
for (*ip__=1;*ip__<(n-1);(*ip__)++)
for (*jp__=1;*jp__<(m-1);(*jp__)++)
{
*residp__ = (ax * (uold[*ip__-1][*jp__] +
uold[*ip__+1][*jp__]) + ay * (uold[*ip__][*jp__-1] +
uold[*ip__][*jp__+1]) + b * uold[*ip__][*jp__] -
f[*ip__][*jp__])/b;
u[*ip__][*jp__] = uold[*ip__][*jp__] - omega *
(*residp__);
*errorp__ = *errorp__ + (*residp__) *
(*residp__);
}
}
void OUT__1__5058__(double omega,double
*errorp__, double ax, double ay, double b)
{
int i, j; /* neither live-in nor live-out*/
double resid ; /* neither live-in nor live-out */
double error ; /* clone for a live-in and live-out
parameter */
error = * errorp__; /* Initialize the clone*/
for (i = 1; i < (n - 1); i++)
for (j = 1; j < (m - 1); j++) {
resid = (ax * (uold[i - 1][j] + uold[i + 1][j]) +
ay * (uold[i][j - 1] + uold[i][j + 1]) +
b * uold[i][j] - f[i][j]) / b;
u[i][j] = uold[i][j] - omega * resid;
error = error + resid * resid;
}
*errorp__ = error; /* Save value of the clone*/
}
Lawrence Livermore National Laboratory 32
Different outlining methods:
performance impact
Chart: Jacobi (500x500) execution time (seconds, 0 to 45) at optimization levels O0, O2, and O3 for v0-Original, v1-Pointer dereference, v2-C++ Reference, and v3-Variable Clone.
Lawrence Livermore National Laboratory 33
Whole program autotuning example
The Outlined SMG 2000 kernel (simplified version)
for (si = 0; si < stencil_size; si++)
for (k = 0; k < hypre__mz; k++)
for (j = 0; j < hypre__my; j++)
for (i = 0; i < hypre__mx; i++)
rp[ri + i + j * hypre__sy3 + k * hypre__sz3] -=
Ap_0[i + j * hypre__sy1 + k * hypre__sz1 + A->data_indices[m][si]] *
xp_0[i + j * hypre__sy2 + k * hypre__sz2 + dxp_s[si]];
 SMG (semicoarsing multigrid solver) 2000
• 28k line C code, stencil computation
• a kernel accounting for ~45% of execution time on a 120x120x129 data set
 HPCToolkit (Rice), the GCO search engine (UTK)
Lawrence Livermore National Laboratory 34
Results
                 icc -O2    Seq. (0,8,0)    OMP (6,0,0)
Kernel           18.6s      13.02s          3.35s
Kernel speedup   N/A        1.43x           5.55x
Whole app.       35.11s     29.71s          19.90s
Total speedup    N/A        1.18x           1.76x
 Sequential: ~40 hours for 14,400 points; best point uses loop order (k,j,si,i)
 OpenMP: ~3 hours for <992 points; best point uses 6 threads
Search space specifications:
• Sequential: tiling size (k,j,i) [0, 55, 5]; interchange (all levels) [0, 23, 1]; unrolling (innermost) [0, 49, 1]
• OpenMP: #threads [1, 8, 1]; schedule policy [0, 3, 1]; chunk size [0, 60, 2]
Lawrence Livermore National Laboratory 35
Case 2 conclusion
 Performance optimization via whole program
tuning
• Key components: outliner, parameterized
transformation (skipped in this talk)
• ROSE outliner: improved algorithm based on source
code analysis
— Preserve performance characteristics
— Facilitate other tools to further process the kernels
 Difficult to automate
• Code triage: choosing the right scope of a kernel and its specification
• Search space: the optimizations/configurations to explore for each kernel
Lawrence Livermore National Laboratory 36
Case 3: resilience via source-level
redundancy
 Exascale systems are expected to use a large number of components running at low power
• Exa: 10^18 FLOPS / 2 GHz (~10^9 FLOPS) per core → ~10^9 cores
• Increased sensitivity to external events
• Transient faults caused by external events
— Faults which don’t stop execution but silently corrupt results.
— Compared to permanent faults
 Streamlined and simple processors
• Cannot afford pure hardware-based resilience
• Solution: software-implemented hardware fault
tolerance
Lawrence Livermore National Laboratory 37
Approach: compiler-based
transformations to add resilience
 Source code annotation(pragmas)
• What to protect
• What to do when things go wrong
 Source-to-source translator
• Fault detection: Triple-Modular
Redundancy (TMR)
• Implement fault-handling policies
• Vendor compiler: binary executable
• Intel PIN: fault injection
 Challenges
• Overhead
• Vendor compiler optimizations
Workflow: annotated source code → ROSE::FTTransform → transformed source code → vendor compiler → binary executable → fault injection (Intel PIN)
Lawrence Livermore National Laboratory 38
Naïve example
Policy: Final-wish
Pragma:
  #pragma FT-FW (Majority-voting)
  Y = …;
Semantics:
  // Triple-modular redundancy
  y[0] = . . . ;
  y[1] = . . . ;
  y[2] = . . . ;
  // Fault detection
  if ( !EQUALS( y[0], y[1], y[2] ) )
  {
    // Fault handling
    Y = Voting( y[0], y[1], y[2] );
  }
  else
    Y = y[0];
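A concrete, hedged sketch of the code such a transformation could produce for one protected statement (ft_equals and ft_vote are illustrative stand-ins for the EQUALS/Voting helpers, not the actual runtime):

// Hedged sketch of the final-wish (majority-voting) policy applied to the
// protected statement "Y = a*x + b;".
static bool ft_equals(double a, double b, double c) { return a == b && b == c; }

static double ft_vote(double a, double b, double c) {
  if (a == b || a == c) return a;   // a agrees with another copy
  if (b == c) return b;             // b and c agree
  return a;                         // no majority: fall back to the first copy
}

double protected_assign(double a, double x, double b) {
  double y[3], Y;
  y[0] = a * x + b;                 // triple-modular redundant execution
  y[1] = a * x + b;
  y[2] = a * x + b;
  if (!ft_equals(y[0], y[1], y[2]))
    Y = ft_vote(y[0], y[1], y[2]);  // fault detected: majority voting
  else
    Y = y[0];
  return Y;
}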
Lawrence Livermore National Laboratory 39
Source code redundancy:
reduce overhead
Idea: rely on a baseline of double modular redundancy (DMR) to help reduce overhead (the slide highlights the statement to be protected).
Lawrence Livermore National Laboratory 40
Source code redundancy: bypass
compiler optimizations
Idea: use an extra pointer to help preserve source-code redundancy across compiler optimizations (the slide highlights the statement to be protected).
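A hedged sketch of that idea (names and the guarded statement are illustrative; the actual transformation differs): the duplicate computation is written through an extra pointer so the optimizer is less likely to merge it with the original statement.

// Hedged sketch: preserve source-level redundancy across optimization by
// routing the duplicate computation through an extra pointer.
double redundant_update(double uold, double f, double b, double omega) {
  double resid = (uold * b - f) / b;        // original (protected) statement
  double resid_copy = 0.0;
  double* p = &resid_copy;                  // extra pointer hides the duplicate
  *p = (uold * b - f) / b;                  // redundant copy of the statement
  if (resid != *p) {
    // fault detected: handle according to the chosen policy (e.g. recompute)
  }
  return uold - omega * resid;
}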
Lawrence Livermore National Laboratory 41
Results: necessity and effectiveness
of optimizer-proof code redundancy
Transformation method            PAPI_FP_INS (O1)   PAPI_FP_INS (O2)   PAPI_FP_INS (O3)
Our method of using pointers     Doubled            Doubled            Doubled
Naïve duplication                No change          No change          No change
 Hardware counters check whether the redundant computation survives compiler optimizations
• PAPI (PAPI_FP_INS) collects the number of floating-point instructions for the protected kernel
• GCC 4.3.4, O1 to O3
Lawrence Livermore National Laboratory 42
Results: overhead
 Jacobi kernel:
• Overhead: 0% to 30%
• Minimum overhead
— Stride=8
— Array size 16Kx16K
— Double precision
• The more latency the original code has, the lower the relative overhead of the added redundancy
• Livermore kernel
— Kernel 1 (Hydro fragment) – 20%
— Kernel 4 (Banded linear equations) – 40%
— Kernel 5 (Tri-diagonal elimination) – 26%
— Kernel 11 (First sum) – 2%
Lawrence Livermore National Laboratory 43
Conclusion
 HPC challenges
• Traditional: performance, productivity, portability
• Newer: Heterogeneity, resilience, power efficiency
• Made much worse by increasing complexity of computation node
architectures
 Source-to-source compilation techniques
• Complement traditional compilers aimed at generating machine code
— Source code analysis, transformation, optimizations, generation, …
• High-level Programming models for GPUs
• Performance tuning: code extraction and parameterized code
translation/generation
• Resilience via source level redundancy
Lawrence Livermore National Laboratory 44
Future directions
 Accelerator optimizations
• Newer accelerator features: Direct GPU to GPU communication,
Peer-to-Peer execution model, Uniform Virtual Memory
• CPU threads vs. accelerator threads vs. vectorization
 Knowledge-driven compilation and runtime adaptation
• Explicitly represent optimization knowledge and automatically apply it in HPC
 A uniform directive-based programming model
• A single X instead of MPI+X to cover both on-node and across
node parallelism
 Thread level resilience
• Most resilience work focuses on MPI, not threading
Lawrence Livermore National Laboratory 45
Acknowledgement
 Sponsors
• LLNL (LDRD, ASC), DOE (SC), DOD
 Colleagues
• Dan Quinlan (ROSE project lead)
• Pei-Hung Lin (Postdoc)
 Collaborators
• Mary Hall (University of Utah)
• Qing Yi (University of Colorado at Colorado Springs)
• Robert Lucas (University of Southern California-ISI)
• Sally A. McKee (Chalmers University of Technology)
• Stephen Guzik (Colorado State University)
• Xipeng Shen (North Carolina State University)
• Yonghong Yan (Oakland University)
• ….
Weitere ähnliche Inhalte

Was ist angesagt?

Something About Dynamic Linking
Something About Dynamic LinkingSomething About Dynamic Linking
Something About Dynamic LinkingWang Hsiangkai
 
Return oriented programming
Return oriented programmingReturn oriented programming
Return oriented programminghybr1s
 
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Mingliang Liu
 
Yacc (yet another compiler compiler)
Yacc (yet another compiler compiler)Yacc (yet another compiler compiler)
Yacc (yet another compiler compiler)omercomail
 
Lecture 14 run time environment
Lecture 14 run time environmentLecture 14 run time environment
Lecture 14 run time environmentIffat Anjum
 
eBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current TechniqueseBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current TechniquesNetronome
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsIntel® Software
 
FISL XIV - The ELF File Format and the Linux Loader
FISL XIV - The ELF File Format and the Linux LoaderFISL XIV - The ELF File Format and the Linux Loader
FISL XIV - The ELF File Format and the Linux LoaderJohn Tortugo
 
Demystify eBPF JIT Compiler
Demystify eBPF JIT CompilerDemystify eBPF JIT Compiler
Demystify eBPF JIT CompilerNetronome
 
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
FARIS: Fast and Memory-efficient URL Filter by Domain Specific MachineFARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
FARIS: Fast and Memory-efficient URL Filter by Domain Specific MachineYuuki Takano
 
06 - ELF format, knowing your friend
06 - ELF format, knowing your friend06 - ELF format, knowing your friend
06 - ELF format, knowing your friendAlexandre Moneger
 
Dynamic Linker
Dynamic LinkerDynamic Linker
Dynamic LinkerSanjiv Malik
 
Yacc lex
Yacc lexYacc lex
Yacc lex915086731
 
Introduction to D programming language at Weka.IO
Introduction to D programming language at Weka.IOIntroduction to D programming language at Weka.IO
Introduction to D programming language at Weka.IOLiran Zvibel
 
eBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureeBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureNetronome
 
Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)Varun Mahajan
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18Yaoqing Gao
 
Scala-Gopher: CSP-style programming techniques with idiomatic Scala.
Scala-Gopher: CSP-style programming techniques with idiomatic Scala.Scala-Gopher: CSP-style programming techniques with idiomatic Scala.
Scala-Gopher: CSP-style programming techniques with idiomatic Scala.Ruslan Shevchenko
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulThomas Wuerthinger
 
Return oriented programming (ROP)
Return oriented programming (ROP)Return oriented programming (ROP)
Return oriented programming (ROP)Pipat Methavanitpong
 

Was ist angesagt? (20)

Something About Dynamic Linking
Something About Dynamic LinkingSomething About Dynamic Linking
Something About Dynamic Linking
 
Return oriented programming
Return oriented programmingReturn oriented programming
Return oriented programming
 
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
 
Yacc (yet another compiler compiler)
Yacc (yet another compiler compiler)Yacc (yet another compiler compiler)
Yacc (yet another compiler compiler)
 
Lecture 14 run time environment
Lecture 14 run time environmentLecture 14 run time environment
Lecture 14 run time environment
 
eBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current TechniqueseBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current Techniques
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
FISL XIV - The ELF File Format and the Linux Loader
FISL XIV - The ELF File Format and the Linux LoaderFISL XIV - The ELF File Format and the Linux Loader
FISL XIV - The ELF File Format and the Linux Loader
 
Demystify eBPF JIT Compiler
Demystify eBPF JIT CompilerDemystify eBPF JIT Compiler
Demystify eBPF JIT Compiler
 
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
FARIS: Fast and Memory-efficient URL Filter by Domain Specific MachineFARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
 
06 - ELF format, knowing your friend
06 - ELF format, knowing your friend06 - ELF format, knowing your friend
06 - ELF format, knowing your friend
 
Dynamic Linker
Dynamic LinkerDynamic Linker
Dynamic Linker
 
Yacc lex
Yacc lexYacc lex
Yacc lex
 
Introduction to D programming language at Weka.IO
Introduction to D programming language at Weka.IOIntroduction to D programming language at Weka.IO
Introduction to D programming language at Weka.IO
 
eBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureeBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging Infrastructure
 
Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18
 
Scala-Gopher: CSP-style programming techniques with idiomatic Scala.
Scala-Gopher: CSP-style programming techniques with idiomatic Scala.Scala-Gopher: CSP-style programming techniques with idiomatic Scala.
Scala-Gopher: CSP-style programming techniques with idiomatic Scala.
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered Harmful
 
Return oriented programming (ROP)
Return oriented programming (ROP)Return oriented programming (ROP)
Return oriented programming (ROP)
 

Ähnlich wie A Source-To-Source Approach to HPC Challenges

Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsUnai Lopez-Novoa
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...KRamasamy2
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...VAISHNAVI MADHAN
 
Implement Runtime Environments for HSA using LLVM
Implement Runtime Environments for HSA using LLVMImplement Runtime Environments for HSA using LLVM
Implement Runtime Environments for HSA using LLVMNational Cheng Kung University
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLinaro
 
Preparing OpenSHMEM for Exascale
Preparing OpenSHMEM for ExascalePreparing OpenSHMEM for Exascale
Preparing OpenSHMEM for Exascaleinside-BigData.com
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU ComputingGlobalLogic Ukraine
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
PEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCPEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCHimanshu Bedi
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaopenseesdays
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Subbu Rama
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsDataWorks Summit
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Scientific Computing - Hardware
Scientific Computing - HardwareScientific Computing - Hardware
Scientific Computing - Hardwarejalle6
 

Ähnlich wie A Source-To-Source Approach to HPC Challenges (20)

Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern Coprocessors
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 
Implement Runtime Environments for HSA using LLVM
Implement Runtime Environments for HSA using LLVMImplement Runtime Environments for HSA using LLVM
Implement Runtime Environments for HSA using LLVM
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project Update
 
Preparing OpenSHMEM for Exascale
Preparing OpenSHMEM for ExascalePreparing OpenSHMEM for Exascale
Preparing OpenSHMEM for Exascale
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
PEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCPEARC 17: Spark On the ARC
PEARC 17: Spark On the ARC
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source Toolkits
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Disco workshop
Disco workshopDisco workshop
Disco workshop
 
Scientific Computing - Hardware
Scientific Computing - HardwareScientific Computing - Hardware
Scientific Computing - Hardware
 

KĂźrzlich hochgeladen

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfWilly Marroquin (WillyDevNET)
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

KĂźrzlich hochgeladen (20)

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 

A Source-To-Source Approach to HPC Challenges

  • 1. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC A Source-To-Source Approach to HPC Challenges LLNL-PRES- 668361
  • 2. Lawrence Livermore National Laboratory 2 Agenda  Background • HPC challenges • Source-to-source compilers  Source-to-source compilers to address HPC challenges • Programming models for heterogeneous computing • Performance via autotuning • Resilience via source level redundancy  Conclusion  Future Interests
  • 3. Lawrence Livermore National Laboratory 3 HPC Challenges  HPC: High Performance Computing • DOE applications: climate modeling, combustion, fusion energy, biology, material science, etc. • Network connected (clustered) computation nodes • Increasing node complexity  Challenges • Parallelism, performance, productivity, portability • Heterogeneity, power efficiency, resilience J. A. Ang, et al. Abstract machine models and proxy architectures for exascale (1019 FLOPS) computing. Co-HPC '14
  • 4. Lawrence Livermore National Laboratory 4 Compilers: traditional vs. source-to- source Source code Frontend (parsing) Middle End (analysis& transformation& optimizations) Backend (code generation) Machine code Source code Frontend (parsing) Middle End (analysis& transformation& optimizations) Backend (unparsing) Source code
  • 5. Lawrence Livermore National Laboratory 5 Why Source-to-Source? Traditional Compilers (GCC, ICC, LLVM..) Source-to-source compilers (ROSE, Cetus, ..) Goal Use compiler technology to generate fast & efficient executables for a target platform Make compiler technology accessible, by enabling users to build customized source code analysis and transformation compilers/tools Users Programmers Researchers and developers in compiler, tools, and application communities. Input Programs in multiple languages Programs in multiple languages Internal representation Multiple levels of Intermediate representations, losing source code structure and information Abstract Syntax Tree, keeping all source level structure and information (comments, preprocessing info. etc), extensible. Analyses & optimizations Internal use only, black box to users Individually exposed to users, composable and customable Output Machine executables, hard to read Human-readable, compilable source code Portability Multiple code generators for multiple platforms Single code generator for all platforms, leveraging traditional backend compilers
  • 6. Lawrence Livermore National Laboratory 6 ROSE Compiler Infrastructure C/C++/Fortran OpenMP/UPC Source Code Analyzed/ Transformed Source EDG Front-end/ Open Fortran Parser IR (AST) Unparser ROSE compiler infrastructure 2009 Winner www.roseCompiler.org Control Flow ROSE-based source-to-source tools System Dependence Data Dependence Backend Compiler Executable
  • 7. Lawrence Livermore National Laboratory 7 ROSE IR = AST + symbol tables + interface Simplified AST without symbol tables int main() { int a[10]; for(int i=0;i<10;i++) a[i]=i*i; return 0; } ROSE AST(abstract syntax tree) • High level: preserve all source level details • Easy to use: rich set of interface functions for analysis and transformation • Extensible: user defined persistent attributes
  • 8. Lawrence Livermore National Laboratory 8 Using ROSE to help address HPC challenges  Parallel programming model research • MPI, OpenMP, OpenACC, UPC, Co-Array Fortran, etc. • Domain-specific languages, etc.  Program optimization/translation tools • Automatic parallelization to migrate legacy codes • Outlining and parameterized code generation to support auto tuning • Resilience via code redundancy • Application skeleton extraction for Co-Design • Reverse computation for discrete event simulation • Bug seeding for testing  Program analysis tools • Programming styles and coding standard enforcement • Error checking for serial, MPI and OpenMP programs • Worst execution time analysis
  • 9. Lawrence Livermore National Laboratory 9 Case 1: high-level programming models for heterogeneous computing  Problem • Heterogeneity challenge of exascale • NaĂŻve approach: CPU code+ accelerator code + data transferring in between  Approach • Develop high-level, directive based programming models — Automatically generate low level accelerator code • Prototype and extend the OpenMP 4.0 accelerator model — Result: HOMP (Heterogeneous OpenMP) compiler J. A. Ang, et al. Abstract machine models and proxy architectures for exascale (1019 FLOPS) computing. Co-HPC '14
  • 10. Lawrence Livermore National Laboratory 10 Programming models bridge algorithms and machines and are implemented through components of software stack Measures of success: • Expressiveness • Performance • Programmability • Portability • Efficiency •… Language Compiler Library Algorithm Application Abstract Machine Executable Real Machine Programming Model Express Execute Compile/link … Software Stack
  • 11. Lawrence Livermore National Laboratory 11 Parallel programming models are built on top of sequential ones and use a combination of language/compiler/library support CPU MemoryAbstract Machine (overly simplified) CPU Shared Memory CPU CPU Memory CPU Memory Interconnect … Programming Model Sequential Parallel Shared Memory (e.g. OpenMP) Distributed Memory (e.g. MPI) … Software Stack General purpose Languages (GPL) C/C++/Fortran Sequential Compiler Optional Seq. Libs GPL + Directives Seq. Compiler + OpenMP support OpenMP Runtime Lib GPL + Call to MPI libs Seq. Compiler MPI library
  • 12. Lawrence Livermore National Laboratory 12 OpenMP Code #include <stdio.h> #ifdef _OPENMP #include <omp.h> #endif int main(void) { #pragma omp parallel printf("Hello,world! n"); return 0; } … //PI calculation #pragma omp parallel for reduction (+:sum) private (x) for (i=1;i<=num_steps;i++) { x=(i-0.5)*step; sum=sum+ 4.0/(1.0+x*x); } pi=step*sum; .. Programmer compiler Runtime library OS threads
  • 13. Lawrence Livermore National Laboratory 13 Shared storage OpenMP 4.0’s accelerator model  Device: a logical execution engine • Host device: where OpenMP program begins, one only • Target devices: 1 or more accelerators  Memory model • Host data environment: one • Device data environment: one or more • Allow shared host and device memory  Execution model: Host-centric • Host device : “offloads” code regions and data to target devices • Target devices: fork-join model • Host waits until devices finish • Host executes device regions if no accelerators are available /supported Host Device Proc. Proc. Target Device Proc. Proc. OpenMP 4.0 Abstract Machine Host Memory Device Memory … … … …
  • 14. Lawrence Livermore National Laboratory 14 Computation and data offloading  Directive: #pragma omp target device(id) map() if() • target: create a device data environment and offload computation to be run in sequential on the same device • device (int_exp): specify a target device • map(to|from|tofrom|alloc:var_list) : data mapping between the current task’s data environment and a device data environment  Directive: #pragma omp target data device(id) map() if() • Create a device data environment omp target CPU thread omp parallel Accelerator threads CPU thread
  • 15. Lawrence Livermore National Laboratory 15 Example code: Jacobi while ((k<=mits)&&(error>tol)) { // a loop copying u[][] to uold[][] is omitted here … #pragma omp parallel for private(resid,j,i) reduction(+:error) for (i=1;i<(n-1);i++) for (j=1;j<(m-1);j++) { resid = (ax*(uold[i-1][j] + uold[i+1][j]) + ay*(uold[i][j-1] + uold[i][j+1])+ b * uold[i][j] - f[i][j])/b; u[i][j] = uold[i][j] - omega * resid; error = error + resid*resid ; } // the rest code omitted ... } #pragma omp target data device (gpu0) map(to:n, m, omega, ax, ay, b, f[0:n][0:m]) map(tofrom:u[0:n][0:m]) map(alloc:uold[0:n][0:m]) #pragma omp target device(gpu0) map(to:n, m, omega, ax, ay, b, f[0:n][0:m], uold[0:n][0:m]) map(tofrom:u[0:n][0:m])
  • 16. Lawrence Livermore National Laboratory 16 Building the accelerator support  Language: OpenMP 4.0 directives and clauses • target, device, map, collapse, target data, …  Compiler: • CUDA kernel generation: outliner • CPU code generation: kernel launching • Loop mapping (1-D vs. 2-D), loop transformation (normalize, collapse) • Array linearization  Runtime: • Device probing, configure execution settings • Loop scheduler: static even, round-robin • Memory management : data, allocation, copy, deallocation, reuse • Reduction: contiguous reduction pattern (folding)
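As an illustration of two of the transformations listed above (loop collapse and array linearization), a doubly nested loop over an n x m array can be rewritten as a single loop over a flat index, which is then mapped to CUDA threads; this is a hand-written sketch under those assumptions, not the exact ROSE output (slide 18 shows the real generated kernel):

/* Sketch: collapse a 2-D loop nest and linearize the 2-D array access. */
void collapse_and_linearize(float *u_flat, int n, int m)   /* u_flat = &u[0][0] */
{
    long idx;
    for (idx = 0; idx < (long)n * m; idx++) {   /* one collapsed iteration space */
        int i = (int)(idx / m);                 /* recovered row index */
        int j = (int)(idx % m);                 /* recovered column index */
        u_flat[i * m + j] = (float)(i + j);     /* placeholder element computation */
    }
}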
  • 17. Lawrence Livermore National Laboratory 17 Example code: Jacobi using OpenMP accelerator directives while ((k<=mits)&&(error>tol)) { // a loop copying u[][] to uold[][] is omitted here … #pragma omp parallel for private(resid,j,i) reduction(+:error) for (i=1;i<(n-1);i++) for (j=1;j<(m-1);j++) { resid = (ax*(uold[i-1][j] + uold[i+1][j]) + ay*(uold[i][j-1] + uold[i][j+1])+ b * uold[i][j] - f[i][j])/b; u[i][j] = uold[i][j] - omega * resid; error = error + resid*resid ; } // the rest code omitted ... } #pragma omp target data device (gpu0) map(to:n, m, omega, ax, ay, b, f[0:n][0:m]) map(tofrom:u[0:n][0:m]) map(alloc:uold[0:n][0:m]) #pragma omp target device(gpu0) map(to:n, m, omega, ax, ay, b, f[0:n][0:m], uold[0:n][0:m]) map(tofrom:u[0:n][0:m])
  • 18. Lawrence Livermore National Laboratory 18 GPU kernel generation: outliner, scheduler, array linearization, reduction 1. __global__ void OUT__1__10117__(int n, int m, float omega, float ax, float ay, float b, 2. float *_dev_per_block_error, float *_dev_u, float *_dev_f, float *_dev_uold) 3. { /* local variables for loop , reduction, etc */ 4. int _p_j; float _p_error; _p_error = 0; float _p_resid; 5. int _dev_i, _dev_lower, _dev_upper; 6. /* Static even scheduling: obtain loop bounds for current thread of current block */ 7. XOMP_accelerator_loop_default (1, n-2, 1, &_dev_lower, &_dev_upper); 8. for (_dev_i = _dev_lower; _dev_i<= _dev_upper; _dev_i ++) { 9. for (_p_j = 1; _p_j < (m - 1); _p_j++) { /* replace with device variables, linearize 2-D array accesses */ 10. _p_resid = (((((ax * (_dev_uold[(_dev_i - 1) * 512 + _p_j] + _dev_uold[(_dev_i + 1) * 512 + _p_j])) 11. + (ay * (_dev_uold[_dev_i * 512 + (_p_j - 1)] + _dev_uold[_dev_i * 512 + (_p_j + 1)]))) 12. + (b * _dev_uold[_dev_i * 512 + _p_j])) - _dev_f[_dev_i * 512 + _p_j]) / b); 13. _dev_u[_dev_i * 512 + _p_j] = (_dev_uold[_dev_i * 512 + _p_j] - (omega * _p_resid)); 14. _p_error = (_p_error + (_p_resid * _p_resid)); 15. } } /* thread block level reduction for float type*/ 16. xomp_inner_block_reduction_float(_p_error,_dev_per_block_error,6); 17. }
  • 19. Lawrence Livermore National Laboratory 19 CPU Code: data management, exe configuration, launching, reduction 1. xomp_deviceDataEnvironmentEnter(); /* Initialize a new data environment, push it to a stack */ 2. float *_dev_u = (float*) xomp_deviceDataEnvironmentGetInheritedVariable ((void*)u, _dev_u_size); 3. /* If not inheritable, allocate and register the mapped variable */ 4. if (_dev_u == NULL) 5. { _dev_u = ((float *)(xomp_deviceMalloc(_dev_u_size))); 6. /* Register this mapped variable original address, device address, size, and a copy-back flag */ 7. xomp_deviceDataEnvironmentAddVariable ((void*)u, _dev_u_size, (void*) _dev_u, true); 8. } 9. /* Execution configuration: threads per block and total block numbers */ 10. int _threads_per_block_ = xomp_get_maxThreadsPerBlock(); 11. int _num_blocks_ = xomp_get_max1DBlock((n - 1) - 1); 12. /* Launch the CUDA kernel ... */ 13. OUT__1__10117__<<<_num_blocks_,_threads_per_block_,(_threads_per_block_ * sizeof(float ))>>> 14. (n,m,omega,ax,ay,b,_dev_per_block_error,_dev_u,_dev_f,_dev_uold); 15. error = xomp_beyond_block_reduction_float(_dev_per_block_error,_num_blocks_,6); 16. /* Copy back (optionally) and deallocate variables mapped within this environment, pop stack */ 17. xomp_deviceDataEnvironmentExit();
  • 20. Lawrence Livermore National Laboratory 20 Preliminary results: AXPY (Y=a*X) Hardware configuration: • 4 quad-core Intel Xeon processors (16 cores) at 2.27GHz with 32GB DRAM • NVIDIA Tesla K20c GPU (Kepler architecture) [Chart: AXPY execution time (s) vs. vector size (float, 5000 to 500000000) for Sequential, OpenMP FOR (16 threads), HOMP ACC, PGI OpenACC, and HMPP OpenACC] Software configuration: • HOMP (GCC 4.4.7 and CUDA 5.0) • PGI OpenACC compiler version 13.4 • HMPP OpenACC compiler version 3.3.3
  • 21. Lawrence Livermore National Laboratory 21 Matrix multiplication [Chart: matrix multiplication execution time (s) vs. matrix size (float, 128x128 to 4096x4096) for Sequential, OpenMP (16 threads), PGI, HMPP, HOMP, HMPP Collapse, and HOMP Collapse] *HOMP-collapse slightly outperforms HMPP-collapse when linearization with round-robin scheduling is used.
  • 22. Lawrence Livermore National Laboratory 22 Jacobi [Chart: Jacobi execution time (s) vs. matrix size (float, 128x128 to 2048x2048) for Sequential, HMPP, PGI, HOMP, OpenMP, HMPP Collapse, and HOMP Collapse]
  • 23. Lawrence Livermore National Laboratory 23 HOMP: multiple versions of Jacobi [Chart: Jacobi execution time (s) vs. matrix size (float, 128x128 to 2048x2048) for the first version; target-data; loop collapse using linearization with static-even scheduling; loop collapse using 2-D mapping (16x16 block); loop collapse using 2-D mapping (8x32 block); and loop collapse using linearization with round-robin scheduling]
  • 24. Lawrence Livermore National Laboratory 24 Case 1: conclusion  Many fundamental source-to-source translations • CUDA kernel generation, loop transformation, data management, reduction, loop scheduling, etc. • Reusable across CPUs and accelerators  Quick & competitive results: • 1st implementation in the world in 2013 — Interactions with IBM, TI and GCC compiler teams • Early feedback to the OpenMP committee — Multiple device support — Per-device default loop scheduling policy — Productivity: combined directives — New clause: no-middle-sync — Global barrier: #pragma omp barrier [league|team] • Extensions to support multiple GPUs in 2015
  • 25. Lawrence Livermore National Laboratory 25 Case 2: Performance via whole program autotuning  Problem: increasingly difficult to conduct performance optimization for large scale applications • Hardware: multicore, GPU, FPGA, Cell, Blue Gene • Empirical optimization (autotuning): — small kernels, domain-specific libraries only  Our objectives: • Research into autotuning for optimizing large scale applications — sequential and parallel • Develop tools to automate the life-cycle of autotuning — Outlining, parameterized transformation (loop tiling, unrolling)
  • 26. Lawrence Livermore National Laboratory 26 An end-to-end whole program autotuning system [Diagram: application → performance tool → tool interface → code triage (using performance info.) → kernel extraction (focus kernels) → parameterized translation (kernel variants) → search engine (search space); kernel extraction and parameterized translation are the ROSE-based components]
  • 27. Lawrence Livermore National Laboratory 27 Kernel extraction: outlining  Outlining:  Form a function from a code segment (outlining target)  Replace the code segment with a call to the function (outlined function)  A critical step to the success of an end-to-end autotuning system • Feasibility: — whole-program tuning -> kernel tuning • Relevance — Ensure semantic correctness — Preserve performance characteristics
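A toy illustration of these two steps (not taken from the slides; the next slide shows the real Jacobi example):

/* Before outlining: the loop below is the outlining target inside caller(). */
/*     for (i = 0; i < n; i++) sum += a[i];                                  */

/* After outlining (sketch): the target becomes a function, and the original
   code segment is replaced by a call to it.                                 */
static void OUT__sum__(const double *a, int n, double *sump)
{
    int i;
    for (i = 0; i < n; i++)
        *sump += a[i];
}

void caller(const double *a, int n)
{
    double sum = 0.0;
    OUT__sum__(a, n, &sum);    /* replaces the original loop */
    (void)sum;
}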
  • 28. Lawrence Livermore National Laboratory 28 Example: Jacobi using a classic algorithm // .... some code is omitted void OUT__1__4027__(int *ip__, int *jp__, double omega, double *errorp__, double *residp__, double ax, double ay, double b) { for (*ip__=1;*ip__<(n-1);(*ip__)++) for (*jp__=1;*jp__<(m-1);(*jp__)++) { *residp__ = (ax * (uold[*ip__-1][*jp__] + uold[*ip__+1][*jp__]) + ay * (uold[*ip__][*jp__-1] + uold[*ip__][*jp__+1]) + b * uold[*ip__][*jp__] - f[*ip__][*jp__])/b; u[*ip__][*jp__] = uold[*ip__][*jp__] - omega * (*residp__); *errorp__ = *errorp__ + (*residp__) * (*residp__); } } void main() { int i,j; double omega,error,resid,ax,ay,b; //... OUT__1__4027__(&i,&j,omega,&error,&resid,ax,ay,b); error = sqrt(error)/(n*m); //... } int n,m; double u[MSIZE][MSIZE],f[MSIZE][MSIZE],uold[MSIZE][MSIZE]; void main() { int i,j; double omega,error,resid,ax,ay,b; // initialization code omitted here // ... for (i=1;i<(n-1);i++) for (j=1;j<(m-1);j++) { resid = (ax * (uold[i-1][j] + uold[i+1][j])+ ay * (uold[i][j-1] + uold[i][j+1])+ b * uold[i][j] - f[i][j])/b; u[i][j] = uold[i][j] - omega * resid; error = error + resid*resid; } error = sqrt(error)/(n*m); // verification code omitted here ... } • Algorithm: all variables are made parameters, written variables used via pointer dereferencing • Problem: 8 parameters & pointer dereferences for written variables in C • Disturb performance • Impede compiler analysis
  • 29. Lawrence Livermore National Laboratory 29 Our effective outlining algorithm  Rely on source-level code analysis • Scope information: global, local, or in-between • Liveness analysis: at a given point, whether a variable’s value will be used in the future • Side-effect analysis: variables being read or written  Decide two things about parameters • Minimum number of parameters to be passed • Minimum set of pass-by-reference parameters Parameters = (AllVars − InnerVars − GlobalVars − NamespaceVars) ∩ (LiveInVars ∪ LiveOutVars) PassByRefParameters = Parameters ∩ ((ModifiedVars ∩ LiveOutVars) ∪ ArrayVars)
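A rough sketch of how those two set equations could be evaluated over the analysis results; the set names mirror the formulas, but the code is illustrative only and is not the actual ROSE outliner API:

#include <algorithm>
#include <iterator>
#include <set>
#include <string>

using VarSet = std::set<std::string>;

static VarSet diff(const VarSet& a, const VarSet& b) {        // a − b
    VarSet r; std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                                  std::inserter(r, r.begin())); return r;
}
static VarSet unite(const VarSet& a, const VarSet& b) {       // a ∪ b
    VarSet r; std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                             std::inserter(r, r.begin())); return r;
}
static VarSet inter(const VarSet& a, const VarSet& b) {       // a ∩ b
    VarSet r; std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                                    std::inserter(r, r.begin())); return r;
}

// Parameters = (AllVars − InnerVars − GlobalVars − NamespaceVars) ∩ (LiveInVars ∪ LiveOutVars)
VarSet computeParameters(const VarSet& all, const VarSet& inner, const VarSet& globals,
                         const VarSet& nspace, const VarSet& liveIn, const VarSet& liveOut) {
    return inter(diff(diff(diff(all, inner), globals), nspace), unite(liveIn, liveOut));
}

// PassByRefParameters = Parameters ∩ ((ModifiedVars ∩ LiveOutVars) ∪ ArrayVars)
VarSet computePassByRef(const VarSet& params, const VarSet& modified,
                        const VarSet& liveOut, const VarSet& arrays) {
    return inter(params, unite(inter(modified, liveOut), arrays));
}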
  • 30. Lawrence Livermore National Laboratory 30 Eliminating pointer dereferences  Pass-by-reference variables are handled by the classic algorithm via pointer dereferences  We use a novel method: variable cloning • Check if such a variable is used by address: — C: &x; C++: T& y = x; or foo(x) when foo(T&) • Use a clone variable if x is NOT used by address CloneCandidates = PassByRefParameters ∩ PointerDereferencedVars CloneVars = (CloneCandidates − UseByAddressVars) ∩ AssignableVars CloneVarsToInit = CloneVars ∩ LiveInVars CloneVarsToSave = CloneVars ∩ LiveOutVars
  • 31. Lawrence Livermore National Laboratory 31 Classic vs. effective outlining Classic algorithm Our effective algorithm void OUT__1__4027__(int *ip, int *jp, double omega, double *errorp__, double *residp__, double ax, double ay, double b) { for (*ip__=1;*ip__<(n-1);(*ip__)++) for (*jp__=1;*jp__<(m-1);(*jp__)++) { *residp__ = (ax * (uold[*ip__-1][*jp__] + uold[*ip__+1][*jp__]) + ay * (uold[*ip__][*jp__-1] + uold[*ip__][*jp__+1]) + b * uold[*ip__][*jp__] - f[*ip__][*jp__])/b; u[*ip__][*jp__] = uold[*ip__][*jp__] - omega * (*residp__); *errorp__ = *errorp__ + (*residp__) * (*residp__); } } void OUT__1__5058__(double omega,double *errorp__, double ax, double ay, double b) { int i, j; /* neither live-in nor live-out*/ double resid ; /* neither live-in nor live-out */ double error ; /* clone for a live-in and live-out parameter */ error = * errorp__; /* Initialize the clone*/ for (i = 1; i < (n - 1); i++) for (j = 1; j < (m - 1); j++) { resid = (ax * (uold[i - 1][j] + uold[i + 1][j]) + ay * (uold[i][j - 1] + uold[i][j + 1]) + b * uold[i][j] - f[i][j]) / b; u[i][j] = uold[i][j] - omega * resid; error = error + resid * resid; } *errorp__ = error; /* Save value of the clone*/ }
  • 32. Lawrence Livermore National Laboratory 32 Different outlining methods: performance impact [Chart: Jacobi (500x500) execution time (seconds) at optimization levels O0, O2, O3 for v0-Original, v1-Pointer dereference, v2-C++ reference, and v3-Variable clone]
  • 33. Lawrence Livermore National Laboratory 33 Whole program autotuning example The Outlined SMG 2000 kernel (simplified version) for (si = 0; si < stencil_size; si++) for (k = 0; k < hypre__mz; k++) for (j = 0; j < hypre__my; j++) for (i = 0; i < hypre__mx; i++) rp[ri + i + j * hypre__sy3 + k * hypre__sz3] -= Ap_0[i + j * hypre__sy1 + k * hypre__sz1 + A->data_indices[m][si]] * xp_0[i + j * hypre__sy2 + k * hypre__sz2 + dxp_s[si]];  SMG (semicoarsening multigrid solver) 2000 • 28K lines of C code, stencil computation • one kernel accounts for ~45% of execution time for a 120x120x129 data set  HPCToolkit (Rice), the GCO search engine (UTK)
  • 34. Lawrence Livermore National Laboratory 34 Results
  Execution times and speedups (icc –O2 baseline / Seq. tuned (0,8,0) / OMP tuned (6,0,0)):
    Kernel: 18.6s / 13.02s / 3.35s
    Kernel speedup: N/A / 1.43x / 5.55x
    Whole app.: 35.11s / 29.71s / 19.90s
    Total speedup: N/A / 1.18x / 1.76x
   Sequential: ~40 hours for 14,400 points  Best point: (k,j,si,i) order
   OpenMP: ~3 hours for <992 points  Best point: 6 threads
  Search space specifications:
    Sequential: Tiling size (k,j,i) [0, 55, 5]; Interchange (all levels) [0, 23, 1]; Unrolling (innermost) [0, 49, 1]
    OpenMP: #thread [1, 8, 1]; Schedule policy [0, 3, 1]; Chunk size [0, 60, 2]
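To make the search space concrete, here is a hand-written sketch of what one candidate point might look like for the SMG2000 loop on the previous slide: loops interchanged to the (k,j,si,i) order reported as best and the outer loop parallelized with 6 OpenMP threads under a static schedule. The real variants are emitted by the parameterized translator rather than written by hand, the tile and unroll factors are omitted here, and the fragment reuses the variables of the original kernel:

/* Sketch: one candidate variant of the outlined SMG2000 kernel (assumed context). */
#pragma omp parallel for num_threads(6) schedule(static) private(j, si, i)
for (k = 0; k < hypre__mz; k++)
  for (j = 0; j < hypre__my; j++)
    for (si = 0; si < stencil_size; si++)
      for (i = 0; i < hypre__mx; i++)
        rp[ri + i + j * hypre__sy3 + k * hypre__sz3] -=
          Ap_0[i + j * hypre__sy1 + k * hypre__sz1 + A->data_indices[m][si]] *
          xp_0[i + j * hypre__sy2 + k * hypre__sz2 + dxp_s[si]];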
  • 35. Lawrence Livermore National Laboratory 35 Case 2 conclusion  Performance optimization via whole program tuning • Key components: outliner, parameterized transformation (skipped in this talk) • ROSE outliner: improved algorithm based on source code analysis — Preserve performance characteristics — Facilitate other tools to further process the kernels  Difficult to automate • Code triage: The right scope of a kernel, specification • Search space: Optimization/configuration for each kernel
  • 36. Lawrence Livermore National Laboratory 36 Case 3: resilience via source-level redundancy  Exascale systems are expected to use a large number of components running at low power • Exa = 10^18 FLOPS; at ~2GHz (10^9) per core, that implies ~10^9 cores • Increased sensitivity to external events • Transient faults caused by external events — Faults which don’t stop execution but silently corrupt results — Compared to permanent faults  Streamlined and simple processors • Cannot afford pure hardware-based resilience • Solution: software-implemented hardware fault tolerance
  • 37. Lawrence Livermore National Laboratory 37 Approach: compiler-based transformations to add resilience  Source code annotations (pragmas) • What to protect • What to do when things go wrong  Source-to-source translator • Fault detection: Triple-Modular Redundancy (TMR) • Implement fault-handling policies • Vendor compiler: binary executable • Intel PIN: fault injection  Challenges  Overhead  Vendor compiler optimizations [Diagram: annotated source code → ROSE::FTTransform → transformed source code → vendor compiler → binary executable → fault injection (Intel PIN)]
  • 38. Lawrence Livermore National Laboratory 38 Naïve example Policy: Final-wish Pragma: #pragma FT-FW (Majority-voting) Y = …; Semantics: // Triple-Modular Redundancy y[0] = . . . ; y[1] = . . . ; y[2] = . . . ; // Fault detection if ( !EQUALS( y[0], y[1], y[2] ) ) { // Fault handling Y = Voting(y[0], y[1], y[2]); } else Y = y[0];
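Spelled out as compilable C, the semantics above might look roughly like the sketch below; the function wrapper and the expr callback are illustrative assumptions, and EQUALS/Voting from the slide are expanded inline:

/* Sketch: final-wish policy with majority voting for a protected assignment Y = expr; */
double tmr_protect(double (*expr)(void))     /* expr: hypothetical re-evaluation of the RHS */
{
    double y[3];
    y[0] = expr();                           /* triple-modular redundancy: evaluate three times */
    y[1] = expr();
    y[2] = expr();

    if (y[0] == y[1] && y[1] == y[2])        /* fault detection */
        return y[0];

    /* fault handling: majority voting, pick the value at least two copies agree on */
    if (y[0] == y[1] || y[0] == y[2])
        return y[0];
    return y[1];                             /* otherwise fall back to y[1] */
}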
  • 39. Lawrence Livermore National Laboratory 39 Source code redundancy: reduce overhead Relying on baseline double-modular redundancy (DMR) to help reduce overhead [Code figure not reproduced in this transcript; it highlights the statement to be protected]
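Since the code figure is not reproduced here, a minimal sketch of what source-level DMR for a protected statement could look like follows; the statement, the helper names, and the fault-handling hook are illustrative, not the exact ROSE::FTTransform output:

extern void handle_fault(void);              /* hypothetical fault-handling hook */

/* Sketch: the protected statement  a[i] = b[i] * c;  executed twice (DMR).
   A mismatch between the two results signals a transient fault.            */
void dmr_protected_update(double *a, const double *b, double c, int i)
{
    double r1 = b[i] * c;
    double r2 = b[i] * c;                    /* redundant second execution */
    if (r1 != r2)
        handle_fault();
    a[i] = r1;
}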
  • 40. Lawrence Livermore National Laboratory 40 Source code redundancy: bypass compiler optimizations Using an extra pointer to help preserve source code redundancy [Code figure not reproduced in this transcript; it highlights the statement to be protected]
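Again the figure is not in the transcript. One common way to keep a vendor compiler from folding the two redundant computations into one is to route the second copy through an extra (here volatile-qualified) pointer so the optimizer cannot prove the two loads are identical; this is a hedged sketch of that idea, and the actual mechanism used in the paper may differ in detail:

extern void handle_fault(void);              /* hypothetical fault-handling hook */

/* Sketch: the redundant copy reads its input through an alias pointer,
   so common-subexpression elimination cannot merge it with the first.  */
void dmr_protected_update_optproof(double *a, const double *b, double c, int i)
{
    volatile const double *b_alias = b;      /* extra pointer to the same data */
    double r1 = b[i] * c;
    double r2 = b_alias[i] * c;              /* redundant computation preserved at -O2/-O3 */
    if (r1 != r2)
        handle_fault();
    a[i] = r1;
}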
  • 41. Lawrence Livermore National Laboratory 41 Results: necessity and effectiveness of optimizer-proof code redundancy
  PAPI_FP_INS for the protected kernel at O1 / O2 / O3:
    Our method of using pointers: Doubled / Doubled / Doubled
    Naïve duplication: No change / No change / No change
   Hardware counter to check if redundant computation can survive compiler optimizations • PAPI (PAPI_FP_INS) to collect the number of floating-point instructions for the protected kernel • GCC 4.3.4, O1 to O3
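A sketch of how such a PAPI_FP_INS measurement could be set up around the protected kernel, using standard PAPI low-level calls; the protected_kernel() function is a placeholder assumption for the transformed code under test:

#include <papi.h>
#include <stdio.h>

extern void protected_kernel(void);          /* hypothetical: the transformed kernel under test */

int main(void)
{
    long long fp_ins = 0;
    int evset = PAPI_NULL;

    PAPI_library_init(PAPI_VER_CURRENT);     /* initialize the PAPI library */
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_FP_INS);      /* count floating-point instructions */

    PAPI_start(evset);
    protected_kernel();                      /* run the protected kernel once */
    PAPI_stop(evset, &fp_ins);

    /* A roughly doubled count versus the unprotected build indicates the
       redundant computation survived the vendor compiler's O1-O3 passes. */
    printf("PAPI_FP_INS = %lld\n", fp_ins);
    return 0;
}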
  • 42. Lawrence Livermore National Laboratory 42 Results: overhead  Jacobi kernel: • Overhead: 0% to 30% • Minimum overhead — Stride=8 — Array size 16Kx16K — Double precision • The higher the original memory latency, the lower the relative overhead of the added redundancy • Livermore kernels — Kernel 1 (Hydro fragment) – 20% — Kernel 4 (Banded linear equations) – 40% — Kernel 5 (Tri-diagonal elimination) – 26% — Kernel 11 (First sum) – 2%
  • 43. Lawrence Livermore National Laboratory 43 Conclusion  HPC challenges • Traditional: performance, productivity, portability • Newer: heterogeneity, resilience, power efficiency • Made much worse by the increasing complexity of compute-node architectures  Source-to-source compilation techniques • Complement traditional compilers aimed at generating machine code — Source code analysis, transformation, optimization, generation, … • High-level programming models for GPUs • Performance tuning: code extraction and parameterized code translation/generation • Resilience via source-level redundancy
  • 44. Lawrence Livermore National Laboratory 44 Future directions  Accelerator optimizations • Newer accelerator features: direct GPU-to-GPU communication, peer-to-peer execution model, unified virtual memory • CPU threads vs. accelerator threads vs. vectorization  Knowledge-driven compilation and runtime adaptation • Explicitly represent optimization knowledge and automatically apply it in HPC  A uniform directive-based programming model • A single X instead of MPI+X to cover both on-node and across-node parallelism  Thread-level resilience • Most resilience work focuses on MPI, not threading
  • 45. Lawrence Livermore National Laboratory 45 Acknowledgement  Sponsors • LLNL (LDRD, ASC), DOE (SC), DOD  Colleagues • Dan Quinlan (ROSE project lead) • Pei-Hung Lin (Postdoc)  Collaborators • Mary Hall (University of Utah) • Qing Yi (University of Colorado at Colorado Springs) • Robert Lucas (University of Southern California-ISI) • Sally A. McKee (Chalmers University of Technology) • Stephen Guzik (Colorado State University) • Xipeng Shen (North Carolina State University) • Yonghong Yan (Oakland University) • ….

Editor's Notes

  1. Parallel scientific computing in climate modeling, weather prediction, energy, material science, etc.
  2. Traditional compilers (GCC, ICC, Open64): aimed to generate machine executables; lose high-level source code information; black-box to application developers; multiple backends for multiple platforms. A source-to-source compiler infrastructure: generates analyzed and/or transformed source code; maintains high-level source structure and information (comments, preprocessing info., etc.); friendly to application developers (easy to understand); portable across multiple platforms by leveraging backend compilers.
  3. ROSE: making compiler techniques more accessible. A source-to-source compiler infrastructure; high-level intermediate representation (abstract syntax tree, AST) with rich traversal/transformation interfaces; supports C/C++, Fortran, OpenMP, UPC; encapsulates sophisticated compiler analyses/optimizations into easy-to-use function calls. Intended users: building customized program analysis and translation tools.
  4. The ROSE IR consists of the AST, a set of associated symbol tables, and a rich set of interface functions. As a source-to-source compiler framework, the ROSE AST is a high-level representation designed to preserve source code details such as the token stream, source comments, C preprocessor info, and C++ templates. It offers a rich interface for AST traversal, query, creation, copying, symbol lookup, generic analyses, and transformations. It is also extensible, meaning users can attach arbitrary information to the AST via a mechanism called persistent attributes. As we can see in this slide, a program’s source code has a very straightforward and faithful tree representation in ROSE. As a result, ROSE is ideal for quickly building customized analysis and translation tools to support source code analysis and restructuring.
  5. Discover/design/develop building blocks in the software stack. Language level: target, device(), map(), …; num_devices, device_type(), no-middle-sync, …. Compiler level: pragma parsing (parser building blocks); AST extensions (CUDA-specific IR nodes); AST translation (SageBuilder and SageInterface functions); loop normalization and multi-dimensional array linearization; outliner (CUDA kernel generation). Runtime level: device feature probe (threads per block, block number, …); memory management (allocation, copy, data tracking/reuse, etc.); loop scheduling (within and across thread blocks); reduction (a first two-level algorithm).
  6. Use the latest official release for key concepts: a one-slide primer.
  7. Language level: target, device(), map(), …; num_device, device_type(), no-middle-sync, …. Compiler level: pragma parsing (parser building blocks); AST extensions (CUDA-specific IR nodes); AST translation (SageBuilder and SageInterface functions); loop normalization and multi-dimensional array linearization; outliner (CUDA kernel generation). Runtime level: device feature probe (threads per block, block number, …); memory management (allocation, copy, data tracking/reuse, etc.); loop scheduling (within and across thread blocks); reduction (a first two-level algorithm).
  8. Language features: target regions; parallel regions and teams; data mapping; loop and distribute directives; C bindings (interoperate with C/C++ and Fortran); conditional bodies (support both CUDA and OpenCL, ongoing work). Language level: target, device(), map(), …; num_device, device_type(), no-middle-sync, …. Compiler level: pragma parsing (parser building blocks); AST extensions (CUDA-specific IR nodes); AST translation (SageBuilder and SageInterface functions); loop normalization and multi-dimensional array linearization; outliner (CUDA kernel generation). Runtime level: device feature probe (threads per block, block number, …); memory management (allocation, copy, data tracking/reuse, etc.); loop scheduling (within and across thread blocks); reduction (a first two-level algorithm). *C. Liao, D. J. Quinlan, T. Panas, and B. R. de Supinski, “A ROSE-based OpenMP 3.0 research compiler supporting multiple runtime libraries,” in Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More (IWOMP’10), Springer, 2010, pp. 15-28.
  9. Three points: start with the traditional parallel for; add the first target directive with map clauses; to avoid repetitive data allocation and copying, add target data at the while-loop level.
  10. Four points: outliner extended to generate the CUDA kernel; invoke the runtime scheduler; array linearization; inner-block reduction.
  11. For the call site: data preparation, configuration, launch, cleanup. Runtime support for tracking the data environment: a stack of data structures storing the data allocated at each level. Runtime data allocation and registration, execution configuration, CPU-side reduction.
  12. Points: * software/hardware configuration * PGI OpenACC, HMPP OpenACC * an accelerator is not always the better choice. Example code: void axpy_ompacc(REAL* x, REAL* y, int n, REAL a) { int i; #pragma omp target device (gpu0) map(tofrom: y[0:n]) map(to: x[0:n],a,n) #pragma omp parallel for for (i = 0; i < n; ++i) y[i] += a * x[i]; }
  13. The PGI collapse version does not give any performance improvement; i-j-k order, scalar replacement.
  14. PGI collapse does not work with reduction
  15. The Heterogeneous OpenMP (HOMP) programming model, developed using the building blocks of this project, has delivered competitive performance for the important stencil computation kernel, Jacobi. This figure shows the improvements in execution time from incrementally using more advanced compiler and runtime building blocks. The original version uses the minimum set of building blocks. The target-data version uses data reuse. The remaining versions use loop collapse, 2-D mapping, and different scheduling policies.
  16. Current: device(id), portability concern. Desired: device_type(), num_devices(), dist_data(). Avoid cumbersome explicit teams. May need a per-device default scheduling policy. omp target parallel: avoid the initial implicit sequential task.
  17. Our work is motivated by a grand challenge problem today: with radically different and diverse platforms such as multicore, GPU, FPGA, etc., building conventional compiler optimizations for each of them is no longer an effective solution. On the other hand, empirical tuning seems to be a very promising approach since it usually does not rely on accurate machine models; it mostly chooses the best optimization based on actual measured performance, regardless of hardware details. Unfortunately, most existing autotuning work only handles small kernels and domain-specific libraries. Thus, the goal of this work is to study what it takes to extend autotuning to handle general large-scale applications. In particular, we want to build automated tools to support empirical tuning of whole applications. (It is difficult to accurately model each complex platform; static compiler optimizations have had limited success. Software: C/C++/Fortran, MPI, OpenMP, UPC; source-to-source.)
  18. A classic algorithm will generate an outlined kernel like this: written variables have their addresses passed and are used via pointer dereferences. There are at least two serious problems with the classic outlining algorithm. One is that it may change the performance characteristics of the transformed application: pointer dereferencing is often more expensive than accessing straightforward non-pointer-typed variables. The other is that it significantly impedes static analysis due to the well-known limitations of accurate pointer analysis. Consequently, kernels with excessive pointer references are hard to further analyze and transform to generate multiple variants. From our experience, none of the SciDAC PERI tools can handle the kernel mentioned above, so it is a serious problem. (Hard to optimize, difficult for other tools to handle, e.g. parameterized translation. This slide shows the Jacobi program after its loop computation kernel is outlined using a classic algorithm.)
  19. As for the parameters, we use variable scope, liveness, and side-effect analysis to minimize the number of parameters to be passed. Further, we decide how to pass each parameter, either by value or by reference, based on the formulas here. Modified and live-out variables need their modified values passed back, plus arrays (a language limitation) and class/structure variables (for efficiency): Parameters = ((AllVars − InnerVars − GlobalVars − NamespaceVars − ClassVars) ∩ (LiveInVars ∪ LiveOutVars)) ∪ ClassPointers; PassByRefParameters = Parameters ∩ ((ModifiedVars ∩ LiveOutVars) ∪ ArrayVars ∪ ClassVars). Once the outliner ensures that a target can be correctly outlined.
  20. In conclusion, manual code specification/translation may be needed
  21. Second-chance: retry until reaching the predefined threshold, then use next-policy. Final-wish: only one chance to handle the fault using next-policy; you can inject a statement before the policy is called.
  22. * Jacob Lidman, Daniel J. Quinlan, Chunhua Liao and Sally A. McKee, ROSE::FTTransform A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research, Fault-Tolerance for HPC at Extreme Scale (FTXS 2012), Boston, June 25-28, 2012.
  23. It is obvious that this overhead is heavily influenced by the memory latencies of the original code: the more latency, the better hidden the cost of the duplicated computation. The minimum overhead is observed when the Jacobi iteration uses a non-consecutive stride, a large data set, and a large element size.