In this deck from PASC18, Robert Searles from the University of Delaware presents: Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures.
"Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages and programming models among other components in order to migrate large scale applications and explore parallelism on these machines. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task especially when the goal is to match the performance that the hardware platforms can offer. One such pattern is wavefront. This paper extensively studies a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling.
We parallelize the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm in the main kernel of Minisweep (the miniapplication) using CUDA, OpenMP and OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU boasts an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation. Our experimental platform includes SummitDev, an ORNL representative architecture of the upcoming Summit supercomputer. Our parallelization effort across platforms also motivated us to define an abstract parallelism model that is architecture independent, with a goal of creating software abstractions that can be used by applications employing the wavefront sweep motif."
Watch the video: https://wp.me/p3RLHQ-iPU
Read the Full Paper: https://doi.org/10.1145/3218176.3218228
More info: https://pasc18.pasc-conference.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures
1. Robert Searles, Sunita Chandrasekaran
(rsearles, schandra)@udel.edu
Wayne Joubert, Oscar Hernandez
(joubert,oscar)@ornl.gov
Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures
PASC 2018, June 3, 2018
2. Motivation
• Parallel programming is software’s future
– Acceleration
• State-of-the-art abstractions handle simple
parallel patterns well
• Complex patterns are hard!
3. Our Contributions
• An abstract representation for wavefront
algorithms
• A performance portable proof-of-concept of
this abstraction using directives: OpenACC
– Evaluation on multiple state-of-the-art platforms
• A description of the limitations of existing
high-level programming models
4. Several ways to accelerate applications
• Libraries – drop-in acceleration
• Directives – used for easier acceleration
• Programming Languages – maximum flexibility
5. Directive-Based Programming Models
• OpenMP (current version 4.5)
– Multi-platform shared-memory multiprocessing API
– Has supported device offloading since 2013 (OpenMP 4.0),
though offloading was not its original focus
• OpenACC (current version 2.6)
– Directive-based model designed for heterogeneous
computing
8. CUDA Example
Kernel
__global__ void addKernel(int *c, const int *a, const int *b, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) // guard: the grid may contain more threads than elements
        c[i] = a[i] + b[i];
}
Host Code
cudaError_t cudaStatus;
// Choose which GPU to run on; change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
// Allocate GPU buffers for three vectors (two input, one output).
cudaStatus = cudaMalloc((void**)&dev_c, N * sizeof(int));
cudaStatus = cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaStatus = cudaMalloc((void**)&dev_b, N * sizeof(int));
// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaStatus = cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
// Launch the kernel with one thread per element; round the grid size
// up so that N need not be a multiple of BLOCK_SIZE.
addKernel<<<(N + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(dev_c, dev_a, dev_b, N);
// cudaDeviceSynchronize waits for the kernel to finish and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
// Copy output vector from GPU buffer back to host memory.
cudaStatus = cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);
return cudaStatus;
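For contrast with the directive-based models above, here is a minimal sketch of the same vector addition written with OpenACC (my illustration, not from the deck; the function name is hypothetical). The compiler generates the kernel, the data transfers, and the launch:

void addVectors(int n, const int *a, const int *b, int *c)
{
    // copyin/copyout replace the explicit cudaMalloc/cudaMemcpy calls,
    // and the parallel loop directive replaces the kernel and launch.
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}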
9. Pattern-Based Approach in Parallel
Computing
• Several parallel patterns
– Existing high-level languages provide abstractions for many
simple patterns
• However, complex patterns often found in scientific
applications are challenging to represent with software
abstractions
– They require manual code rewrites
• Need additional features/extensions!
– How do we approach this? (Our paper’s contribution)
10. Application Motivation:
Minisweep
• A miniapp modeling the wavefront sweep component of
Denovo, a radiation transport code from ORNL
– Minisweep represents 80-90% of Denovo
• Denovo, part of a DOE INCITE project, is used to model
nuclear reactors – CASL, ITER
• Run many times with different parameters
– The faster it runs, the more configurations we can explore
• Poses a six-dimensional problem
– 3D in space, 2D in angular particle direction, and 1D in
particle energy
11. Minisweep code status
• GitHub: https://github.com/wdj/minisweep
• Used for early application readiness on ORNL’s Titan
• Being used for acceptance testing of Summit (#1 on the
TOP500)
• Has been ported to Beacon and Titan (ORNL machines)
using OpenMP and CUDA
16. Overview of Sweep Algorithm
• 5 nested loops
– X, Y, Z dimensions, energy groups, angles
– OpenACC/PGI offers only 2 usable levels of parallelism
here: gang and vector (the worker clause was not working
properly) – see the sketch below
– Upstream data dependency
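In OpenACC terms, gang maps a loop across coarse-grained groups of threads and vector maps a loop across fine-grained lanes within a group (roughly CUDA thread blocks and threads, respectively). A minimal sketch of the mapping (my example, not Minisweep code; the element-wise update is illustrative):

void scale(int n, int m, float out[n][m], const float in[n][m])
{
    // gang: distribute iterations of i across coarse-grained gangs
    // (roughly CUDA thread blocks)
    #pragma acc parallel loop gang
    for (int i = 0; i < n; i++) {
        // vector: distribute iterations of j across SIMT lanes
        // (roughly CUDA threads within a block)
        #pragma acc loop vector
        for (int j = 0; j < m; j++)
            out[i][j] = 2.0f * in[i][j];
    }
}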
19. Parallelizing the Sweep Algorithm: KBA
• Koch-Baker-Alcouffe (KBA)
• Algorithm developed in 1992 at Los Alamos
• A parallel sweep algorithm that overcomes some of the
dependencies using a wavefront
Image credit: “High Performance Radiation Transport Simulations:
Preparing for TITAN”, C. Baker, G. Davidson, T. M. Evans,
S. Hamilton, J. Jarrell and W. Joubert, ORNL, USA
20. Expressing Wavefronts via Software
Abstractions – A Challenge
• Existing solutions involve manual rewrites or
compiler-based loop transformations
– Michael Wolfe. 1986. Loop skewing: the wavefront method
revisited. Int. J. Parallel Program. 15, 4 (October 1986),
279-293. DOI: http://dx.doi.org/10.1007/BF01407876
– Polyhedral frameworks such as CHiLL and Pluto require code
refactoring and only support affine loops (array indirections,
data-dependent conditionals, linearized arrays, and pointer
use fall outside the model)
• No solution in high-level languages like
OpenMP/OpenACC; no software abstractions
21. Our Contribution:
Create Software Abstractions for the
Wavefront Pattern
• Analyzing the flow of data and computation in
wavefront codes
• Memory and threading challenges
• Wavefront loop transformation algorithm (sketched
below)
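As a concrete illustration of the kind of transformation involved, here is a minimal 2D loop-skewing sketch (my example, not the paper's full algorithm). In the original nest each cell depends on its left and upper neighbors, which serializes both loops; cells on the same anti-diagonal d = i + j are mutually independent, so each wavefront can run in parallel:

void wavefront(int n, int m, double grid[n][m])
{
    // Assumes row 0 and column 0 hold initialized boundary values.
    // Original (serial) form: grid[i][j] depends on grid[i-1][j] and
    // grid[i][j-1], so neither loop parallelizes directly.
    // Skewed form: sweep anti-diagonals d = i + j; every cell on one
    // diagonal has its upstream neighbors on earlier diagonals.
    for (int d = 2; d <= n + m - 2; d++) {
        int ilo = (d - (m - 1) > 1) ? d - (m - 1) : 1;
        int ihi = (d - 1 < n - 1) ? d - 1 : n - 1;
        #pragma acc parallel loop /* hypothetical device mapping */
        for (int i = ilo; i <= ihi; i++) {
            int j = d - i;
            grid[i][j] = grid[i-1][j] + grid[i][j-1];
        }
    }
}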
23. Abstract Parallelism Model
• Spatial decomposition = outer layer (KBA)
– No existing abstraction for this
• In-gridcell computations = inner layer
– Application specific
• Upstream data dependencies
– Slight variation between wavefront applications
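A sketch of how the two layers might compose (my illustration; the wavefront cell lists and the in-gridcell routine are hypothetical stand-ins):

extern void compute_gridcell(int cell, int angle); // hypothetical in-cell kernel

void sweep(int num_wavefronts, const int *cells_in_wavefront,
           int *const *wavefront_cells, int num_angles)
{
    for (int w = 0; w < num_wavefronts; w++) {
        // Outer layer (KBA): cells on the same wavefront are independent.
        #pragma acc parallel loop gang
        for (int c = 0; c < cells_in_wavefront[w]; c++) {
            int cell = wavefront_cells[w][c];
            // Inner layer: application-specific in-gridcell work,
            // e.g. over angles.
            #pragma acc loop vector
            for (int a = 0; a < num_angles; a++)
                compute_gridcell(cell, a);
        }
    }
}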
24. Data Model
• Storing all previous wavefronts is unnecessary
– How many neighbors and prior wavefronts are
accessed?
• Face arrays make indexing easy
– Smaller data footprint
• Limiting memory to the size of the largest
wavefront is optimal, but not practical
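To make the face-array idea concrete, here is a rough sketch (mine, simplified to a sweep that depends only on the x-1 neighbor; in Minisweep each cell reads three upstream neighbors, so it keeps one face array per incoming face). Rather than storing every past wavefront, one 2D plane of results is kept and overwritten as the sweep advances:

extern double compute(int x, int y, int z, double upstream); // hypothetical

void sweep_x(int nx, int ny, int nz, double face[ny][nz])
{
    // face[y][z] holds the result of cell (x-1, y, z); indexing is a
    // simple (y, z) lookup, and the footprint is O(ny*nz) instead of
    // O(nx*ny*nz) for the full grid history.
    for (int x = 0; x < nx; x++)
        for (int y = 0; y < ny; y++)
            for (int z = 0; z < nz; z++)
                face[y][z] = compute(x, y, z, face[y][z]);
}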
30. Contributions
• An abstract representation of wavefront
algorithms
• A performance portable proof-of-concept of
this abstraction using OpenACC
– Evaluation on multiple state-of-the-art platforms
• A description of the limitations of existing
high-level programming models
31. Next Steps
• Asynchronous execution
• MPI - multi-node/multi-GPU
• Develop a generalization/extension to existing
high-level programming models
– Prototype
32. Preliminary Results/Ongoing Work
• MPI + OpenACC
– 1 node x 1 P100 GPU = 66.79x speedup
– 4 nodes x 4 P100 GPUs/node = 565.81x speedup
– 4 nodes x 4 V100 GPUs/node = 624.88x speedup
• Distributing the workload lets us examine larger
spatial dimensions
– Future: Use blocking to allow for this on a single GPU
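A common pattern behind the MPI + OpenACC numbers above is one MPI rank per GPU. A minimal sketch (my illustration; assumes ranks are placed consecutively on each node, error handling omitted):

#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind this rank to one of the node's GPUs, e.g. 4 ranks per
    // node driving 4 GPUs per node.
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(rank % ngpus, acc_device_nvidia);

    // ... each rank sweeps its spatial subdomain and exchanges
    // wavefront faces with neighboring ranks via MPI ...

    MPI_Finalize();
    return 0;
}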
33. Takeaway(s)
• Using directives is not magical! Compilers are already doing a lot for
us!
• Code benefits from incremental improvement – so let’s not give up!
• *Profiling and re-profiling is critical*
• Refactor serial code where needed
– Make the code parallel and accelerator-friendly
• Watch out for compiler bugs and *report them*
– The programmer is not ‘always’ wrong
• Watch for *novel language extensions and propose them to the
committee* – user feedback matters
– Did you completely change the loop structure? Did you notice a
parallel pattern for which we don’t have a high-level directive yet?
34. Contributions
• An abstract representation of wavefront algorithms
• A performance portable proof-of-concept of this
abstraction using directives, OpenACC
– Evaluation on multiple state-of-the-art platforms
• A description of the limitations of existing high-level
programming models
• Contact: rsearles@udel.edu
• Github: https://github.com/rsearles35/minisweep
37. OpenACC Growing Momentum
Wide adoption across key HPC codes:
• 3 of the top 5 HPC applications use OpenACC: ANSYS Fluent,
Gaussian, VASP
• CAAR codes use OpenACC: GTC, XGC, ACME, FLASH, LSDalton
• OpenACC dominates key climate & weather codes globally:
COSMO, IFS (ESCAPE), NICAM, ICON, MPAS
• Gordon Bell finalist: CAM-SE on Sunway TaihuLight
[Chart: vasp_std (5.4.4) speed-up on the VASP silica IFPEN
RMM-DIIS benchmark; 1, 2, and 4 P100 GPUs vs. a 40-core
Broadwell node]
* The OpenACC port covers more VASP routines than CUDA; it was
planned top-down with a complete analysis of the call tree, and it
leverages improvements in the latest VASP Fortran source base.
38. Nuclear Reactor Modeling Proxy Code: Minisweep
• Minisweep, a miniapp, represents 80-90% of the Denovo Sn code
• Denovo Sn (discrete ordinates), part of a DOE INCITE project,
is used to model nuclear reactors – CASL, ITER
• Impact: by running Minisweep faster, experiments with more
configurations can be performed, directly impacting the
determination of the accuracy of radiation shielding
• Poses a six-dimensional problem: 3D in space, 2D in angular
particle direction, and 1D in particle energy
Speaker notes:
• Denovo is part of a current DOE INCITE project to model the
International Thermonuclear Experimental Reactor (ITER) fusion
reactor, and is the primary radiation transport code used in the
multi-industry Consortium for Advanced Simulation of Light Water
Reactors (CASL) headquartered at Oak Ridge National Laboratory (ORNL).
• The impact of the Fukushima nuclear accident in Japan highlights
the need to increase the production of safe, abundant energy.
• A transport solver that incorporates an adequately resolved
discretization for all scales using current discrete models would
require 10^17-10^21 degrees of freedom per step, which is well
beyond even exascale resources.
• Space: in the x, y, and z dimensions, each gridcell depends on
results from three upstream cells, based on octant direction,
resulting in a wavefront problem.
• Octant: results for different octants can be calculated
independently; the only dependency is that results from different
octants at the same entry of v_out are added together and are thus
a race hazard, depending on how the computation is done.
• Energy: computations for different energy groups have no coupling
and can be considered separate problem instances, enabling flexible
parallelization.
• Note that Minisweep actually runs 8 sweeps (one per octant
direction), which calls for a third layer of parallelism.
• Minisweep is one of the six applications selected for the ORNL
CAAR project.
• Similar to Sn wavefront codes including Kripke, the SN (Discrete
Ordinates) Application Proxy (SNAP), and PARTISN.