Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures
Robert Searles, Sunita Chandrasekaran
(rsearles, schandra)@udel.edu
Wayne Joubert, Oscar Hernandez
(joubert,oscar)@ornl.gov
PASC 2018, June 3, 2018
1
Motivation
• Parallel programming is software’s future
– Acceleration
• State-of-the-art abstractions handle simple
parallel patterns well
• Complex patterns are hard!
2
Our Contributions
• An abstract representation for wavefront
algorithms
• A performance portable proof-of-concept of
this abstraction using directives: OpenACC
– Evaluation on multiple state-of-the-art platforms
• A description of the limitations of existing
high-level programming models
3
Several ways to accelerate
Applications
• Libraries: drop-in acceleration
• Directives: used for easier acceleration
• Programming languages: maximum flexibility
4
Directive-Based Programming Models
• OpenMP (current version 4.5)
– Multi-platform API for shared-memory multiprocessing
– Has supported device offloading since 2013
• OpenACC (current version 2.6)
– Directive-based model for heterogeneous computing
5
Serial Example
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}
6
OpenACC Example
#pragma acc parallel loop independent
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}
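For comparison with the CUDA host code and kernel, a minimal self-contained sketch of the directive-based version might look like the following. The function name `vector_add` and the explicit `copyin`/`copyout` data clauses are assumptions added for illustration; they do not appear on the slide. A compiler without OpenACC support ignores the pragma and runs the loop serially.

```c
#include <assert.h>

/* Element-wise c = a + b for n elements.
 * The data clauses (an assumption, not from the slide) describe which
 * arrays are copied to the device and which are copied back. */
void vector_add(const int *restrict a, const int *restrict b,
                int *restrict c, int n)
{
    assert(n >= 0);
    #pragma acc parallel loop independent copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The same source serves as both the serial and the accelerated version, which is the main appeal of directives over the explicit allocation, copy, and launch calls that CUDA requires.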
7
CUDA Example
Host Code
cudaError_t cudaStatus;
int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;
// Choose which GPU to run on, change this on a multi-GPU system.
// (Error checks after each call are omitted for brevity.)
cudaStatus = cudaSetDevice(0);
// Allocate GPU buffers for three vectors (two input, one output)
cudaStatus = cudaMalloc((void**)&dev_c, N * sizeof(int));
cudaStatus = cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaStatus = cudaMalloc((void**)&dev_b, N * sizeof(int));
// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaStatus = cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
// Launch a kernel on the GPU with one thread for each element,
// rounding the block count up so that every element is covered.
addKernel<<<(N + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(dev_c, dev_a, dev_b, N);
// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
// Copy output vector from GPU buffer to host memory.
cudaStatus = cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);
return cudaStatus;
Kernel
__global__ void addKernel(int *c, const int *a, const int *b, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {  // guard threads past the end of the arrays
        c[i] = a[i] + b[i];
    }
}
8
Pattern-Based Approach in Parallel
Computing
• Several parallel patterns
– Existing high-level languages provide abstractions for many
simple patterns
• However, complex patterns often found in scientific applications are
challenging to represent with software abstractions
– Require manual code rewrite
• Need additional features/extensions!
– How do we approach this? (Our paper’s contribution)
9
Application Motivation:
Minisweep
• A miniapp that models the wavefront sweep component of
Denovo, a radiation transport code from ORNL
– Minisweep represents 80-90% of Denovo
• Denovo, part of a DOE INCITE project, is used to model
nuclear reactors (CASL, ITER)
• Run many times with different parameters
• The faster it runs, the more configurations we can explore
• Poses a six-dimensional problem
– 3D in space, 2D in angular particle direction, and 1D in
particle energy
10
Minisweep code status
• GitHub: https://github.com/wdj/minisweep
• Early application readiness on ORNL Titan
• Being used for acceptance testing of Summit
(#1 on the TOP500)
• Has been ported to Beacon and Titan (ORNL
machines) using OpenMP and CUDA
11
Minisweep: The Basics
12
Parallelizing Sweep Algorithm
13
Complex Parallel Pattern Identified:
Wavefront
14
15
Overview of Sweep Algorithm
• 5 nested loops
– X, Y, Z dimensions, Energy Groups, Angles
– In practice, OpenACC with the PGI compiler offers only 2 levels
of parallelism: gang and vector (the worker clause was not
working properly)
– Upstream data dependency
16
17
18
Parallelizing Sweep Algorithm: KBA
• Koch-Baker-Alcouffe (KBA)
• Algorithm developed in 1992 at
Los Alamos
• Parallel sweep algorithm that
overcomes some of the
dependencies using a wavefront.
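The wavefront ordering can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not the Denovo or Minisweep implementation: it uses a single sweep direction, a toy update in which each cell sums its three upstream neighbors, a boundary value of 1, and hypothetical names (`sweep_serial`, `sweep_wavefront`). It shows only the ordering idea, not KBA's spatial decomposition across processors. All cells with the same `ix + iy + iz` depend only on cells from the previous wavefront, so they can be processed in any order, or in parallel.

```c
#include <assert.h>

#define NX 4
#define NY 3
#define NZ 5

/* Value of an upstream neighbor, or a boundary value of 1 outside the grid. */
static long upstream(long v[NZ][NY][NX], int iz, int iy, int ix)
{
    return (ix < 0 || iy < 0 || iz < 0) ? 1 : v[iz][iy][ix];
}

/* Toy in-gridcell computation: sum of the three upstream neighbors. */
static long cell_update(long v[NZ][NY][NX], int iz, int iy, int ix)
{
    return upstream(v, iz, iy, ix - 1) + upstream(v, iz, iy - 1, ix)
         + upstream(v, iz - 1, iy, ix);
}

/* Reference: plain nested sweep, fully serial. */
void sweep_serial(long v[NZ][NY][NX])
{
    for (int iz = 0; iz < NZ; iz++)
        for (int iy = 0; iy < NY; iy++)
            for (int ix = 0; ix < NX; ix++)
                v[iz][iy][ix] = cell_update(v, iz, iy, ix);
}

/* Wavefront order: wavefronts run one after another, but the cells
 * inside one wavefront are independent of each other. */
void sweep_wavefront(long v[NZ][NY][NX])
{
    for (int w = 0; w < NX + NY + NZ - 2; w++) {
        /* These two loops could be run in parallel. */
        for (int iz = 0; iz < NZ; iz++) {
            for (int iy = 0; iy < NY; iy++) {
                int ix = w - iz - iy;  /* the cell on wavefront w, if any */
                if (ix >= 0 && ix < NX)
                    v[iz][iy][ix] = cell_update(v, iz, iy, ix);
            }
        }
    }
}

/* Checks that both orderings produce identical results. */
void check_sweep_orders_agree(void)
{
    static long a[NZ][NY][NX], b[NZ][NY][NX];
    sweep_serial(a);
    sweep_wavefront(b);
    for (int iz = 0; iz < NZ; iz++)
        for (int iy = 0; iy < NY; iy++)
            for (int ix = 0; ix < NX; ix++)
                assert(a[iz][iy][ix] == b[iz][iy][ix]);
}
```

Calling `check_sweep_orders_agree()` confirms that reordering the sweep by wavefront does not change the result for this toy update.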
19
Image credit: High Performance
Radiation Transport Simulations:
Preparing for TITAN
C. Baker, G. Davidson, T. M. Evans,
S. Hamilton, J. Jarrell and W.
Joubert, ORNL, USA
Expressing Wavefront via Software
Abstractions – A Challenge
• Existing solutions involve manual rewrites or
compiler-based loop transformations
– Michael Wolfe. 1986. Loop skewing: the wavefront method
revisited. Int. J. Parallel Program. 15, 4 (October 1986),
279-293. DOI=http://dx.doi.org/10.1007/BF01407876
– Polyhedral frameworks such as CHiLL and Pluto, which only
support affine loops
• No solution in high-level programming models like
OpenMP/OpenACC; no software abstractions
20
Our Contribution:
Create Software Abstractions for
Wavefront pattern
• Analyzing flow of data and computation in
wavefront codes
• Memory and threading challenges
• Wavefront loop transformation algorithm
21
Abstract Parallelism Model
for( iz=izbeg; iz!=izend; iz+=izinc )
  for( iy=iybeg; iy!=iyend; iy+=iyinc )
    for( ix=ixbeg; ix!=ixend; ix+=ixinc ) { // space
      for( ie=0; ie<dim_ne; ie++ ) { // energy
        for( ia=0; ia<dim_na; ia++ ) { // angles
          // in-gridcell computation
        }
      }
    }
22
Abstract Parallelism Model
• Spatial decomposition = outer layer (KBA)
– No existing abstraction for this
• In-gridcell computations = inner layer
– Application specific
• Upstream data dependencies
– Slight variation between wavefront applications
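The two layers might be mapped onto OpenACC's gang and vector levels roughly as in the sketch below. This is an illustrative assumption, not the code from the paper: the array layout, the helper names (`idx`, `upstream`, `sweep_two_layer`), the grid sizes, and the placeholder in-gridcell computation are all hypothetical. The wavefront loop stays serial, the cells of one wavefront are spread across gangs, and the collapsed energy and angle loops are spread across vector lanes. Without an OpenACC compiler the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <stddef.h>

enum { DIM_X = 4, DIM_Y = 4, DIM_Z = 4, DIM_NE = 2, DIM_NA = 3 };
enum { N_TOTAL = DIM_Z * DIM_Y * DIM_X * DIM_NE * DIM_NA };

/* Flat index into a [z][y][x][energy][angle] array. */
#pragma acc routine seq
static int idx(int iz, int iy, int ix, int ie, int ia)
{
    return (((iz * DIM_Y + iy) * DIM_X + ix) * DIM_NE + ie) * DIM_NA + ia;
}

/* Upstream value for the same (ie, ia), or 0 outside the grid. */
#pragma acc routine seq
static double upstream(const double *v, int iz, int iy, int ix, int ie, int ia)
{
    return (ix < 0 || iy < 0 || iz < 0) ? 0.0 : v[idx(iz, iy, ix, ie, ia)];
}

void sweep_two_layer(double *v)
{
    assert(v != NULL);
    #pragma acc data copy(v[0:N_TOTAL])
    {
        /* Outer layer: wavefronts are processed one after another. */
        for (int w = 0; w < DIM_X + DIM_Y + DIM_Z - 2; w++) {
            /* Cells within one wavefront are independent: gang level. */
            #pragma acc parallel loop gang collapse(2) present(v[0:N_TOTAL])
            for (int iz = 0; iz < DIM_Z; iz++) {
                for (int iy = 0; iy < DIM_Y; iy++) {
                    int ix = w - iz - iy;
                    if (ix >= 0 && ix < DIM_X) {
                        /* Inner layer: energy groups and angles, vector level. */
                        #pragma acc loop vector collapse(2)
                        for (int ie = 0; ie < DIM_NE; ie++) {
                            for (int ia = 0; ia < DIM_NA; ia++) {
                                /* Placeholder in-gridcell computation. */
                                v[idx(iz, iy, ix, ie, ia)] = 1.0
                                    + upstream(v, iz, iy, ix - 1, ie, ia)
                                    + upstream(v, iz, iy - 1, ix, ie, ia)
                                    + upstream(v, iz - 1, iy, ix, ie, ia);
                            }
                        }
                    }
                }
            }
        }
    }
}
```

Collapsing the energy and angle loops into one vector loop is one way to work with only two usable levels of parallelism; a third level would allow these dimensions to be mapped separately.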
23
Data Model
• Storing all previous wavefronts is unnecessary
– How many neighbors and prior wavefronts are
accessed?
• Face arrays make indexing easy
– Smaller data footprint
• Limiting memory to the size of the largest
wavefront is optimal, but not practical
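The face-array idea can be sketched as follows, assuming a single sweep direction, a toy update, and a boundary value of 1. The array names `facexy`, `facexz`, and `faceyz` and the function names are illustrative assumptions. Each face array holds only the most recent value flowing across one family of cell faces, so a cell reads its three incoming values from the face arrays and then overwrites them with its own result.

```c
#include <assert.h>

enum { NX = 4, NY = 3, NZ = 5 };

/* Toy in-gridcell computation from the three incoming face values. */
static long cell_update(long in_x, long in_y, long in_z)
{
    return in_x + in_y + in_z;
}

/* Reference: keep every cell value in a full 3D array. */
long sweep_full_storage(void)
{
    static long v[NZ][NY][NX];
    for (int iz = 0; iz < NZ; iz++)
        for (int iy = 0; iy < NY; iy++)
            for (int ix = 0; ix < NX; ix++) {
                long in_x = ix > 0 ? v[iz][iy][ix - 1] : 1;
                long in_y = iy > 0 ? v[iz][iy - 1][ix] : 1;
                long in_z = iz > 0 ? v[iz - 1][iy][ix] : 1;
                v[iz][iy][ix] = cell_update(in_x, in_y, in_z);
            }
    return v[NZ - 1][NY - 1][NX - 1];
}

/* Face arrays: NX*NY + NX*NZ + NY*NZ values instead of NX*NY*NZ. */
long sweep_face_arrays(void)
{
    long facexy[NY][NX];  /* values flowing in the z direction */
    long facexz[NZ][NX];  /* values flowing in the y direction */
    long faceyz[NZ][NY];  /* values flowing in the x direction */
    long out = 0;

    /* Initialize all faces with the boundary value. */
    for (int iy = 0; iy < NY; iy++)
        for (int ix = 0; ix < NX; ix++) facexy[iy][ix] = 1;
    for (int iz = 0; iz < NZ; iz++)
        for (int ix = 0; ix < NX; ix++) facexz[iz][ix] = 1;
    for (int iz = 0; iz < NZ; iz++)
        for (int iy = 0; iy < NY; iy++) faceyz[iz][iy] = 1;

    for (int iz = 0; iz < NZ; iz++)
        for (int iy = 0; iy < NY; iy++)
            for (int ix = 0; ix < NX; ix++) {
                out = cell_update(faceyz[iz][iy], facexz[iz][ix], facexy[iy][ix]);
                /* The result becomes the incoming value of the downstream cells. */
                faceyz[iz][iy] = out;
                facexz[iz][ix] = out;
                facexy[iy][ix] = out;
            }
    return out;  /* value of the last cell swept */
}

/* Checks that the face-array sweep matches the full-storage sweep. */
void check_face_arrays(void)
{
    assert(sweep_face_arrays() == sweep_full_storage());
}
```

The same indexing also works in wavefront order, because two different cells on the same wavefront never share an (ix, iy), (ix, iz), or (iy, iz) pair and therefore never touch the same face entry.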
24
Parallelizing Sweep Algorithm: KBA
25
Programming Model Limitations
• No abstraction for wavefront loop
transformation
– Manual loop restructuring
• Limited layers of parallelism
– 2 (gang and vector) aren’t enough, and worker was not working properly
– Asynchronous execution?
26
Experimental Setup
• NVIDIA PSG Cluster
– CPU: Intel Xeon E5-2698 v3 (16-core) and Xeon E5-2690 v2 (10-core)
– GPU: NVIDIA Tesla P100, Tesla V100, and Tesla K40 (4 GPUs per node)
• ORNL Titan
– CPU: AMD Opteron 6274 (16-core)
– GPU: NVIDIA Tesla K20x
• ORNL SummitDev
– CPU: IBM Power8 (10-core)
– GPU: NVIDIA Tesla P100
• PGI OpenACC Compiler 17.10
• OpenMP – GCC 6.2.0
– The OpenMP Minisweep code had issues running on Titan but works OK on PSG.
27
Input Parameters
• Scientifically
– X/Y/Z dimensions = 64
– # Energy Groups = 64
– # Angles = 32
• Goal is to explore larger spatial dimensions
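To get a feel for how much parallelism these parameters expose, the sketch below counts wavefronts and their sizes. It assumes a single sweep direction and the simple diagonal wavefront `ix + iy + iz = w`, which is an illustrative assumption rather than Minisweep's exact decomposition; the function names are hypothetical. For a 64 x 64 x 64 grid this gives 190 wavefronts, the largest containing 3072 cells, and each cell carries 64 x 32 = 2048 independent (energy group, angle) pairs.

```c
#include <assert.h>

/* Number of wavefronts in an n x n x n grid. */
int num_wavefronts(int n)
{
    assert(n > 0);
    return 3 * n - 2;
}

/* Number of cells (ix, iy, iz) in the grid with ix + iy + iz == w. */
long wavefront_size(int n, int w)
{
    long count = 0;
    for (int iz = 0; iz < n; iz++) {
        for (int iy = 0; iy < n; iy++) {
            int ix = w - iz - iy;
            if (ix >= 0 && ix < n)
                count++;
        }
    }
    return count;
}

/* Size of the largest wavefront: the peak spatial parallelism. */
long largest_wavefront(int n)
{
    long best = 0;
    for (int w = 0; w < num_wavefronts(n); w++) {
        long size = wavefront_size(n, w);
        if (size > best)
            best = size;
    }
    return best;
}

/* Sum of all wavefront sizes; equals n * n * n. */
long total_cells(int n)
{
    long total = 0;
    for (int w = 0; w < num_wavefronts(n); w++)
        total += wavefront_size(n, w);
    return total;
}
```

The first and last wavefronts contain a single cell each, so the available spatial parallelism ramps up and down over the course of a sweep.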
28
29
Contributions
• An abstract representation of wavefront
algorithms
• A performance portable proof-of-concept of
this abstraction using OpenACC
– Evaluation on multiple state-of-the-art platforms
• A description of the limitations of existing
high-level programming models
30
Next Steps
• Asynchronous execution
• MPI - multi-node/multi-GPU
• Develop a generalization/extension to existing
high-level programming models
– Prototype
31
Preliminary Results/Ongoing Work
• MPI + OpenACC
– 1 node x 1 P100 GPU = 66.79x speedup
– 4 nodes x 4 P100 GPUs/node = 565.81x speedup
– 4 nodes x 4 V100 GPUs/node = 624.88x speedup
• Distributing the workload lets us examine larger
spatial dimensions
– Future: Use blocking to allow for this on a single GPU
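One rough way to read these numbers is to compare the multi-GPU speedup with the single-GPU speedup. The sketch below assumes that both speedups are measured against the same baseline and the same problem size, which the slide does not state; the function names are hypothetical. Under that assumption, 16 P100 GPUs are about 8.47x faster than one P100, a parallel efficiency of about 0.53.

```c
#include <assert.h>

/* How much faster the multi-GPU run is than the single-GPU run,
 * assuming both speedups share the same baseline. */
double relative_scaling(double speedup_multi, double speedup_single)
{
    assert(speedup_single > 0.0);
    return speedup_multi / speedup_single;
}

/* Fraction of ideal linear scaling achieved on num_gpus GPUs. */
double parallel_efficiency(double speedup_multi, double speedup_single,
                           int num_gpus)
{
    assert(num_gpus > 0);
    return relative_scaling(speedup_multi, speedup_single) / num_gpus;
}
```

For example, `parallel_efficiency(565.81, 66.79, 16)` evaluates the 4-node P100 result against the 1-GPU result.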
32
Takeaway(s)
• Using directives is not magical! Compilers are already doing a lot for
us!
• Code benefits from incremental improvement, so let’s not give up!
• *Profiling and re-profiling are critical*
• Refactor serial code where needed
– Make the code parallel and accelerator-friendly
• Watch out for compiler bugs and *report them*
– The programmer is not ‘always’ wrong
• Look out for *novel language extensions and propose them to the
committee* (user feedback)
– Did you completely change the loop structure? Did you notice a
parallel pattern for which we don’t have a high-level directive yet?
33
Contributions
• An abstract representation of wavefront algorithms
• A performance portable proof-of-concept of this
abstraction using directives, OpenACC
– Evaluation on multiple state-of-the-art platforms
• A description of the limitations of existing high-level
programming models
• Contact: rsearles@udel.edu
• Github: https://github.com/rsearles35/minisweep
34
Additional Material
35
Additional Material
36
37
OpenACC Growing Momentum
Wide Adoption Across Key HPC Codes
• 3 of Top 5 HPC Applications Use OpenACC
– ANSYS Fluent, Gaussian, VASP
– [Figure: vasp_std (5.4.4) speed-up for VASP, silica IFPEN, RMM-DIIS on P100; bars for 40-core Broadwell, 1 P100, 2 P100, 4 P100]
– OpenACC port covers more VASP routines than CUDA; OpenACC port planned top
down, with complete analysis of the call tree; OpenACC port leverages improvements in
latest VASP Fortran source base
• CAAR Codes Use OpenACC
– GTC, XGC, ACME, FLASH, LSDalton
• OpenACC Dominates in Climate & Weather
– Key codes globally: COSMO, IFS (ESCAPE), NICAM, ICON, MPAS
• Gordon Bell Finalist
– CAM-SE on Sunway TaihuLight
38
Nuclear Reactor Modeling Proxy Code: Minisweep
• Minisweep, a miniapp, represents 80-90% of the Denovo Sn code
• Denovo Sn (discrete ordinates), part of a DOE INCITE project, is used to
model nuclear reactors (CASL, ITER)
• Impact: running Minisweep faster allows experiments with more
configurations, which directly affects how accurately radiation
shielding can be determined
• Poses a six-dimensional problem
– 3D in space, 2D in angular particle direction, and 1D in particle energy
38

Weitere ähnliche Inhalte

Was ist angesagt?

LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
Serving BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeServing BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeNidhin Pattaniyil
 
Graal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian WimmerGraal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian WimmerThomas Wuerthinger
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulThomas Wuerthinger
 
Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...Fwdays
 
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...Thomas Wuerthinger
 
Graal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution PlatformGraal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution PlatformThomas Wuerthinger
 
Advancing OpenFabrics Interfaces
Advancing OpenFabrics InterfacesAdvancing OpenFabrics Interfaces
Advancing OpenFabrics Interfacesinside-BigData.com
 
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...MLconf
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...Rogue Wave Software
 
Modeling an Embedded Device for PSpice Simulation
Modeling an Embedded Device for PSpice SimulationModeling an Embedded Device for PSpice Simulation
Modeling an Embedded Device for PSpice SimulationEMA Design Automation
 
Requirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and ApplicationsRequirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and ApplicationsLionel Briand
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~Yamagishi Laboratory, National Institute of Informatics, Japan
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Dimos Raptis
 
ScilabTEC 2015 - Silkan
ScilabTEC 2015 - SilkanScilabTEC 2015 - Silkan
ScilabTEC 2015 - SilkanScilab
 
On the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF ModelsOn the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF ModelsFilip Krikava
 
Fcv rep darrell
Fcv rep darrellFcv rep darrell
Fcv rep darrellzukun
 
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoContinual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoData Science Milan
 

Was ist angesagt? (20)

LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
Serving BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeServing BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServe
 
Graal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian WimmerGraal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian Wimmer
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered Harmful
 
Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...
 
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
 
Graal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution PlatformGraal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution Platform
 
Advancing OpenFabrics Interfaces
Advancing OpenFabrics InterfacesAdvancing OpenFabrics Interfaces
Advancing OpenFabrics Interfaces
 
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 
Modeling an Embedded Device for PSpice Simulation
Modeling an Embedded Device for PSpice SimulationModeling an Embedded Device for PSpice Simulation
Modeling an Embedded Device for PSpice Simulation
 
Requirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and ApplicationsRequirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and Applications
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)
 
ScilabTEC 2015 - Silkan
ScilabTEC 2015 - SilkanScilabTEC 2015 - Silkan
ScilabTEC 2015 - Silkan
 
On the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF ModelsOn the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF Models
 
Fcv rep darrell
Fcv rep darrellFcv rep darrell
Fcv rep darrell
 
DETR ECCV20
DETR ECCV20DETR ECCV20
DETR ECCV20
 
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoContinual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
 

Ähnlich wie Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures

A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsHPCC Systems
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaopenseesdays
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Intel® Software
 
Current & Future Use-Cases of OpenDaylight
Current & Future Use-Cases of OpenDaylightCurrent & Future Use-Cases of OpenDaylight
Current & Future Use-Cases of OpenDaylightabhijit2511
 
SERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the CloudSERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the CloudSERENEWorkshop
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemDatabricks
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomFacultad de Informática UCM
 
Scilab Challenge@NTU 2014/2015 Project Briefing
Scilab Challenge@NTU 2014/2015 Project BriefingScilab Challenge@NTU 2014/2015 Project Briefing
Scilab Challenge@NTU 2014/2015 Project BriefingTBSS Group
 
FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)Kirill Tsym
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingKernel TLV
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...jsvetter
 
CloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning
 
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...Daniel Varro
 
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen..."Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...Edge AI and Vision Alliance
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionSri Ambati
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsUnai Lopez-Novoa
 

Ähnlich wie Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures (20)

A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
Current & Future Use-Cases of OpenDaylight
Current & Future Use-Cases of OpenDaylightCurrent & Future Use-Cases of OpenDaylight
Current & Future Use-Cases of OpenDaylight
 
SERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the CloudSERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the Cloud
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving System
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
Scilab Challenge@NTU 2014/2015 Project Briefing
Scilab Challenge@NTU 2014/2015 Project BriefingScilab Challenge@NTU 2014/2015 Project Briefing
Scilab Challenge@NTU 2014/2015 Project Briefing
 
FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet Processing
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
 
CloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use Case
 
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
 
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen..."Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern Coprocessors
 

Mehr von inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures

  • 1. Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures
      Robert Searles, Sunita Chandrasekaran, (rsearles, schandra)@udel.edu
      Wayne Joubert, Oscar Hernandez, (joubert,oscar)@ornl.gov
      PASC 2018, June 3, 2018
  • 2. Motivation
      • Parallel programming is software’s future
          – Acceleration
      • State-of-the-art abstractions handle simple parallel patterns well
      • Complex patterns are hard!
  • 3. Our Contributions
      • An abstract representation for wavefront algorithms
      • A performance-portable proof of concept of this abstraction using directives: OpenACC
          – Evaluation on multiple state-of-the-art platforms
      • A description of the limitations of existing high-level programming models
  • 4. Several ways to accelerate applications
      • Libraries: drop-in acceleration
      • Directives: used for easier acceleration
      • Programming languages: maximum flexibility
  • 5. Directive-Based Programming Models
      • OpenMP (current version 4.5)
          – Multi-platform shared-memory multiprocessing API
          – Supports device offloading since 2013
      • OpenACC (current version 2.6)
          – Directive-based model for heterogeneous computing
  • 6. Serial Example
      for (int i = 0; i < N; i++) {
          c[i] = a[i] + b[i];
      }
  • 7. OpenACC Example
      #pragma acc parallel loop independent
      for (int i = 0; i < N; i++) {
          c[i] = a[i] + b[i];
      }
  • 8. CUDA Example
      Kernel:
        __global__ void addKernel(int *c, const int *a, const int *b)
        {
            // One thread per element; there is no bounds check, so this
            // assumes N is a multiple of BLOCK_SIZE.
            int i = threadIdx.x + blockIdx.x * blockDim.x;
            c[i] = a[i] + b[i];
        }
      Host code:
        cudaError_t cudaStatus;
        // Choose which GPU to run on, change this on a multi-GPU system.
        cudaStatus = cudaSetDevice(0);
        // Allocate GPU buffers for three vectors (two input, one output)
        cudaStatus = cudaMalloc((void**)&dev_c, N * sizeof(int));
        cudaStatus = cudaMalloc((void**)&dev_a, N * sizeof(int));
        cudaStatus = cudaMalloc((void**)&dev_b, N * sizeof(int));
        // Copy input vectors from host memory to GPU buffers.
        cudaStatus = cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaStatus = cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
        // Launch a kernel on the GPU with one thread for each element.
        addKernel<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(dev_c, dev_a, dev_b);
        // cudaDeviceSynchronize waits for the kernel to finish and returns
        // any errors encountered during execution. (The slide used
        // cudaThreadSynchronize, which is deprecated.)
        cudaStatus = cudaDeviceSynchronize();
        // Copy output vector from GPU buffer to host memory.
        cudaStatus = cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(dev_c);
        cudaFree(dev_a);
        cudaFree(dev_b);
        return cudaStatus;
  • 9. Pattern-Based Approach in Parallel Computing
      • Several parallel patterns
          – Existing high-level languages provide abstractions for many simple patterns
      • However, complex patterns that are common in scientific applications are hard to represent with software abstractions
          – They require manual code rewrites
      • Additional features/extensions are needed!
          – How do we approach this? (Our paper’s contribution)
  • 10. Application Motivation: Minisweep
      • A miniapp modeling the wavefront sweep component of Denovo, a radiation transport code from ORNL
          – Minisweep represents 80-90% of Denovo
      • Denovo, part of a DOE INCITE project, is used to model fusion reactors
          – CASL, ITER
      • Run many times with different parameters
      • The faster it runs, the more configurations we can explore
      • Poses a six-dimensional problem
          – 3D in space, 2D in angular particle direction, and 1D in particle energy
  • 11. Minisweep code status
      • GitHub: https://github.com/wdj/minisweep
      • Early application readiness on ORNL Titan
      • Being used for acceptance testing of Summit (#1 on the TOP500)
      • Has been ported to Beacon and Titan (ORNL machines) using OpenMP and CUDA
  • 14. Complex Parallel Pattern Identified: Wavefront
  • 15. Complex Parallel Pattern Identified: Wavefront
  • 16. Overview of Sweep Algorithm
      • 5 nested loops
          – X, Y, Z dimensions, energy groups, angles
          – OpenACC/PGI offers only 2 usable levels of parallelism: gang and vector (the worker clause does not work properly)
          – Upstream data dependency
  • 19. Parallelizing Sweep Algorithm: KBA
      • Koch-Baker-Alcouffe (KBA)
      • Algorithm developed in 1992 at Los Alamos
      • Parallel sweep algorithm that overcomes some of the dependencies using a wavefront
      Image credit: "High Performance Radiation Transport Simulations: Preparing for TITAN", C. Baker, G. Davidson, T. M. Evans, S. Hamilton, J. Jarrell and W. Joubert, ORNL, USA
  • 20. Expressing Wavefront via Software Abstractions – A Challenge
      • Existing solutions involve manual rewrites or compiler-based loop transformations
          – Michael Wolfe. 1986. Loop skewing: the wavefront method revisited. Int. J. Parallel Program. 15, 4 (October 1986), 279-293. DOI=http://dx.doi.org/10.1007/BF01407876
          – Polyhedral frameworks such as CHiLL and Pluto support only affine loops
      • No solution in high-level languages like OpenMP/OpenACC; no software abstractions
  • 21. Our Contribution: Create Software Abstractions for the Wavefront Pattern
      • Analyzing the flow of data and computation in wavefront codes
      • Memory and threading challenges
      • Wavefront loop transformation algorithm
  • 22. Abstract Parallelism Model
      for( iz=izbeg; iz!=izend; iz+=izinc )
      for( iy=iybeg; iy!=iyend; iy+=iyinc )
      for( ix=ixbeg; ix!=ixend; ix+=ixinc ) { // space
        for( ie=0; ie<dim_ne; ie++ ) { // energy
          for( ia=0; ia<dim_na; ia++ ) { // angles
            // in-gridcell computation
          }
        }
      }
  • 23. Abstract Parallelism Model
      • Spatial decomposition = outer layer (KBA)
          – No existing abstraction for this
      • In-gridcell computations = inner layer
          – Application specific
      • Upstream data dependencies
          – Slight variation between wavefront applications
  • 24. Data Model
      • Storing all previous wavefronts is unnecessary
          – How many neighbors and prior wavefronts are accessed?
      • Face arrays make indexing easy
          – Smaller data footprint
      • Limiting memory to the size of the largest wavefront is optimal, but not practical
  • 26. Programming Model Limitations
      • No abstraction for wavefront loop transformation
          – Manual loop restructuring
      • Limited layers of parallelism
          – 2 isn’t enough (worker is broken)
          – Asynchronous execution?
  • 27. Experimental Setup
      • NVIDIA PSG Cluster
          – CPU: Intel Xeon E5-2698 v3 (16-core) and Xeon E5-2690 v2 (10-core)
          – GPU: NVIDIA Tesla P100, Tesla V100, and Tesla K40 (4 GPUs per node)
      • ORNL Titan
          – CPU: AMD Opteron 6274 (16-core)
          – GPU: NVIDIA Tesla K20x
      • ORNL SummitDev
          – CPU: IBM Power8 (10-core)
          – GPU: NVIDIA Tesla P100
      • PGI OpenACC Compiler 17.10
      • OpenMP
          – GCC 6.2.0
          – Issues running the OpenMP Minisweep code on Titan, but it works on PSG
  • 28. Input Parameters
      • Scientifically
          – X/Y/Z dimensions = 64
          – # Energy Groups = 64
          – # Angles = 32
      • Goal is to explore larger spatial dimensions
  • 30. Contributions
      • An abstract representation of wavefront algorithms
      • A performance-portable proof of concept of this abstraction using OpenACC
          – Evaluation on multiple state-of-the-art platforms
      • A description of the limitations of existing high-level programming models
  • 31. Next Steps
      • Asynchronous execution
      • MPI: multi-node/multi-GPU
      • Develop a generalization/extension to existing high-level programming models
          – Prototype
  • 32. Preliminary Results/Ongoing Work
      • MPI + OpenACC
          – 1 node x 1 P100 GPU = 66.79x speedup
          – 4 nodes x 4 P100 GPUs/node = 565.81x speedup
          – 4 nodes x 4 V100 GPUs/node = 624.88x speedup
      • Distributing the workload lets us examine larger spatial dimensions
          – Future: Use blocking to allow for this on a single GPU
  • 33. Takeaway(s)
      • Using directives is not magical! Compilers are already doing a lot for us!
      • Code benefits from incremental improvement, so let’s not give up!
      • *Profiling and re-profiling is highly critical*
      • Look for serial code that needs refactoring
          – Make the code parallel and accelerator-friendly
      • Watch out for compiler bugs and *report them*
          – The programmer is not ‘always’ wrong
      • Watch out for *novel language extensions and propose them to the committee* (user feedback)
          – Did you completely change the loop structure? Did you notice a parallel pattern for which we don’t have a high-level directive yet?
  • 34. Contributions
      • An abstract representation of wavefront algorithms
      • A performance-portable proof of concept of this abstraction using directives (OpenACC)
          – Evaluation on multiple state-of-the-art platforms
      • A description of the limitations of existing high-level programming models
      • Contact: rsearles@udel.edu
      • GitHub: https://github.com/rsearles35/minisweep
  • 37. OpenACC Growing Momentum: Wide Adoption Across Key HPC Codes
      • 3 of Top 5 HPC Applications Use OpenACC: ANSYS Fluent, Gaussian, VASP
          – [Chart: speed-up of vasp_std (5.4.4), silica IFPEN, RMM-DIIS, on a 40-core Broadwell and on 1, 2, and 4 P100 GPUs]
          – The OpenACC port covers more VASP routines than CUDA, was planned top down with complete analysis of the call tree, and leverages improvements in the latest VASP Fortran source base
      • CAAR Codes Use OpenACC: GTC, XGC, ACME, FLASH, LSDalton
      • OpenACC Dominates in Climate & Weather; Key Codes Globally: COSMO, IFS (ESCAPE), NICAM, ICON, MPAS
      • Gordon Bell Finalist: CAM-SE on Sunway TaihuLight
  • 38. Nuclear Reactor Modeling Proxy Code: Minisweep
      • Minisweep, a miniapp, represents 80-90% of the Denovo Sn code
      • Denovo Sn (discrete ordinates), part of a DOE INCITE project, is used to model fusion reactors
          – CASL, ITER
      • Impact: running Minisweep faster allows experiments with more configurations, which directly affects how accurately radiation shielding can be determined
      • Poses a six-dimensional problem
          – 3D in space, 2D in angular particle direction, and 1D in particle energy
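
The sweep structure on slides 16, 19, and 22 can be illustrated with a small sketch. The C code below is not Minisweep's implementation: it assumes a flattened array `v`, a merged energy/angle index `iu`, and a placeholder `cell_update` (all hypothetical names). It orders the work by wavefront `w = ix + iy + iz`, so every cell in a wavefront already has its three upstream neighbors computed. The OpenACC directives show one plausible mapping (wavefront cells to gangs, the inner index to vector lanes); they are an untested assumption and are simply ignored by a compiler without OpenACC support.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index into a [nz][ny][nx][nu] array; iu merges energy and angle. */
#pragma acc routine seq
static size_t idx(int ix, int iy, int iz, int iu, int nx, int ny, int nu)
{
    return (((size_t)iz * ny + iy) * nx + ix) * nu + iu;
}

/* Placeholder in-gridcell computation (an assumption, not Minisweep's physics):
 * a unit source plus the mean of the three upstream values. Values outside
 * the grid count as zero. */
#pragma acc routine seq
static double cell_update(const double *v, int ix, int iy, int iz, int iu,
                          int nx, int ny, int nu)
{
    double ux = ix > 0 ? v[idx(ix - 1, iy, iz, iu, nx, ny, nu)] : 0.0;
    double uy = iy > 0 ? v[idx(ix, iy - 1, iz, iu, nx, ny, nu)] : 0.0;
    double uz = iz > 0 ? v[idx(ix, iy, iz - 1, iu, nx, ny, nu)] : 0.0;
    return 1.0 + (ux + uy + uz) / 3.0;
}

/* Reference: plain nested loops, which respect the upstream dependency. */
void sweep_serial(double *v, int nx, int ny, int nz, int nu)
{
    for (int iz = 0; iz < nz; iz++)
        for (int iy = 0; iy < ny; iy++)
            for (int ix = 0; ix < nx; ix++)
                for (int iu = 0; iu < nu; iu++)
                    v[idx(ix, iy, iz, iu, nx, ny, nu)] =
                        cell_update(v, ix, iy, iz, iu, nx, ny, nu);
}

/* Wavefront order: wavefronts run one after another, but the cells inside
 * a wavefront are independent of each other and can run in parallel. */
void sweep_wavefront(double *v, int nx, int ny, int nz, int nu)
{
    size_t n = (size_t)nx * ny * nz * nu;
    assert(v != NULL && n > 0);
    #pragma acc data copy(v[0:n])
    {
        for (int w = 0; w <= nx + ny + nz - 3; w++) {
            #pragma acc parallel loop gang collapse(2) present(v[0:n])
            for (int iz = 0; iz < nz; iz++) {
                for (int iy = 0; iy < ny; iy++) {
                    int ix = w - iy - iz;   /* cell of wavefront w, if in range */
                    if (ix >= 0 && ix < nx) {
                        #pragma acc loop vector
                        for (int iu = 0; iu < nu; iu++)
                            v[idx(ix, iy, iz, iu, nx, ny, nu)] =
                                cell_update(v, ix, iy, iz, iu, nx, ny, nu);
                    }
                }
            }
        }
    }
}
```

An nx x ny x nz grid has nx + ny + nz - 2 wavefronts, so the 64 x 64 x 64 input on slide 28 would take 190 sequential wavefront steps. Calling `sweep_serial` and `sweep_wavefront` on two arrays of the same shape should give identical results, which is a convenient correctness check when restructuring the loops by hand.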

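The face-array idea on slide 24 can be sketched the same way. The example below is a simplified illustration under stated assumptions: one scalar per gridcell, a single sweep direction, and three hypothetical arrays `face_yz`, `face_xz`, and `face_xy` that hold only the most recent value along each axis. `sweep_full_grid` stores every cell and is included only as a reference for comparison.

```c
#include <assert.h>
#include <stdlib.h>

/* Placeholder in-gridcell computation (an assumption): a unit source plus
 * the mean of the three upstream values. */
static double update(double ux, double uy, double uz)
{
    return 1.0 + (ux + uy + uz) / 3.0;
}

/* Reference sweep that stores all nx*ny*nz values.
 * Returns the value of the last cell, or -1.0 if allocation fails. */
double sweep_full_grid(int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    size_t ncell = (size_t)nx * ny * nz;
    double *v = malloc(ncell * sizeof *v);
    if (!v) return -1.0;
    for (int iz = 0; iz < nz; iz++)
        for (int iy = 0; iy < ny; iy++)
            for (int ix = 0; ix < nx; ix++) {
                size_t c = ((size_t)iz * ny + iy) * nx + ix;
                double ux = ix > 0 ? v[c - 1] : 0.0;
                double uy = iy > 0 ? v[c - (size_t)nx] : 0.0;
                double uz = iz > 0 ? v[c - (size_t)nx * ny] : 0.0;
                v[c] = update(ux, uy, uz);
            }
    double last = v[ncell - 1];
    free(v);
    return last;
}

/* Same sweep, keeping only three 2D faces instead of the whole grid.
 * Returns the value of the last cell, or -1.0 if allocation fails. */
double sweep_with_faces(int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    /* calloc zeroes the faces, which supplies the zero boundary values. */
    double *face_yz = calloc((size_t)ny * nz, sizeof *face_yz); /* x-upstream */
    double *face_xz = calloc((size_t)nx * nz, sizeof *face_xz); /* y-upstream */
    double *face_xy = calloc((size_t)nx * ny, sizeof *face_xy); /* z-upstream */
    double last = -1.0;
    if (face_yz && face_xz && face_xy) {
        for (int iz = 0; iz < nz; iz++)
            for (int iy = 0; iy < ny; iy++)
                for (int ix = 0; ix < nx; ix++) {
                    double *fx = &face_yz[(size_t)iz * ny + iy];
                    double *fy = &face_xz[(size_t)iz * nx + ix];
                    double *fz = &face_xy[(size_t)iy * nx + ix];
                    last = update(*fx, *fy, *fz);
                    /* This cell's value is the upstream input of its three
                     * downstream neighbors, so it overwrites the faces. */
                    *fx = *fy = *fz = last;
                }
    }
    free(face_yz);
    free(face_xz);
    free(face_xy);
    return last;
}
```

For a 64 x 64 x 64 grid, the three faces hold 3 x 4,096 = 12,288 values instead of 262,144, and each cell reaches its upstream data with a plain 2D index. Within one wavefront no two cells share a face entry, so the same layout should also work when the cells of a wavefront are processed in parallel.
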
Editor's notes

  1. Simple Parallel: loop parallelism, recursion, task, master/worker, pipeline
  2. Robbie – I changed "implementation" to "proof-of-concept". Third bullet: currently only slide 26 covers this bullet, doesn’t it? That is not enough. Think about adding CUDA and OpenMP limitations as 2 slides or in 1 slide, then this bullet will be covered.
  3. Mention that OpenMP (now at 4.5) has supported offloading since 2013, but that is not what it was originally intended for.
  4. Robbie – I removed SummitDev; you can mention it. Our focus on this slide is Summit and the TOP500. Since you are presenting in Europe, I added ORNL to the machine names, because not everyone is going to put 2 and 2 together that Titan and Beacon were ORNL machines.
  5. EXPLAIN GANG & VECTOR Talk about how to map wavefront using gang/vector
  6. EXPLAIN GANG & VECTOR
  7. EXPLAIN GANG & VECTOR
  8. Go through the motions of analyzing a complex parallel pattern. Come up with an abstraction for the pattern in question. Robbie – I edited the bullet on polyhedral: polyhedral transformation tools require code refactoring and only support affine loops. For your notes: the typical limitation of the polyhedral model is essentially anything that makes the code non-affine. This includes array indirections, data-dependent conditionals, linearized arrays, and use of pointers. These limitations need to be addressed before broadening the applicability of polyhedral transformations.
  9. Go through the motions of analyzing a complex parallel pattern Come up with an abstraction for the pattern in question
  10. EXPLAIN FACE ARRAYS! Optimal memory size implies brutal index calculations.
  11. Note that Minisweep actually runs 8 sweeps (one per direction). Need a 3rd layer of parallelism for this!
  12. Robbie – didn’t we have results on SummitDev? It is in the paper, so add that to the second bullet? It’s OK if you are showing Titan results in your chart. Keep the absolute runtime numbers on SummitDev as a backup slide. Someone might ask about absolute runtime, as people don’t like speedup charts.
  13. .
  14. SC: After this slide, bring back the contribution slide, and we have to discuss what we have contributed to these proposed bullets, because a contribution slide means we have contributed something. So how about:
      • Creating a standardized high-level abstraction for complex parallel patterns
      • Evaluating the abstraction on multiple state-of-the-art platforms to study performance/portability
      • Applying the abstraction to real-world scientific applications of interest to Exascale projects
  15. Robbie – this can be your last slide and then switch to the contribution slide and then put the next slide up while taking questions.
  16. Center for Accelerated Application Readiness (CAAR). VASP: materials science. Gaussian: quantum chemistry. ANSYS Fluent: CFD. Just keep this slide; in case someone asks who is using OpenACC, you can bring it up pretty quickly.
  17. Denovo is used in a current DOE INCITE project to model the International Thermonuclear Experimental Reactor (ITER) fusion reactor and is the primary radiation transport code used in the multi-industry Consortium for Advanced Simulation of Light Water Reactors (CASL) [2], headquartered at Oak Ridge National Laboratory (ORNL). The impact of the Fukushima nuclear accident in Japan highlights the need to increase the production of safe, abundant quantities of energy. A transport solver that incorporates an adequately resolved discretization for all scales using current discrete models would require 10^17 to 10^21 degrees of freedom per step, which is well beyond even exascale resources.
      Space: in the x, y, and z dimensions, each gridcell depends on results from three upstream cells, based on octant direction, resulting in a wavefront problem.
      Octant: results for different octants can be calculated independently. The only dependency is that results from different octants at the same entry of v_out are added together and are thus a race hazard, depending on how the computation is done.
      Energy: computations for different energy groups have no coupling and can be considered separate problem instances, enabling flexible parallelization.
      Minisweep is one of the six applications selected for the ORNL CAAR project. It is similar to other Sn wavefront codes, including Kripke, the Sn (Discrete Ordinates) Application Proxy (SNAP), and PARTISN.
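
Notes 11 and 17 mention eight sweeps, one per octant, whose results are added into the same output entries. The sketch below shows how the begin/end/increment loop form from slide 22 could be derived from an octant number. The bit encoding (bit 0, 1, 2 selects the x, y, z direction), the names `axis_bounds` and `sweep_all_octants`, and the placeholder cell computation are all assumptions made for illustration. Octants run one after another here, which sidesteps the race hazard on the accumulated output; running them concurrently would need an atomic update or a reduction.

```c
#include <assert.h>
#include <stdlib.h>

/* Loop bounds for one axis in begin/end/increment form. */
typedef struct { int beg, end, inc; } axis_range;

/* descending == 0 sweeps 0..n-1, otherwise n-1..0. */
axis_range axis_bounds(int n, int descending)
{
    assert(n > 0);
    axis_range r;
    if (!descending) { r.beg = 0;     r.end = n;  r.inc = 1;  }
    else             { r.beg = n - 1; r.end = -1; r.inc = -1; }
    return r;
}

static size_t cell(int ix, int iy, int iz, int nx, int ny)
{
    return ((size_t)iz * ny + iy) * nx + ix;
}

/* Sweeps the grid once per octant and accumulates every octant's result
 * into vo. Returns 0 on success, -1 if the scratch allocation fails. */
int sweep_all_octants(double *vo, int nx, int ny, int nz)
{
    size_t ncell = (size_t)nx * ny * nz;
    double *v = malloc(ncell * sizeof *v);   /* per-octant scratch values */
    if (!v) return -1;
    for (size_t c = 0; c < ncell; c++) vo[c] = 0.0;

    for (int octant = 0; octant < 8; octant++) {
        axis_range x = axis_bounds(nx, octant & 1);
        axis_range y = axis_bounds(ny, (octant >> 1) & 1);
        axis_range z = axis_bounds(nz, (octant >> 2) & 1);
        for (int iz = z.beg; iz != z.end; iz += z.inc)
        for (int iy = y.beg; iy != y.end; iy += y.inc)
        for (int ix = x.beg; ix != x.end; ix += x.inc) {
            /* Upstream neighbors lie opposite to the sweep direction. */
            int px = ix - x.inc, py = iy - y.inc, pz = iz - z.inc;
            double ux = (px >= 0 && px < nx) ? v[cell(px, iy, iz, nx, ny)] : 0.0;
            double uy = (py >= 0 && py < ny) ? v[cell(ix, py, iz, nx, ny)] : 0.0;
            double uz = (pz >= 0 && pz < nz) ? v[cell(ix, iy, pz, nx, ny)] : 0.0;
            size_t c = cell(ix, iy, iz, nx, ny);
            v[c] = 1.0 + (ux + uy + uz) / 3.0;   /* placeholder computation */
            vo[c] += v[c];                       /* shared across octants */
        }
    }
    free(v);
    return 0;
}
```

Because the eight directions mirror each other, opposite corners of a cubic grid should end up with the same accumulated value up to rounding, which gives a quick sanity check for the direction handling.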