SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Tim Costa, NVIDIA HPC SW Product Manager
Jeff Larkin, Sr. DevTech Software Engineer
UK OpenMP Users Group, December 2020
BEST PRACTICES FOR
OPENMP ON GPUS
2
THE FUTURE OF PARALLEL PROGRAMMING
Standard Languages | Directives | Specialized Languages
Maximize Performance with
Specialized Languages & Intrinsics
Drive Base Languages to Better
Support Parallelism
Augment Base Languages with
Directives
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
!$omp target data map(x,y)
...
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
...
!$omp end target data
attribute(global)
subroutine saxpy(n, a, x, y) {
int i = blockIdx%x*blockDim%x +
threadIdx%x;
if (i < n) y(i) += a*x(i)
}
program main
real :: x(:), y(:)
real,device :: d_x(:), d_y(:)
d_x = x
d_y = y
call saxpy
<<<(N+255)/256,256>>>(...)
y = d_y
std::for_each_n(POL, idx(0), n,
[&](Index_t i){
y[i] += a*x[i];
});
3
THE ROLE OF DIRECTIVES FOR
PARALLEL PROGRAMMING
Serial Programming
Languages
Parallel
Programming
Languages
Directives convey additional information to the compiler.
4
AVAILABLE NOW: THE NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, and in the Cloud
Develop for the NVIDIA HPC Platform: GPU, CPU and Interconnect
HPC Libraries | GPU Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available
Compilers
nvcc nvc
nvc++
nvfortran
Programming
Models
Standard C++ & Fortran
OpenACC & OpenMP
CUDA
Core
Libraries
libcu++
Thrust
CUB
Math
Libraries
cuBLAS cuTENSOR
cuSPARSE cuSOLVER
cuFFT cuRAND
Communication
Libraries
Open MPI
NVSHMEM
NCCL
DEVELOPMENT
Profilers
Nsight
Systems
Compute
Debugger
cuda-gdb
Host
Device
ANALYSIS
NVIDIA HPC SDK
5
HPC COMPILERS
NVC | NVC++ | NVFORTRAN
Programmable
Standard Languages
Directives
CUDA
Multicore
Directives
Vectorization
Multi-Platform
x86_64
Arm
OpenPOWER
Accelerated
Latest GPUs
Automatic Acceleration
*+=
6
BEST PRACTICES FOR OPENMP ON GPUS
Always use the teams and distribute directive to expose all available parallelism
Aggressively collapse loops to increase available parallelism
Use the target data directive and map clauses to reduce data movement between CPU and GPU
Use accelerated libraries whenever possible
Use OpenMP tasks to go asynchronous and better utilize the whole system
Use host fallback (if clause) to generate host and device code
7
Expose More
Parallelism
Many codes already have Parallel Worksharing loops
(Parallel For/Parallel Do); Isn’t that enough?
Parallel For creates a single contention group of
threads with a shared view of memory and the ability to
coordinate and synchronize.
This structure limits the degree of parallelism that a GPU
can exploit and doesn’t exploit many GPU advantages
Parallel For Isn’t Enough
#pragma omp parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
OMP PARALLEL
OMP FOR
ThreadTeam
8
Expose More
Parallelism
The Teams Distribute directives exposes coarse-
grained, scalable parallelism, generating more parallelism.
Parallel For is still used to engage the threads within
each team.
Coarse and Fine-grained parallelism may be combined (A)
or split (B).
How do I know which will be better for my code &
machine?
Use Teams Distribute
#pragma omp target teams distribute 
parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
A
#pragma omp target teams distribute 
reduction(max:error)
for( int j = 1; j < n-1; j++) {
#pragma parallel for reduction(max:error)
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
B
OMP TEAMS
OMP DISTRIBUTE
OMP FOR
9
Expose Even More
Parallelism!
Collapsing loops increases the parallelism available for the
compiler to exploit.
It’s likely possible for this code to perform equally well to
parallel for on the CPU and still exploit more parallelism
on a GPU.
Collapsing does erase possible gains from locality (maybe
tile can help?)
Not all loops can be collapsed
Use Collapse Clause
Aggressively
#pragma omp target teams distribute 
parallel for reduction(max:error) 
collapse(2)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] -
A[j][i]));
}
}
10
Optimize Data
Motion
The more loops you move to the device, the more data
movement you may introduce.
Compilers will always choose correctness over
performance, so the programmer can often do better.
Target Data directives and map clauses enable the
programmer to take control of data movement.
On Unified Memory machines, locality may be good
enough, but optimizing the data movement is more
portable.
Note: I’ve seen cases where registering arrays with CUDA
may further improve remaining data movement.
Use Target Data Mapping
#pragma omp target data map(to:Anew) map(tofrom:A)
while ( error > tol && iter < iter_max )
{
… // Use A and Anew in target directives
}
11
Accelerated
Libraries & OpenMP
Accelerated libraries are free performance
The use_device_ptr clause enables sharing data
from OpenMP to CUDA.
OpenMP 5.1 adds the interop construct to enable
sharing CUDA Streams
Eat your free lunch
#pragma omp target data map(alloc:x[0:n],y[0:n])
{
#pragma omp target teams distribute parallel for
for( i = 0; i < n; i++)
{
x[i] = 1.0f;
y[i] = 0.0f;
}
#pragma omp target data use_device_ptr(x,y)
{
cublasSaxpy(n, 2.0, x, 1, y, 1);
// Synchronize before using results
}
#pragma omp target update from(y[0:n])
}
12
Tasking & GPUs
Data copies, GPU Compute, and CPU compute
can all overlap.
OpenMP uses its existing tasking framework to
manage asynchronous execution and
dependencies.
Your mileage may vary regarding how well the
runtime maps OpenMP tasks to CUDA streams.
Get it working synchronously first!
Keep the whole system busy
#pragma omp target data map(alloc:a[0:N],b[0:N])
{
#pragma omp target update to(a[0:N]) nowait depend(inout:a)
#pragma omp target update to(b[0:N]) nowait depend(inout:b)
#pragma omp target teams distribute parallel for 
nowait depend(inout:a)
for(int i=0; i<N; i++ )
{
a[i] = 2.0f * a[i];
}
#pragma omp target teams distribute parallel for 
nowait depend(inout:a,b)
for(int i=0; i<N; i++ )
{
b[i] = 2.0f * a[i];
}
#pragma omp target update from(b[0:N]) nowait depend(inout:b)
#pragma omp taskwait depend(inout:b)
}
13
Host Fallback
The if clause can be used to selectively turn off
OpenMP offloading.
Unless the compiler knows at compile-time the
value of the if clause, both code paths are
generated
For simple loops, written to use the previous
guidelines, it may be possible to have unified
CPU & GPU code (no guarantee it will be
optimal)
The “if” clause
#pragma omp target teams distribute 
parallel for reduction(max:error) 
collapse(2) if(target:USE_GPU)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
14
The Loop Directive
The Loop directive was added in 5.0 as a
descriptive option for programming.
Loop asserts the ability of a loop to be run in any
order, including concurrently.
As more compilers & applications adopt it, we
hope it will enable more performance
portability.
Give descriptiveness a try
#pragma omp target teams loop 
reduction(max:error) 
collapse(2)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
15
BEST PRACTICES FOR OPENMP ON GPUS
Always use the teams and distribute directive to expose all available parallelism
Aggressively collapse loops to increase available parallelism
Use the target data directive and map clauses to reduce data movement between CPU and GPU
Use accelerated libraries whenever possible
Use OpenMP tasks to go asynchronous and better utilize the whole system
Use host fallback (if clause) to generate host and device code
Bonus: Give loop a try

Weitere ähnliche Inhalte

Was ist angesagt?

190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pubJaewook. Kang
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Jyotirmoy Sundi
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMPjbp4444
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeRizwan Habib
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Databricks
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Preferred Networks
 
PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019Masashi Shibata
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to ChainerSeiya Tokui
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNetAI Frontiers
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons LearnedOptimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons LearnedIntel IT Center
 
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlowHorovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlowDatabricks
 
Profiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & SustainabilityProfiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & Sustainabilitygeetachauhan
 
Intro to Scalable Deep Learning on AWS with Apache MXNet
Intro to Scalable Deep Learning on AWS with Apache MXNetIntro to Scalable Deep Learning on AWS with Apache MXNet
Intro to Scalable Deep Learning on AWS with Apache MXNetAmazon Web Services
 
Introduction to Polyaxon
Introduction to PolyaxonIntroduction to Polyaxon
Introduction to PolyaxonYu Ishikawa
 
Concurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionConcurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionCherryBerry2
 

Was ist angesagt? (20)

190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMP
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKee
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 
OpenMP
OpenMPOpenMP
OpenMP
 
PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to Chainer
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons LearnedOptimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
 
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlowHorovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
 
Profiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & SustainabilityProfiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & Sustainability
 
Intro to Scalable Deep Learning on AWS with Apache MXNet
Intro to Scalable Deep Learning on AWS with Apache MXNetIntro to Scalable Deep Learning on AWS with Apache MXNet
Intro to Scalable Deep Learning on AWS with Apache MXNet
 
Introduction to Polyaxon
Introduction to PolyaxonIntroduction to Polyaxon
Introduction to Polyaxon
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 
Concurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionConcurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System Discussion
 
Open mp
Open mpOpen mp
Open mp
 
Open mp directives
Open mp directivesOpen mp directives
Open mp directives
 

Ähnlich wie Best Practices for OpenMP on GPUs - OpenMP UK Users Group

Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
MapReduce: teoria e prática
MapReduce: teoria e práticaMapReduce: teoria e prática
MapReduce: teoria e práticaPET Computação
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisFastly
 
Hybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemHybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemCSCJournals
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsZvi Avraham
 
Program Assignment Process ManagementObjective This program a.docx
Program Assignment  Process ManagementObjective This program a.docxProgram Assignment  Process ManagementObjective This program a.docx
Program Assignment Process ManagementObjective This program a.docxwkyra78
 
HSA HSAIL Introduction Hot Chips 2013
HSA HSAIL Introduction  Hot Chips 2013 HSA HSAIL Introduction  Hot Chips 2013
HSA HSAIL Introduction Hot Chips 2013 HSA Foundation
 
clWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUclWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUJohn Colvin
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16MLconf
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSPeterAndreasEntschev
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomFlame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomValeriy Kravchuk
 

Ähnlich wie Best Practices for OpenMP on GPUs - OpenMP UK Users Group (20)

Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC cluster
 
MapReduce: teoria e prática
MapReduce: teoria e práticaMapReduce: teoria e prática
MapReduce: teoria e prática
 
parallel-computation.pdf
parallel-computation.pdfparallel-computation.pdf
parallel-computation.pdf
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic Analysis
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Hybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemHybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing System
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
Parallel computation
Parallel computationParallel computation
Parallel computation
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
Program Assignment Process ManagementObjective This program a.docx
Program Assignment  Process ManagementObjective This program a.docxProgram Assignment  Process ManagementObjective This program a.docx
Program Assignment Process ManagementObjective This program a.docx
 
HSA HSAIL Introduction Hot Chips 2013
HSA HSAIL Introduction  Hot Chips 2013 HSA HSAIL Introduction  Hot Chips 2013
HSA HSAIL Introduction Hot Chips 2013
 
clWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUclWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPU
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Matlab ppt
Matlab pptMatlab ppt
Matlab ppt
 
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomFlame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
 

Mehr von Jeff Larkin

FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesJeff Larkin
 
SC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIASC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIAJeff Larkin
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesJeff Larkin
 
Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Jeff Larkin
 
Progress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEProgress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEJeff Larkin
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialJeff Larkin
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingJeff Larkin
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
A Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsA Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsJeff Larkin
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best PracticesJeff Larkin
 
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Jeff Larkin
 

Mehr von Jeff Larkin (13)

FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
 
SC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIASC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIA
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid Architectures
 
Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7
 
Progress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEProgress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SE
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorial
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU Computing
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
A Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsA Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming Models
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best Practices
 
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
 

Kürzlich hochgeladen

Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 

Kürzlich hochgeladen (20)

Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 

Best Practices for OpenMP on GPUs - OpenMP UK Users Group

  • 1. Tim Costa, NVIDIA HPC SW Product Manager Jeff Larkin, Sr. DevTech Software Engineer UK OpenMP Users Group, December 2020 BEST PRACTICES FOR OPENMP ON GPUS
  • 2. 2 THE FUTURE OF PARALLEL PROGRAMMING Standard Languages | Directives | Specialized Languages Maximize Performance with Specialized Languages & Intrinsics Drive Base Languages to Better Support Parallelism Augment Base Languages with Directives do concurrent (i = 1:n) y(i) = y(i) + a*x(i) enddo !$omp target data map(x,y) ... do concurrent (i = 1:n) y(i) = y(i) + a*x(i) enddo ... !$omp end target data attribute(global) subroutine saxpy(n, a, x, y) { int i = blockIdx%x*blockDim%x + threadIdx%x; if (i < n) y(i) += a*x(i) } program main real :: x(:), y(:) real,device :: d_x(:), d_y(:) d_x = x d_y = y call saxpy <<<(N+255)/256,256>>>(...) y = d_y std::for_each_n(POL, idx(0), n, [&](Index_t i){ y[i] += a*x[i]; });
  • 3. 3 THE ROLE OF DIRECTIVES FOR PARALLEL PROGRAMMING Serial Programming Languages Parallel Programming Languages Directives convey additional information to the compiler.
  • 4. 4 AVAILABLE NOW: THE NVIDIA HPC SDK Available at developer.nvidia.com/hpc-sdk, on NGC, and in the Cloud Develop for the NVIDIA HPC Platform: GPU, CPU and Interconnect HPC Libraries | GPU Accelerated C++ and Fortran | Directives | CUDA 7-8 Releases Per Year | Freely Available Compilers nvcc nvc nvc++ nvfortran Programming Models Standard C++ & Fortran OpenACC & OpenMP CUDA Core Libraries libcu++ Thrust CUB Math Libraries cuBLAS cuTENSOR cuSPARSE cuSOLVER cuFFT cuRAND Communication Libraries Open MPI NVSHMEM NCCL DEVELOPMENT Profilers Nsight Systems Compute Debugger cuda-gdb Host Device ANALYSIS NVIDIA HPC SDK
  • 5. 5 HPC COMPILERS NVC | NVC++ | NVFORTRAN Programmable Standard Languages Directives CUDA Multicore Directives Vectorization Multi-Platform x86_64 Arm OpenPOWER Accelerated Latest GPUs Automatic Acceleration *+=
  • 6. 6 BEST PRACTICES FOR OPENMP ON GPUS Always use the teams and distribute directive to expose all available parallelism Aggressively collapse loops to increase available parallelism Use the target data directive and map clauses to reduce data movement between CPU and GPU Use accelerated libraries whenever possible Use OpenMP tasks to go asynchronous and better utilize the whole system Use host fallback (if clause) to generate host and device code
  • 7. 7 Expose More Parallelism Many codes already have Parallel Worksharing loops (Parallel For/Parallel Do); Isn’t that enough? Parallel For creates a single contention group of threads with a shared view of memory and the ability to coordinate and synchronize. This structure limits the degree of parallelism that a GPU can exploit and doesn’t exploit many GPU advantages Parallel For Isn’t Enough #pragma omp parallel for reduction(max:error) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } } OMP PARALLEL OMP FOR ThreadTeam
  • 8. 8 Expose More Parallelism The Teams Distribute directives exposes coarse- grained, scalable parallelism, generating more parallelism. Parallel For is still used to engage the threads within each team. Coarse and Fine-grained parallelism may be combined (A) or split (B). How do I know which will be better for my code & machine? Use Teams Distribute #pragma omp target teams distribute parallel for reduction(max:error) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } } A #pragma omp target teams distribute reduction(max:error) for( int j = 1; j < n-1; j++) { #pragma parallel for reduction(max:error) for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } } B OMP TEAMS OMP DISTRIBUTE OMP FOR
  • 9. 9 Expose Even More Parallelism! Collapsing loops increases the parallelism available for the compiler to exploit. It’s likely possible for this code to perform equally well to parallel for on the CPU and still exploit more parallelism on a GPU. Collapsing does erase possible gains from locality (maybe tile can help?) Not all loops can be collapsed Use Collapse Clause Aggressively #pragma omp target teams distribute parallel for reduction(max:error) collapse(2) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } }
  • 10. 10 Optimize Data Motion The more loops you move to the device, the more data movement you may introduce. Compilers will always choose correctness over performance, so the programmer can often do better. Target Data directives and map clauses enable the programmer to take control of data movement. On Unified Memory machines, locality may be good enough, but optimizing the data movement is more portable. Note: I’ve seen cases where registering arrays with CUDA may further improve remaining data movement. Use Target Data Mapping #pragma omp target data map(to:Anew) map(tofrom:A) while ( error > tol && iter < iter_max ) { … // Use A and Anew in target directives }
  • 11. 11 Accelerated Libraries & OpenMP Accelerated libraries are free performance The use_device_ptr clause enables sharing data from OpenMP to CUDA. OpenMP 5.1 adds the interop construct to enable sharing CUDA Streams Eat your free lunch #pragma omp target data map(alloc:x[0:n],y[0:n]) { #pragma omp target teams distribute parallel for for( i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 0.0f; } #pragma omp target data use_device_ptr(x,y) { cublasSaxpy(n, 2.0, x, 1, y, 1); // Synchronize before using results } #pragma omp target update from(y[0:n]) }
  • 12. 12 Tasking & GPUs Data copies, GPU Compute, and CPU compute can all overlap. OpenMP uses its existing tasking framework to manage asynchronous execution and dependencies. Your mileage may vary regarding how well the runtime maps OpenMP tasks to CUDA streams. Get it working synchronously first! Keep the whole system busy #pragma omp target data map(alloc:a[0:N],b[0:N]) { #pragma omp target update to(a[0:N]) nowait depend(inout:a) #pragma omp target update to(b[0:N]) nowait depend(inout:b) #pragma omp target teams distribute parallel for nowait depend(inout:a) for(int i=0; i<N; i++ ) { a[i] = 2.0f * a[i]; } #pragma omp target teams distribute parallel for nowait depend(inout:a,b) for(int i=0; i<N; i++ ) { b[i] = 2.0f * a[i]; } #pragma omp target update from(b[0:N]) nowait depend(inout:b) #pragma omp taskwait depend(inout:b) }
  • 13. 13 Host Fallback The if clause can be used to selectively turn off OpenMP offloading. Unless the compiler knows at compile-time the value of the if clause, both code paths are generated For simple loops, written to use the previous guidelines, it may be possible to have unified CPU & GPU code (no guarantee it will be optimal) The “if” clause #pragma omp target teams distribute parallel for reduction(max:error) collapse(2) if(target:USE_GPU) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } }
  • 14. 14 The Loop Directive The Loop directive was added in 5.0 as a descriptive option for programming. Loop asserts the ability of a loop to be run in any order, including concurrently. As more compilers & applications adopt it, we hope it will enable more performance portability. Give descriptiveness a try #pragma omp target teams loop reduction(max:error) collapse(2) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } }
  • 15. 15 BEST PRACTICES FOR OPENMP ON GPUS Always use the teams and distribute directive to expose all available parallelism Aggressively collapse loops to increase available parallelism Use the target data directive and map clauses to reduce data movement between CPU and GPU Use accelerated libraries whenever possible Use OpenMP tasks to go asynchronous and better utilize the whole system Use host fallback (if clause) to generate host and device code Bonus: Give loop a try