11.
PARTNER LIBRARIES
Library domains shown: computer vision, sparse direct solvers, sparse iterative methods, linear algebra, graph algorithms, audio and video, math/signal/image processing, computational geometry, real-time visual simulation
18.
OPENACC
Program myscience
... serial code ...
!$acc kernels
do k = 1,n1
do i = 1,n2
... parallel code ...
enddo
enddo
!$acc end kernels
... serial code ...
End Program myscience
CPU
GPU
Existing C/Fortran code
Easy: add compiler hints to the existing code
Powerful: with modest effort, the compiler parallelizes the code automatically
Open: multiple compiler vendors support a variety of processors
(NVIDIA GPU, AMD GPU, x86 CPU, ARM CPU)
Add hints
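The skeleton above can be rendered as compilable C. This is a minimal sketch, not from the slides: the function name `compute`, the array `grid`, and the loop body are placeholders; with a compiler that does not understand OpenACC, the pragma is simply ignored and the loop nest runs serially with the same result.

```c
/* C rendering of the slide's Fortran skeleton: serial code surrounds a
   hinted loop nest; the !$acc kernels region maps to a kernels pragma.
   compute, grid, and the body are illustrative placeholders. */
void compute(int n1, int n2, float grid[n1][n2])
{
    /* ... serial code ... */
    #pragma acc kernels copy(grid[:n1][:n2])
    for (int k = 0; k < n1; ++k)
        for (int i = 0; i < n2; ++i)
            grid[k][i] = grid[k][i] * 0.5f;  /* parallel code */
    /* ... serial code ... */
}
```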
20.
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma acc parallel loop copy(y[:n]) copyin(x[:n])
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
saxpy(N, 3.0, x, y);
...
SAXPY (y=a*x+y)
omp → acc: add data-movement clauses
OpenMP / OpenACC
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma omp parallel for
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
saxpy(N, 3.0, x, y);
...
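The OpenACC version above can be exercised from ordinary host code. Below is a minimal sketch (the name `saxpy_acc` is a renamed copy for illustration, not from the slides); with a non-OpenACC compiler the pragma is ignored and the loop runs serially, producing the same numerical result.

```c
/* OpenACC saxpy from the slide: copy(y) moves y to the device and back,
   copyin(x) moves x to the device only. With a plain C compiler the
   pragma is ignored and this is a serial saxpy. */
void saxpy_acc(int n, float a, float *x, float *restrict y)
{
    #pragma acc parallel loop copy(y[:n]) copyin(x[:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```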
21.
subroutine saxpy(n, a, X, Y)
real :: a, X(:), Y(:)
integer :: n, i
!$acc parallel loop copy(Y(:)) copyin(X(:))
do i=1,n
Y(i) = a*X(i)+Y(i)
enddo
!$acc end parallel loop
end subroutine saxpy
...
call saxpy(N, 3.0, x, y)
...
SAXPY (y=a*x+y, FORTRAN)
The same applies to Fortran
OpenMP / OpenACC
subroutine saxpy(n, a, X, Y)
real :: a, X(:), Y(:)
integer :: n, i
!$omp parallel do
do i=1,n
Y(i) = a*X(i)+Y(i)
enddo
!$omp end parallel do
end subroutine saxpy
...
call saxpy(N, 3.0, x, y)
...
22.
void saxpy(int n, float a,
float *x,
float *restrict y)
{
#pragma acc parallel loop copy(y[:n]) copyin(x[:n])
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
saxpy(N, 3.0, x, y);
...
OPENACC: PARALLEL DIRECTIVE
Marks the region to be parallelized
23.
void saxpy(int n, float a,
float *x,
float *restrict y)
{
#pragma acc kernels copy(y[:n]) copyin(x[:n])
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
saxpy(N, 3.0, x, y);
...
OPENACC: KERNELS DIRECTIVE
Marks the region to be parallelized
Parallel: user-driven (the programmer asserts the region is parallel)
Kernels: compiler-driven (the compiler decides what to parallelize)
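The parallel-vs-kernels contrast can be sketched in code. This example is illustrative (the function name `vec_scale_shift` is not from the slides): inside a `kernels` region the compiler analyzes each loop and may turn each into its own device kernel, whereas a `parallel` region is the programmer's assertion that the enclosed code is safe to run in parallel. With a non-OpenACC compiler the pragma is ignored and the loops run serially.

```c
/* One kernels region containing two independent loops; an OpenACC
   compiler may generate a separate device kernel for each loop. */
void vec_scale_shift(int n, float a, float b,
                     float *restrict x, float *restrict y)
{
    #pragma acc kernels copy(x[:n], y[:n])
    {
        for (int i = 0; i < n; ++i)   /* candidate kernel 1: x *= a   */
            x[i] *= a;
        for (int i = 0; i < n; ++i)   /* candidate kernel 2: y = x + b */
            y[i] = x[i] + b;
    }
}
```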
31.
Janus Juul Eriksen, PhD Fellow
qLEAP Center for Theoretical Chemistry, Aarhus University
“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”
LSDALTON
Large-scale application for calculating high-accuracy molecular energies
Minimal Effort
Lines of Code Modified: <100 lines
# of Weeks Required: 1 week
# of Codes to Maintain: 1 source
Big Performance
Speedup vs CPU: 7.9x (Alanine-1, 13 atoms), 8.9x (Alanine-2, 23 atoms), 11.7x (Alanine-3, 33 atoms)
LS-DALTON CCSD(T) Module
Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)
https://developer.nvidia.com/openacc/success-stories
32.
Brad Sutton, Associate Professor of Bioengineering and
Technical Director of the Biomedical Imaging Center
University of Illinois at Urbana-Champaign
“Now that we’ve seen how easy it is to program the GPU using OpenACC and the PGI compiler, we’re looking forward to translating more of our projects”
CHALLENGE
• Produce detailed and accurate brain images by applying
computationally intensive algorithms to MRI data
• Reduce reconstruction time to make diagnostic use possible
SOLUTION
• Accelerated MRI reconstruction application with OpenACC
using NVIDIA GPUs
RESULTS
• Reduced reconstruction time for a single high-resolution MRI
scan from 40 days to a couple of hours
• Scaled on Blue Waters at NCSA to reconstruct 3000 images in
under 24 hours
POWERGRID
Advanced MRI Reconstruction Model
https://developer.nvidia.com/openacc/success-stories
33.
“OpenACC enabled us to target routines for GPU acceleration without rewriting code, allowing us to maintain portability on a code that is 20 years old”
David Gutzwiller, Head of HPC
NUMECA International
CHALLENGE
• Accelerate a 20-year-old, highly optimized code
on GPUs
SOLUTION
• Accelerated computationally intensive routines
with OpenACC
RESULTS
• Achieved 10x or higher speed-up on key
routines
• Full app speedup of up to 2x on the Oak Ridge
Titan supercomputer
• Total time spent on optimizing various routines
with OpenACC was just five person-months
NUMECA FINE/TURBO
Commercial CFD Application
https://developer.nvidia.com/openacc/success-stories
34.
“The most significant result from our performance studies is faster computation with less energy consumption compared with our CPU-only runs.”
Dr. Misun Min, Computation Scientist
Argonne National Laboratory
CHALLENGE
• Enable NekCEM to deliver strong scaling on
next-generation architectures while
maintaining portability
SOLUTION
• Use OpenACC to port the entire program
to GPUs
RESULTS
• 2.5x speedup over a highly tuned CPU-only
version
• GPU used only 39 percent of the energy
needed by 16 CPUs to do the same
computation in the same amount of time
NEKCEM
Computational Electromagnetics Application
https://developer.nvidia.com/openacc/success-stories
35.
Adam Jacobs, PhD candidate in the Department of Physics and
Astronomy at Stony Brook University
“On the reactions side, accelerated calculations allow us to model larger networks of nuclear reactions for similar computational costs as the simple networks we model now”
MAESTRO & CASTRO
Astrophysics 3D Simulation
CHALLENGE
• Pursuing strong scaling and portability to run
code on GPU-powered supercomputers
SOLUTION
• OpenACC compiler on OLCF’s Titan
supercomputer
RESULTS
• 4.4x faster reactions than on a multi-core
computer with 16 cores
• Accelerated calculations allow modeling larger
networks of nuclear reactions for similar
computational costs as simple networks
2 weeks to learn OpenACC, 2 weeks to modify the code
https://developer.nvidia.com/openacc/success-stories
36.
Wayne Gaudin and Oliver Perks
Atomic Weapons Establishment, UK
“We were extremely impressed that we can run OpenACC on a CPU with no code change and get equivalent performance to our OpenMP/MPI implementation.”
CLOVERLEAF
Performance Portability for a Hydrodynamics Application
Speedup vs 1 CPU core (chart)
Benchmarked on Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz; accelerator: Tesla K80
CHALLENGE
• Application code that runs across architectures
without compromising on performance
SOLUTION
• Use OpenACC to port the CloverLeaf mini app
to GPUs and then recompile and run the same
source code on multi-core CPUs
RESULTS
• Same performance as optimized OpenMP
version on x86 CPU
• 4x faster performance using the same code on
a GPU
https://developer.nvidia.com/openacc/success-stories
37.
Lixiang Luo, Researcher
Aerospace Engineering Computational Fluid Dynamics Laboratory
North Carolina State University (NCSU)
“OpenACC is a highly effective tool for programming fully implicit CFD solvers on GPU to achieve true 4X speedup”
* CPU speedup on 6 cores of a Xeon E5645; with additional cores, performance degrades due to partitioning and MPI overheads
CHALLENGE
• Accelerate a complex implicit CFD solver on GPU
SOLUTION
• Used OpenACC to run solver on GPU with
minimal code changes.
RESULTS
• Achieved up to 4X speedups on Tesla GPU over
parallel MPI implementation on x86 multicore
processor
• Structured approach to parallelism provided by
OpenACC allowed better algorithm design
without worrying about GPU architecture
INCOMP3D
3D Fully Implicit CFD Solver
https://developer.nvidia.com/openacc/success-stories
38.
Earthquake Simulation (RIKEN and the Earthquake Research Institute, The University of Tokyo)
WACCPD 2016 (a workshop co-located with SC16): Best Paper Award
Cited from http://waccpd.org/wp-content/uploads/2016/04/SC16_WACCPD_fujita.pdf