By Hitoshi Murai, RIKEN AICS
For higher performance and productivity on HPC systems, it is important to provide users with a good programming environment, including languages, compilers, and tools. This talk presents the programming model of the post-K supercomputer.
Hitoshi Murai Bio
Hitoshi Murai received a master's degree in information science from Kyoto University in 1996. He worked as a software developer at NEC from 1996 to 2010, and received a Ph.D. degree in computer science from the University of Tsukuba in 2010. He is currently a research scientist in the Programming Environment Research Team and the Flagship 2020 project at the RIKEN Advanced Institute for Computational Science (AICS). His research interests include compilers and parallel programming languages.
Email
h-murai@riken.jp
For more information on the Linaro High Performance Computing (HPC) SIG, visit https://www.linaro.org/sig/hpc/
Programming Languages & Tools for Higher Performance & Productivity
1. Programming Languages & Tools for Higher Performance & Productivity
Hitoshi Murai (RIKEN)
Shun Kamatsuka (Fujitsu)
Tomotake Nakamura (Fujitsu)
Dec. 13, 2017, ARM HPC Workshop
2. Introduction of this Session
• For higher performance and productivity on HPC systems, programming environments play a crucial role:
  ⦁ languages
  ⦁ compilers
  ⦁ tools
  ⦁ libraries
• RIKEN AICS and Fujitsu are collaborating to design the programming environment of the upcoming post-K computer.
3. Agenda of this Session
1. XcalableMP PGAS Language
  ⦁ by Hitoshi Murai
2. Advantages of the Compiler for the Post-K Computer
  ⦁ by Shun Kamatsuka
3. Overview of Programming Assistance Tools for the Post-K Computer
  ⦁ by Tomotake Nakamura
5. Introduction
• Message Passing Interface (MPI) is a de-facto standard for programming distributed-memory HPC systems.
• Programming with MPI is very hard work.
We are developing the XcalableMP (XMP) PGAS language, which could provide both high performance and productivity, for post-K.
6. What's PGAS?
• Partitioned Global Address Space
• "Global"
  ⦁ All processes or threads share one address space and can access any data in it.
• "Partitioned"
  ⦁ Remote and local data are distinguished and may differ in access method and cost.
[Figure: processes p0 to p3, each with its own private address space, sharing a partitioned global (PGAS) address space.]
7. What's XcalableMP?
• A directive-based PGAS language
  ⦁ An extension of C/Fortran (a minimal example follows below).
  ⦁ The latest version, 1.3, is available at www.xcalablemp.org.
  ⦁ Defined by the XMP WG of the PC Cluster Consortium.
• Two models of PGAS for distributed-memory parallel programming:
  ⦁ Global view (data/work mapping directives)
  ⦁ Local view (coarray)
• Interoperable with other languages and models (e.g. Python, MPI, OpenMP, OpenACC)
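As a flavor of the directive-based style, here is a minimal sketch of an XMP/C program. It is my own illustration, not from the slides; the xmp.h header name and the xmp_node_num()/xmp_num_nodes() intrinsics are assumptions based on the XMP specification.

#include <stdio.h>
#include <xmp.h>                 /* assumed header for the XMP intrinsic functions */

#pragma xmp nodes p(4)           /* the program runs on a set of 4 nodes */

int main(void)
{
    /* every node executes main() and reports its own node number */
    printf("node %d of %d\n", xmp_node_num(), xmp_num_nodes());
    return 0;
}

Compiled with the Omni XMP compiler, the directives and intrinsics are translated into calls to the XMP runtime.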
8. Two Parallelization Models in XMP
nGlobal view
⦁ Users specify how a set of nodes cooperate to solve a
whole problem.
⦁ Rich directives for data/work mapping and comm.
⦁ Highly productive but suitable mainly to data parallelism.
nLocal view
⦁ Users specify how each node works to solve a partial
problem.
⦁ Coarray of Fortran 2008.
⦁ Lowly productive but more flexible.
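As a concrete illustration of the global-view model, here is a minimal XMP/C sketch of my own, not from the slides: the nodes/template/distribute/align directives map the data, the loop directive maps the work, and a reduction clause combines the per-node partial sums. The directive spellings are assumptions based on the XMP specification.

#include <stdio.h>

#define N 1024

#pragma xmp nodes p(4)                   /* 4 executing nodes */
#pragma xmp template t(0:N-1)            /* template of N indices */
#pragma xmp distribute t(block) onto p   /* block distribution over p */

double a[N];
#pragma xmp align a[i] with t(i)         /* data mapping: a follows t */

int main(void)
{
    int i;
    double sum = 0.0;

    /* work mapping: each node touches only the elements it owns */
#pragma xmp loop (i) on t(i)
    for (i = 0; i < N; i++)
        a[i] = (double)i;

    /* the reduction clause combines the per-node partial sums */
#pragma xmp loop (i) on t(i) reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);           /* every node now holds the global sum */
    return 0;
}

In the local-view model, the same kind of exchange would instead be written explicitly with coarrays, as in slide 11 below.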
9. Example of a Global-view XMP Program
The base Fortran loop nest, before the XMP directives are added (they appear on the next slide):

real, dimension(lx,ly,lz) :: sr, se, ...
...
do iz = 1, lz-1
  do iy = 1, ly
    do ix = 1, lx
      wu0 = sm(ix,iy,iz  ) / sr(ix,iy,iz  )
      wu1 = sm(ix,iy,iz+1) / sr(ix,iy,iz+1)
      wv0 = sn(ix,iy,iz  ) / sr(ix,iy,iz  )
      ...
10. Example of a Global-view XMP Program
!$xmp nodes p(npx,npy,npz)
!$xmp template (lx,ly,lz) :: t
!$xmp distribute (block,block,block) onto p :: t
real, dimension(lx,ly,lz) :: sr, se, ...
!$xmp align (ix,iy,iz) with t(ix,iy,iz) ::
!$xmp&   sr, se, sm, sp, sn, sl, ...
!$xmp shadow (1,1,1) ::
!$xmp&   sr, se, sm, sp, sn, sl, ...
...
!$xmp reflect (sr, sm, sp, se, sn, sl)
!$xmp loop (ix,iy,iz) on t(ix,iy,iz)
do iz = 1, lz-1
  do iy = 1, ly
    do ix = 1, lx
      wu0 = sm(ix,iy,iz  ) / sr(ix,iy,iz  )
      wu1 = sm(ix,iy,iz+1) / sr(ix,iy,iz+1)
      wv0 = sn(ix,iy,iz  ) / sr(ix,iy,iz  )
      ...
In this code, the nodes/template/distribute/align/shadow directives specify the data mapping, the reflect directive performs the stencil (halo) communication, and the loop directive specifies the work mapping (parallel loops). A 1-D sketch of the shadow/reflect pattern follows below.
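To make the shadow/reflect pattern concrete, here is a 1-D analogue written in XMP/C. It is my own sketch, not from the slides; the directive spellings, in particular the shadow-width syntax, are assumptions based on the XMP specification.

#define N 1000

#pragma xmp nodes p(4)
#pragma xmp template t(0:N-1)
#pragma xmp distribute t(block) onto p

double u[N], unew[N];
#pragma xmp align u[i] with t(i)
#pragma xmp align unew[i] with t(i)
#pragma xmp shadow u[1:1]                /* one halo (shadow) element on each side */

void step(void)
{
    int i;

    /* stencil communication: refresh the halo region of u from the neighbouring nodes */
#pragma xmp reflect (u)

    /* work mapping: each node updates only the block it owns */
#pragma xmp loop (i) on t(i)
    for (i = 1; i < N - 1; i++)
        unew[i] = 0.5 * (u[i-1] + u[i+1]);
}

The reflect before the loop plays the same role as the reflect directive in the 3-D code above: it fills the shadow elements so that u[i-1] and u[i+1] are valid at the block boundaries.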
11. Local-view Programming
• Coarray, a PGAS feature of Fortran 2008, is available in XMP/C as well as in XMP/Fortran.
• Basic idea: data declared as a coarray can be accessed by remote nodes.
XMP/Fortran:
real a(1024)[*], b(1024)
a(513:1024)[1] = b(1:512)
sync all

XMP/C:
float a[1024]:[*], b[1024];
a[512:512]:[0] = b[0:512];
xmp_sync_all(NULL);
1. The array a is declared as a coarray.
2. The local array section b(1:512) is put to the remote array section a(513:1024) on image 1.
3. A memory fence and a barrier synchronization are performed.
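The example above shows a coarray put; a get in the opposite direction uses the same section notation, with the coimage on the right-hand side. Here is a small XMP/C sketch of my own (not from the slides), reusing the declarations and the xmp_sync_all() call from the example; the xmp.h header name is an assumption.

#include <xmp.h>                 /* assumed header for xmp_sync_all() */

float a[1024]:[*], b[1024];      /* a is a coarray, as in the example above */

void fetch_from_image0(void)
{
    /* get: copy 512 elements of a on image 0 into the local array b */
    b[0:512] = a[512:512]:[0];

    /* memory fence and barrier across all images, as in the example above */
    xmp_sync_all(NULL);
}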
12. Omni XcalableMP Compiler
• An open-source reference implementation being developed by RIKEN and the University of Tsukuba.
• The latest version, 1.2.2, is available at omni-compiler.org.
• Supported platforms include: K, Fujitsu FX100, NEC SX, IBM BlueGene, Hitachi SR, Cray, and Linux clusters.
• Proven applications include:
  ⦁ Plasma (3D fluid)
  ⦁ Seismic Imaging (3D stencil)
  ⦁ Fusion (Particle-in-Cell)
  ⦁ etc.
[Figure: Omni XMP compilation flow. An XMP program is processed by the Omni XMP frontend, translator, and backend into a C/Fortran + MPI program, which is then compiled by the native C/Fortran compiler and linked with the XMP runtime and communication libraries to produce an executable.]
13. HPL (of HPC Challenge Benchmarks)
• Written in the global view of XMP/C.
• The data is distributed in a block-cyclic manner, and DGEMM is invoked for each block.
• Communication and computation are overlapped using asynchronous gmove (see the sketch after the code).
double A_L[N][NB];
#pragma xmp align A_L[i][*] with t(*,i)
  :
#pragma xmp gmove async(1)
A_L[k:len][0:NB] = A[k:len][j:NB];
  :
for (m = j+NB; m < N; m += NB) {
  for (n = j+NB; n < N; n += NB) {
    cblas_dgemm(&A[m][n], ..);
    if (xmp_test_async(1)) {
      // receive A[k:len][j:NB];
      :
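The general shape of this overlap pattern, as a self-contained XMP/C sketch of my own (not from the slides): start the transfer with an asynchronous gmove, do computation that does not depend on the transferred data, and then wait. The wait_async directive and the cyclic distribution are assumptions based on the XMP specification; xmp_test_async() is the non-blocking variant shown in the code above.

#define N 1024

#pragma xmp nodes p(4)
#pragma xmp template tb(0:N-1)
#pragma xmp template tc(0:N-1)
#pragma xmp distribute tb(block) onto p    /* source distribution */
#pragma xmp distribute tc(cyclic) onto p   /* destination distribution */

double src[N], dst[N];
#pragma xmp align src[i] with tb(i)
#pragma xmp align dst[i] with tc(i)

void overlapped_redistribute(void)
{
    /* 1. start the redistribution asynchronously, tagged 1 */
#pragma xmp gmove async(1)
    dst[0:N] = src[0:N];

    /* 2. computation that does not depend on dst goes here,
          e.g. DGEMM calls on blocks that have already arrived */

    /* 3. block until the transfer tagged 1 has completed */
#pragma xmp wait_async(1)
}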
[Figure: HPL performance, TFlops vs. number of nodes (256 to 16,384): 423 TFlops (80.7%) on 4,096 nodes and 971 TFlops (46.3%) on 16,384 nodes.]
14. NICAM-DC (of Fiber Miniapps)
[Figure: NICAM-DC speedup vs. number of MPI processes (normalized so that MPI on 10 processes = 10), comparing the XMP and MPI versions.]
• Written in the local view of XMP/Fortran with coarrays.
• The coarray-based implementation is almost comparable in performance to the original MPI-based one.
15. XcalableMP 2.0
• Dynamic multitasking for manycore processors
  ⦁ A breakaway from the Bulk Synchronous Parallel (BSP) model.
  ⦁ More chances for overlapping communication and computation.
• Enhancements to loop parallelization
• Support for newer versions of the base languages (Fortran 2008, C99, and C++11)
16. Summary
• PGAS languages are promising alternatives to MPI.
• XMP is a directive-based PGAS extension for Fortran and C.
• XMP supports both global-view and local-view programming to achieve both high performance and productivity.
• XMP will be available on post-K.
More information is available at www.xcalablemp.org and omni-compiler.org.