By Toshiyuki Imamura, RIKEN AICS
RIKEN and Fujitsu are developing ARM-based numerical libraries optimized using the new ARM-SVE feature. We present the porting status of netlib+SSL-II for ARM-SVE and other OSS. We also demonstrate some optimization policies and techniques, especially for the basic numerical linear algebra kernels.
Toshiyuki Imamura Bio
Toshiyuki Imamura is currently the team leader of Large-scale Parallel Numerical Computing Technology at the Advanced Institute for Computational Science (AICS), RIKEN. He is in charge of the development of numerical libraries for the post-K project. His research interests include high-performance computing, automatic-tuning technology, and eigenvalue computation (algorithms, software, and applications). He and his colleagues (the Japan Atomic Energy Agency (JAEA) team) were finalists for the Gordon Bell Prize at SC05 and SC06. He is a member of IPSJ, JSIAM, and SIAM.
Email
imamura.toshiyuki@riken.jp
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
2. What’s Scalasca?
• Toolset for performance studies of parallel applications (MPI + OpenMP)
• Open source, 3-clause BSD license
• Portable implementation: IBM Blue Gene, Cray XT/XE/XK/XC, SGI Altix, Fujitsu FX10/100, K computer, Linux (x86, Power, ARM), Intel Xeon Phi
• Depends on the Score-P instrumenter and measurement libraries
• Supports common data formats: reads event traces in OTF2 format and writes analysis reports in CUBE4 format
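3. Scalasca workflow
[Figure: Scalasca workflow. Score-P part: source modules pass through the instrumenter (compiler/linker) to give an instrumented executable; running the instrumented target application with the measurement library (plus hardware counters, HWC) produces a summary report and local event traces. Scalasca trace analysis part: a parallel wait-state search over the traces produces a wait-state report. Report manipulation then answers: Which problem? Where in the program? Which process? The summary report feeds back into an optimized measurement configuration.]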
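The "optimized measurement configuration" feedback loop in the workflow can be sketched roughly as follows: score an existing summary measurement, write a filter, and reuse the filter for a trace run. This is a minimal sketch, not the presenter's procedure. The archive and executable names are taken from the run shown later in these slides; the filter file name `scorep.filt` and the excluded region patterns are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: ARCHIVE matches the summary experiment shown in the slides;
# FILTER and the EXCLUDE patterns below are hypothetical examples.
ARCHIVE=scorep_bt-mz_16x6_sum
FILTER=scorep.filt

# 1. Write a Score-P filter file that excludes small, frequently called routines.
cat > "$FILTER" <<'EOF'
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    matmul_sub*
    matvec_sub*
SCOREP_REGION_NAMES_END
EOF

# 2. If the tools and a summary profile are available, estimate the trace
#    size with the filter applied, then collect a filtered trace.
if command -v scorep-score >/dev/null 2>&1 \
   && command -v scan >/dev/null 2>&1 \
   && [ -f "$ARCHIVE/profile.cubex" ]; then
    scorep-score -r -f "$FILTER" "$ARCHIVE/profile.cubex"
    scan -t -f "$FILTER" mpiexec -np 16 ./bt-mz.B.16
else
    echo "Score-P/Scalasca tools or summary profile not found; wrote $FILTER only"
fi
```

Scoring before tracing is the usual way to keep trace buffers and measurement overhead within bounds; which regions to exclude depends on the score report for the actual application.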
4. Scalasca Status
• Scalasca can be used for parallel application performance studies on arm64
• GNU Autotools have been updated to recognize the arm64 architecture
• Latest stable version is 2.3.1 (May 2016)
• Cube v4.4 is upcoming
• Sampling mode for arm64 is in progress
• Bug fixes and enhancements are planned
5. Sampling Mode
• Important for avoiding excessive overhead due to instrumentation
• Requires the libunwind package
• Interrupt sources: POSIX timers, perf, and PAPI
• Works on x86; work on arm64 is ongoing
• Issue: passing a PLT-entry resolved address to libunwind does not work as expected
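On a platform where sampling works, a sampling-based summary run might be configured as sketched below. This is an assumption-laden sketch: it relies on a Score-P build configured with libunwind support, the sampling period is an arbitrary illustrative value, and the executable name is borrowed from the exercise later in the slides.

```shell
#!/bin/sh
# Sketch only: assumes Score-P was built with libunwind support and that the
# application was linked through the scorep wrapper. For sampling, compiler
# instrumentation is typically disabled at build time (e.g. scorep --nocompiler).
export SCOREP_ENABLE_UNWINDING=true                  # record calling context via stack unwinding
export SCOREP_SAMPLING_EVENTS=perf_cycles@2000000    # interrupt source and period (illustrative value)
export OMP_NUM_THREADS=6

if command -v scan >/dev/null 2>&1 && [ -x ./bt-mz.B.16 ]; then
    scan -s mpiexec -np 16 ./bt-mz.B.16
else
    echo "scan or executable not found; would sample with $SCOREP_SAMPLING_EVENTS"
fi
```

The interrupt source is chosen through the event name in `SCOREP_SAMPLING_EVENTS`; the perf-based source is used here only as an example.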
6. libunwind test results on arm64
$ make check on arm64 produces:
============================================================================
Testsuite summary for libunwind 1.3-rc1
============================================================================
# TOTAL: 35
# PASS: 26
# SKIP: 0
# XFAIL: 0
# FAIL: 9
# XPASS: 0
# ERROR: 0
============================================================================
See tests/test-suite.log
Please report to libunwind-devel@nongnu.org
============================================================================
As of 1.3-rc1, AArch64 status is “Works well”
Test environment:
• kernel: 4.14
• gcc: 4.8.5 20150623 (Red Hat 4.8.5-16)
• glibc: 2.17
• hardware: Cavium ThunderX
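To put the summary in proportion, the pass rate can be worked out from the counts reported above. The snippet is a small sketch that embeds those three numbers directly rather than reading a real test-suite.log.

```shell
#!/bin/sh
# Counts copied from the testsuite summary on this slide.
summary='# TOTAL: 35
# PASS: 26
# FAIL: 9'

total=$(printf '%s\n' "$summary" | awk '/^# TOTAL:/ {print $3}')
pass=$(printf '%s\n' "$summary" | awk '/^# PASS:/ {print $3}')
fail=$(printf '%s\n' "$summary" | awk '/^# FAIL:/ {print $3}')

# Integer percentage of passing tests (rounded down).
rate=$(( 100 * pass / total ))
echo "libunwind on arm64: ${pass}/${total} passed (${rate}%), ${fail} failed"
```

Roughly a quarter of the tests fail, which is consistent with the unwinding issue noted for sampling mode.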
7. Cube Status
• Release v4.4 is upcoming
• Major changes since stable v4.3:
• Packaging
• Many plugins for customized performance analysis
• KNL vectorization adviser
• OTF2 Trace visualizer
• Sunburst
• ScorePion
• Memory footprint reduction (to appear in v4.4 or after)
• http://www.scalasca.org/software/cube-4.x/download.html
9. NPB3.3-MZ-MPI/BT Exercise on ThunderX
• NAS Parallel Benchmarks suite (sample MZ-MPI version)
• Available from http://www.nas.nasa.gov/Software/NPB
• 3 benchmarks (all in Fortran77, using OpenMP+MPI)
• Configurable for various sizes & classes
10. NPB-MZ-MPI/BT (Block Tridiagonal Solver)
• What does it do?
• Solves a discretized version of the unsteady, compressible Navier-Stokes equations in three spatial dimensions
• Performs 200 time-steps on a regular 3-dimensional grid using ADI and verifies that the solution error is within an acceptable limit
• Intra-zone computation with OpenMP, inter-zone with MPI
• Implemented in about 20 Fortran77 source modules
• Runs with any number of MPI processes and OpenMP threads
• On ThunderX, bt-mz_B.16 with 6 threads per process (16x6) should run in about 30 seconds
• CLASS=B is recommended
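Because the benchmark accepts any combination of processes and threads, it helps to check the total thread count against the available cores before launching. The sketch below uses the 16x6 layout from this exercise; the variable names are illustrative.

```shell
#!/bin/sh
# Layout used in this exercise: 16 MPI processes with 6 OpenMP threads each.
NPROCS=16
export OMP_NUM_THREADS=6
TOTAL_THREADS=$(( NPROCS * OMP_NUM_THREADS ))

# Number of online cores on this node (0 if it cannot be determined).
CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 0)

echo "layout: ${NPROCS}x${OMP_NUM_THREADS} = ${TOTAL_THREADS} threads, ${CORES} cores online"
if [ "$CORES" -gt 0 ] && [ "$TOTAL_THREADS" -gt "$CORES" ]; then
    echo "warning: more threads than cores; expect oversubscription"
fi
```

The product, 96, matches the "Total number of threads" line the benchmark prints at startup.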
11. NPB-MZ-MPI/BT profile execution
• Set OMP_NUM_THREADS and launch as an MPI application
-bash-4.2$ scan -s mpiexec -np 16 ./bt-mz.B.16
S=C=A=N: Scalasca 2.3.1 runtime summarization
S=C=A=N: ./scorep_bt-mz_16x6_sum experiment archive
S=C=A=N: Sat Dec 2 12:30:05 2017: Collect start
/home/itaru/opt/openmpi-2.1.1/bin/mpiexec -np 16 ./bt-mz.B.16
NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark
Number of zones: 8 x 8
Iterations: 200 dt: 0.000300
Number of active processes: 16
Use the default load factors with threads
Total number of threads: 96 ( 6.0 threads/process)
Calculated speedup = 93.84
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class B
accuracy setting for epsilon = 0.1000000000000E-07
[…]
S=C=A=N: Sat Dec 2 12:30:44 2017: Collect done (status=0) 39s
S=C=A=N: ./scorep_bt-mz_16x6_sum complete.
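Once collection completes, the experiment archive can be examined. The sketch below first recomputes the collection time from the two timestamps in the log above, then, if Scalasca's `square` command and the archive are present, prints a textual score report instead of opening the GUI. The helper function name is made up for this example.

```shell
#!/bin/sh
# Timestamps copied from the "Collect start" and "Collect done" lines above.
start="12:30:05"
end="12:30:44"
ARCHIVE=scorep_bt-mz_16x6_sum

# Hypothetical helper: convert HH:MM:SS to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

elapsed=$(( $(to_seconds "$end") - $(to_seconds "$start") ))
echo "collection took ${elapsed}s"

# Post-process the archive and print a score report (no GUI) when possible.
if command -v square >/dev/null 2>&1 && [ -d "$ARCHIVE" ]; then
    square -s "$ARCHIVE"
else
    echo "square or $ARCHIVE not found; skipping report examination"
fi
```

Running `square` without `-s` on the same archive opens the report in the Cube browser where a display is available.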
12. NPB-MZ-MPI/BT build configuration definition
• config/make.def
# F77 - Fortran compiler
# FFLAGS - Fortran compilation arguments
# F_INC - any -I arguments required for compiling Fortran
# FLINK - Fortran linker
# FLINKFLAGS - Fortran linker arguments
# F_LIB - any -L and -l arguments required for linking Fortran
#
# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
# $(F77) $(FFLAGS)
# linking is done with $(FLINK) $(F_LIB) $(FLINKFLAGS)
#---------------------------------------------------------------------------
#---------------------------------------------------------------------------
# This is the fortran compiler used for fortran programs
#---------------------------------------------------------------------------
F77 = scorep mpif77
#F77 = mpif77
The Score-P wrapper (scorep) goes just before the compiler.
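After editing config/make.def, the benchmark has to be rebuilt so that every module goes through the Score-P wrapper. The following is a sketch that assumes it is run from the top-level NPB3.3-MZ-MPI directory; the expected executable name is inferred from the run command used in this exercise.

```shell
#!/bin/sh
# Build parameters for this exercise.
CLASS=B
NPROCS=16
EXE="bt-mz.${CLASS}.${NPROCS}"   # expected name, matching the run command in the slides

if command -v scorep >/dev/null 2>&1 && [ -f config/make.def ]; then
    # Show which F77 definition is active (it should start with "scorep").
    grep -n '^F77' config/make.def
    # Rebuild from scratch so all objects are instrumented.
    make clean
    make bt-mz CLASS="$CLASS" NPROCS="$NPROCS"
else
    echo "scorep or config/make.def not found; would build $EXE"
fi
```

Switching back to an uninstrumented build means swapping the commented and uncommented F77 lines and rebuilding in the same way.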
14. Summary
• Scalasca and Score-P have been ported to arm64, and the tools work on real hardware
• Sampling is the missing feature
• The data visualization and analysis framework will be updated
15. Thanks to
• Markus Geimer (JSC)
• Pavel Saviankou (JSC)
• Brian Wylie (JSC)
• Michael Knobloch (JSC)
• Scalasca and Score-P Communities