1. Towards Auto-tuning Facilities into Supercomputers in Operation
- The FIBER approach and minimizing software-stack requirements -
Takahiro Katagiri (片桐 孝洋)
Information Technology Center,
The University of Tokyo
(東京大学 情報基盤センター)
2014 ATAT in HPSC, National Taiwan University,
March 15, 2014 (Saturday), Performance session, 10:10-10:30
Joint work with: Satoshi Ohshima(大島 聡史)
Masaharu Matsumoto(松本 正晴)
2. Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
4. Background
High-Thread Parallelism (HTP)
◦ Multi-core and many-core processors are pervasive.
Multi-core CPUs: 8-16 cores, 16-64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT).
Many-core CPU: Xeon Phi (60 cores, 240 threads with HT).
◦ Utilizing full-thread parallelism is important.
Performance Portability (PP)
◦ Keeping high performance across multiple computer environments.
Not only multiple CPUs, but also multiple compilers.
Run-time information, such as loop length and the number of threads, is important.
◦ Auto-tuning (AT) is one candidate technology for establishing PP across multiple computer environments.
5. ppOpen-HPC Project
Middleware for HPC and Its AT
◦ Supported by JST CREST, from FY2011 to FY2016.
◦ PI: Professor Kengo Nakajima (U. Tokyo)
ppOpen-HPC
◦ An open-source infrastructure for reliable simulation codes on post-peta (pp) scale parallel computers.
◦ Consists of various types of libraries covering five kinds of discretization methods for scientific computations.
ppOpen-AT
◦ An auto-tuning language for ppOpen-HPC codes.
◦ Uses knowledge from a previous project, ABCLibScript.
◦ An auto-tuning language based on AT directives.
6. Software Architecture of ppOpen-HPC
[Figure: layered software stack]
User's Program
ppOpen-APPL: FEM, FDM, FVM, BEM, DEM
ppOpen-MATH: MG, GRAPH, VIS, MP
ppOpen-AT: STATIC, DYNAMIC
ppOpen-SYS: COMM, FT
Auto-Tuning Facility: code generation for optimization candidates; search for the best candidate; automatic execution of the optimization.
Resource Allocation Facility: specifies the best execution allocations.
Target hardware: many-core CPUs, GPUs, low-power CPUs, vector CPUs.
9. A Scenario for Software Developers Using ppOpen-AT
The software developer describes the AT with ppOpen-AT directives, targeting optimizations that cannot be established by compilers:

#pragma oat install unroll (i,j,k) region start
#pragma oat varied (i,j,k) from 1 to 8
for(i = 0 ; i < n ; i++){
  for(j = 0 ; j < n ; j++){
    for(k = 0 ; k < n ; k++){
      A[i][j]=A[i][j]+B[i][k]*C[k][j]; }}}
#pragma oat install unroll (i,j,k) region end

The developer's program with AT functions is then fed to a dedicated preprocessor, which is invoked to generate executable code with optimization candidates and AT functions.
■ Automatically generated functions: optimization candidates, performance monitor, parameter search, performance modeling.
Targets: optimizations for source codes, computer resources, and power consumption.
10. Compiler Optimization and AT
1. Loop length is unknown at compile time.
The optimal loop split and loop fusion are determined at run time.
Run-time compilation is still only a research topic.
2. Loop splits with data dependencies.
Some loop splits require additional computation or memory space.
Some compilers provide a directive for this, but the directive is not standardized.
Code optimization is also not standardized between compilers.
3. Restrictions from the operation of supercomputers.
Some supercomputer environments cannot supply the required "software stack", or the software stack cannot be utilized due to operational restrictions.
The system may also be out of scope due to hardware restrictions, e.g., CAPS on the K computer.
Operation costs (budgets), vendor strategy, etc.
13. Target Application
Seism3D: simulation software for seismic wave analysis.
Strategic simulation software in Japan.
Developed by Professor Furumura at the University of Tokyo.
◦ The code has been re-constructed as ppOpen-APPL/FDM.
Finite Difference Method (FDM)
3D simulation
◦ 3D arrays are allocated.
Data type: single precision (real*4)
Source: http://www.eri.u-tokyo.ac.jp/furumura/tsunami/tsunami.html
14. The Heaviest Loop (20%+ of Total Time)

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3, &
!$omp&                    DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL1 = LAM(I,J,K)
      RM1 = RIG(I,J,K)
      RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
      DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
      DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
      SXX(I,J,K) = SXX(I,J,K) + (RLRM2*D3V3 - RM2*(DZVZ1+DYVY1)) * DT
      SYY(I,J,K) = SYY(I,J,K) + (RLRM2*D3V3 - RM2*(DXVX1+DZVZ1)) * DT
      SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*D3V3 - RM2*(DXVX1+DYVY1)) * DT
      DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
      DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
      DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
      SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
      SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
      SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
    END DO
  END DO
END DO
!$omp end parallel do

A flow dependency (through RM1) connects the two halves of the loop body.
15. Optimization Possibilities
Loop splitting
◦ To reduce spill code.
◦ To maximize register usage.
Loop fusion (loop collapse)
◦ The 3-nested loop is transformed by one of the following two approaches.
◦ A single loop nest:
To increase outer-loop parallelism for thread parallelism.
◦ Two nested loops:
To increase outer-loop parallelism for thread parallelism.
To utilize prefetching in the inner loop.
16. Loop Fusion - One-dimensional (a Loop Collapse)

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3, &
!$omp&                    DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY * NX
  K = (KK-1)/(NY*NX) + 1
  J = mod((KK-1)/NX, NY) + 1
  I = mod(KK-1, NX) + 1
  RL1 = LAM(I,J,K)
  RM1 = RIG(I,J,K)
  RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
  DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
  DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
  SXX(I,J,K) = SXX(I,J,K) + (RLRM2*D3V3 - RM2*(DZVZ1+DYVY1)) * DT
  SYY(I,J,K) = SYY(I,J,K) + (RLRM2*D3V3 - RM2*(DXVX1+DZVZ1)) * DT
  SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*D3V3 - RM2*(DXVX1+DYVY1)) * DT
  DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
  DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
  DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
  SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
  SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
  SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
END DO
!$omp end parallel do

Merit: the loop length is huge, which is good for OpenMP thread parallelism.
17. Loop Fusion - Two-dimensional

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3, &
!$omp&                    DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY
  K = (KK-1)/NY + 1
  J = mod(KK-1, NY) + 1
  DO I = 1, NX
    RL1 = LAM(I,J,K)
    RM1 = RIG(I,J,K)
    RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
    DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
    DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
    SXX(I,J,K) = SXX(I,J,K) + (RLRM2*D3V3 - RM2*(DZVZ1+DYVY1)) * DT
    SYY(I,J,K) = SYY(I,J,K) + (RLRM2*D3V3 - RM2*(DXVX1+DZVZ1)) * DT
    SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*D3V3 - RM2*(DXVX1+DYVY1)) * DT
    DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
    DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
    DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
    SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
    SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
    SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
  END DO
END DO
!$omp end parallel do

Merit: the loop length is large, which is good for OpenMP thread parallelism,
and the inner I-loop provides an opportunity for prefetching.
18. Perfect Splitting: Two 3-nested Loops

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3, &
!$omp&                    DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL1 = LAM(I,J,K)
      RM1 = RIG(I,J,K)
      RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
      DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
      DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
      SXX(I,J,K) = SXX(I,J,K) + (RLRM2*D3V3 - RM2*(DZVZ1+DYVY1)) * DT
      SYY(I,J,K) = SYY(I,J,K) + (RLRM2*D3V3 - RM2*(DXVX1+DZVZ1)) * DT
      SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*D3V3 - RM2*(DXVX1+DYVY1)) * DT
    END DO
    DO I = 1, NX
      RM1 = RIG(I,J,K)
      DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
      DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
      DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
      SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
      SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
      SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
    END DO
  END DO
END DO
!$omp end parallel do

Re-computation (the copy RM1 = RIG(I,J,K) in the second loop) is needed,
so compilers do not apply this split without a directive.
20. Candidates of Auto-generated Codes
#1 [Baseline]: the original three-nested loop.
#2 [Split]: loop split at the k-loop (two separate three-nested loops).
#3 [Split]: loop split at the j-loop.
#4 [Split]: loop split at the i-loop.
#5 [Fusion]: loop fusion of the k-loop and j-loop (a two-nested loop).
#6 [Split and Fusion]: fusion of the k-loop and j-loop applied to the loops in #2.
#7 [Fusion]: fusion of the k-loop, j-loop, and i-loop (loop collapse).
#8 [Split and Fusion]: fusion of the k-loop, j-loop, and i-loop applied to the loops in #2 (loop collapses for the two separated loops).
23. An Example of a Seism3D Simulation
The 2000 western Tottori earthquake in Japan ([1], p. 14).
The region of 820 km x 410 km x 128 km is discretized at 0.4 km:
NX x NY x NZ = 2050 x 1025 x 320 (≒ 6.4 : 3.2 : 1).
Figure: seismic wave propagation in the western Tottori earthquake.
(a) Measured waves; (b) simulation results ([1], p. 13).
[1] T. Furumura, "Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.
24. Test Conditions
Software versions
◦ ppOpen-APPL/FDM version 0.2
◦ ppOpen-AT version 0.2
Target kernels in ppOpen-APPL/FDM
◦ Top 10 kernels (all three-nested loops):
Update_stress
Update_vel
Update_spong
and 7 other kernels in the finite difference computations.
AT timing
◦ Before execution time:
after the problem size and the number of threads are fixed by the user, AT is applied when the library routine is called.
All AT candidates are evaluated (brute-force search).
◦ Only 8 + 3 + 6 + 7*3 = 38 candidates.
Number of repeats for each kernel in AT mode
◦ 100 times
25. The Xeon Phi Cluster System
Intel Xeon (Ivy Bridge): host CPU
OS: Red Hat Enterprise Linux Server release 6.2
#Nodes: 32 (available: 14 nodes)
CPU: Intel Xeon E5-2670 v2 @ 2.50 GHz, 2 sockets x 10 cores
Hyper-Threading: ON
Theoretical peak performance per CPU node: 400 GFLOPS
Memory size per node: 64 GB
Interconnect: InfiniBand
Compiler: Intel Fortran version 14.0.0.080 Build 20130728
Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel
KMP_AFFINITY=granularity=fine,compact (all threads packed onto one socket)
Intel Xeon Phi co-processor (Xeon Phi): accelerator
CPU: Xeon Phi 5110P (B1 stepping), 1.053 GHz, 60 cores
Memory size: 8 GB
Theoretical peak performance: 1 TFLOPS (= 1.053 GHz x 16 FLOPS x 60 cores)
One board connected to each node of the cluster
Native mode
Compiler: Intel Fortran version 14.0.0.080 Build 20130728
Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmiccl -align array64byte
KMP_AFFINITY=granularity=fine,balanced (all threads distributed equally across cores)
27. Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• Target problem size
- NX x NY x NZ = 256 x 96 x 100 per node
- NX x NY x NZ = 32 x 16 x 20 per core (not per MPI process)
• Native mode for the MIC
• Target MPI processes and threads on the Xeon Phi
- 1 node of the Xeon Phi with 4-way Hyper-Threading (HT)
- PXTY: X MPI processes and Y threads per process
- P240T1: pure MPI with 4 HT per core
- P120T2
- P60T4
- P16T15
- P8T30: the minimum hybrid MPI-OpenMP execution for ppOpen-APPL/FDM, since it requires at least 8 MPI processes.
• Number of iterations for the kernels: 100
28. AT Effect (update_stress, Xeon Phi) [Seconds]
KMP_AFFINITY=balanced, -align array64byte, new kernels

                P240T1  P120T2  P60T4  P16T15  P8T30
Without AT       2.11    2.32    2.33   2.96    3.14
With AT          1.29    1.70    1.74   1.91    1.97
Speedup          1.63    1.36    1.34   1.55    1.59
Best SW            6       5       5      5       6
29. Conclusion
Loop fusion to obtain high parallelism is one of the key techniques for current multi- and many-core architectures.
◦ Execution with 240 threads per MPI process on the Xeon Phi.
◦ Strong scaling with more than 10,000 cores on the FX10.
To apply AT on supercomputers in operation, minimizing the required "software stack" is a practical way to establish AT.
30. ppOpen-AT is Free Software!
ppOpen-AT version 0.2 is available!
The license is MIT.
Please access the following page:
http://ppopenhpc.cc.u-tokyo.ac.jp/