Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems
ESPM2 2018
Nov. 12th, Dallas
Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto, Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori, Miyuki Tsubouchi, Jun Makino
Abstract
• For an explicit finite-difference scheme applied to computational fluid dynamics, we achieved 4.78 PFlops, 21.5% of peak performance, on a large-scale PEZY-SC2-based system with very low B/F (bytes per flop), by using temporal blocking
• The achieved efficiency is comparable to recent work on systems with very high B/F
• To achieve this efficiency on a low-B/F machine, we developed
  • A framework for explicit stencil computation that generates the MPI boilerplate code and the device kernel code with temporal blocking
  • A finite-difference scheme suitable for temporal blocking
Table of Contents
• Introduction
  • Explicit stencil computation
  • Temporal blocking
  • About PEZY-SC2
• Details of our work
  • Code generation framework: Formura
  • Optimization for PEZY-SC2
• Benchmark results
  • Performance on large-scale systems (Gyoukou)
• Discussion and summary
Introduction
Explicit Stencil Computation
• Explicit stencil computation is a simple but very important HPC application
• It is used to simulate weather, earthquakes, the interior of the sun, etc.
• Optimizing stencil computation is therefore very important
Source: Riken
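As a minimal illustration of what an explicit stencil computation looks like, the sketch below advances a 1D diffusion equation by one time step. It is not the scheme used in this work; the function name `diffusion_step`, the coefficient `alpha`, and the periodic boundary are assumptions chosen for brevity.

```python
def diffusion_step(u, alpha=0.1):
    """Advance a 1D field by one explicit time step (periodic boundary).

    Each new value depends only on the old values of the cell and its two
    neighbours: this fixed access pattern is the "stencil".
    """
    n = len(u)
    return [
        u[i] + alpha * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]


# Hypothetical initial condition: a single hot cell in the middle.
field = [0.0] * 8
field[4] = 1.0
field = diffusion_step(field)
```

Every step streams the whole array through memory while doing only a few flops per cell, which is why memory bandwidth tends to limit this kind of code.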
Efficiency of Recent Stencil Computation
• The efficiency of explicit methods on recent HPC hardware is not high enough
  • Even the best-case efficiency on the K computer is ≅ 20%; many other cases reach only ≅ 10%
• This low efficiency is caused either by the processor architecture or by memory bandwidth
• We address the memory-bandwidth problem, which does not depend on the architecture
• Over the past decades, the B/F of HPC systems has dropped dramatically
  • This trend seems likely to continue
Relative Performance Trend
• Green: FLOPS vs. memory bandwidth (4.5x/decade)
• Red: FLOPS vs. network latency (~30x/decade)
• This trend seems likely to continue
Source: John D. McCalpin
PEZY-SC2
• Many-core MIMD processor
• 1984 individual RISC cores
• 2.8 TFlops peak double-precision (DP) performance (@ 700 MHz)
• 4-channel DDR4 DRAM, 64 GB, 80 GB/s
• ⇒ B/F ≒ 0.03
  • cf. K computer: 0.5, Tesla V100: 0.12, TaihuLight: 0.04
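The B/F value follows directly from the two figures quoted above. The short check below only restates that arithmetic; the variable names are illustrative.

```python
# Figures quoted on the slide for PEZY-SC2.
peak_flops = 2.8e12         # 2.8 TFlops, double precision
memory_bandwidth = 80e9     # 80 GB/s DDR4

# B/F: bytes of memory traffic available per floating-point operation.
bytes_per_flop = memory_bandwidth / peak_flops
print(f"B/F = {bytes_per_flop:.3f}")   # prints "B/F = 0.029", i.e. about 0.03
```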
PEZY-SC2 Architecture
• The chip consists of 8 prefectures
• Each prefecture contains 16 cities
• Each city contains 16 processor elements (PEs)
• 8 × 16 × 16 − 64 (redundancy) = 1984 PEs
• The PEs in each city share an L2 cache
Gyoukou
• Supercomputer installed at JAMSTEC, Japan
  • (Available until April 2018)
• Peak 28.2 PFlops (all nodes)
• 4th on the Top500 list (Nov 2017)
• 10000 PEZY-SC2s + 1250 Xeon D processors (1 per 8 SC2s)
• World's largest number of MIMD processor cores (≒ 20M; cf. TaihuLight ≒ 11M)
• Well suited to testing whether code can scale to exascale systems
Details of the work
Temporal Blocking (TB)
• One solution for running explicit methods on low-B/F systems
• With TB, multiple time steps are computed on a working array while it stays in cache, which reduces the required B/F as long as the working array fits in the processor cache
[Figure: data flow between network, DRAM, and cache; labels: "Read from memory", "Computation for one time step", "Write to memory"]
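The idea can be sketched on a 1D three-point stencil. The example below uses the simplest overlapped variant of TB, in which each block is read together with a halo of `nt` cells per side and the halo is recomputed redundantly. This is an assumption made for illustration, not the parallelogram-shaped scheme described later in these slides. The names `step` and `blocked_steps` and the periodic boundary are likewise hypothetical.

```python
def step(u, alpha=0.1):
    """One explicit 3-point stencil step on a periodic 1D array."""
    n = len(u)
    return [u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n])
            for i in range(n)]


def blocked_steps(u, nt, block, alpha=0.1):
    """Advance `u` by `nt` steps, one block at a time (overlapped TB)."""
    n = len(u)
    out = [0.0] * n
    for start in range(0, n, block):
        stop = min(start + block, n)
        # "Read from memory": the block plus a halo of nt cells per side.
        work = [u[i % n] for i in range(start - nt, stop + nt)]
        # Compute nt steps on the working array without touching `u` again.
        for _ in range(nt):
            # The outermost cells lack a neighbour, so the valid region
            # shrinks by one cell per side per step (redundant halo work).
            work = [work[j] + alpha * (work[j - 1] - 2.0 * work[j] + work[j + 1])
                    for j in range(1, len(work) - 1)]
        # "Write to memory": exactly the cells of this block remain.
        out[start:stop] = work
    return out


# Compare against four plain steps over the whole array.
u0 = [float(i % 5) for i in range(32)]
reference = u0
for _ in range(4):
    reference = step(reference)
blocked = blocked_steps(u0, nt=4, block=8)
```

In the blocked version each cell of `u` is read and written once per `nt` steps instead of once per step, at the cost of recomputing the halo cells.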
Various Methods of TB
[Figure: several TB methods. One is labeled "A variation of this is used for inter-node communication"; another is labeled "This is used for in-node computation"]
Details of Our TB Calculation
• Inter-node communication
  • Each node sends data in one direction
  • Each node receives data from one direction
  • Simple overlapping of communication and computation
Details of Our TB Calculation
• Computation starts from the right-most block
• The upper-right part of the parallelogram uses dummy data so that all loop lengths are equal
  • The gray part holds results that are not needed
  • This method adds only a small amount of extra computation
SL4TH3 Scheme
• Fourth-order accuracy
• Number of stencils = 2
• Flops per cell per step ≒ 2800
• Required B/F ≒ 0.05 without TB
[Figure: input differential equation]
Formura: a Framework for TB
• From a stencil description written in the Formura DSL, Formura generates optimized distributed-parallel code for large-scale parallel computers
• In this work, we added support for the TB method and developed a device kernel code generator for PEZY-SC2
[Figure: workflow. Formura DSL → (formura) → MPI driver code and TB kernel code → (gcc/mpicc) → Executable]
Code Generation by Formura
[Figure: an input equation (some additional configuration files are also needed), a very small part of the generated C code, and a zoomed-in view of the output]
Code Generation
• Formura generates:
  • Driver code that distributes the TB computation over MPI
  • Optimized kernel code for node-local computation
• For a new accelerator (or any other processor), a backend can be added by modifying the code that computes the temporal-blocking steps
  • Typically, the major device-specific optimizations are the blocking layout for data-access locality and thread scheduling
Optimization for PEZY-SC2
• Decide the block size
  • The block must be smaller than the last-level cache (LLC)
  • The parallelism should be close to the number of PEs for load balancing
• 44 × 44 × 44 is the best block size
  • 44³ × 10 (variables/cell) × 8 bytes = 6.50 MB
  • 6.50 MB × 2 (to overlap reads and writes) < 32 MB = LLC size
  • 44² (parallelism) = 1936 ≒ 1984 (number of PEs)
  • Total of 880³ (880 = 44 × 20) cells per node, which fits in 64 GB
• Allocate adjacent cells to PEs that share an L2 cache
• Decrease the instruction size of the innermost loop
  • The PEZY-SC2 L1 instruction cache is 4 KB (= 1024 ops)
  • SL4TH3 requires 2800 ops/cell
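The block-size arithmetic above can be re-checked with a few lines. This is only a restatement of the slide's numbers: the helper `block_report` is hypothetical, "MB" is read as MiB for the cache and "GB" as 10⁹ bytes for DRAM (both assumptions), and the parallelism is taken as the square of the block edge, as the slide states.

```python
BYTES_PER_VALUE = 8          # double precision
VARIABLES_PER_CELL = 10
LLC_BYTES = 32 * 2**20       # 32 MB last-level cache (assumed to mean MiB)
NUM_PES = 1984
DRAM_BYTES = 64 * 10**9      # 64 GB per PEZY-SC2 (assumed decimal GB)


def block_report(edge):
    """Return (block size in MiB, fits in LLC when doubled, parallelism)."""
    block_bytes = edge**3 * VARIABLES_PER_CELL * BYTES_PER_VALUE
    fits = 2 * block_bytes < LLC_BYTES    # x2 to overlap reads and writes
    parallelism = edge**2                 # as defined on the slide
    return block_bytes / 2**20, fits, parallelism


size_mib, fits, parallelism = block_report(44)
node_bytes = 880**3 * VARIABLES_PER_CELL * BYTES_PER_VALUE

print(f"{size_mib:.2f} MiB, fits={fits}, parallelism={parallelism}")
print(f"per-node data: {node_bytes / 1e9:.1f} GB")
```

With an edge of 44 the block is about 6.50 MiB, two copies fit in the LLC, and the 1936-way parallelism sits just under the 1984 PEs.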
Results
Benchmark Results
• Conditions
  • SL4TH3 scheme
  • Optimized backend for PEZY-SC2
  • 8000 PEZY-SC2s (20 × 20 × 20 layout) on Gyoukou
    • ≒ 16M cores
  • Total of 17600³ cells (17600 = 880 × 20)
• Performance results
  • 4.78 PFlops
  • 21.5% efficiency (22.2 PFlops theoretical peak)
Effect of Temporal Blocking
Required B/F by NT (NT = time step parameter; size per node = 880³)

NT   Redundant calculation by TB   Required B/F
1    1.4%                          0.058
2    2.7%                          0.029
3    4.0%                          0.020
4    5.4%                          0.015
5    6.7%                          0.012
6    8.0%                          0.010
7    9.2%                          0.009
8    10.5%                         0.008

[Figure: calculation speed (GF) by NT]
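The trend in the table can be approximated with a back-of-the-envelope model. The model is an assumption of this sketch, not something stated in the slides: each cell's 10 double-precision variables are read once and written once per TB pass, the traffic is inflated by the redundant fraction from the table, and the cost is spread over NT steps of 2800 flops each. The function name `required_bf` is hypothetical.

```python
FLOPS_PER_CELL_STEP = 2800      # SL4TH3, from the slides
BYTES_PER_CELL = 10 * 8 * 2     # 10 variables x 8 bytes, read once and written once

# NT -> redundant calculation fraction, copied from the table.
REDUNDANCY = {1: 0.014, 2: 0.027, 3: 0.040, 4: 0.054,
              5: 0.067, 6: 0.080, 7: 0.092, 8: 0.105}


def required_bf(nt):
    """Estimated bytes per flop when nt time steps share one read and one write."""
    traffic = BYTES_PER_CELL * (1.0 + REDUNDANCY[nt])   # assumed: redundant cells are moved too
    return traffic / (FLOPS_PER_CELL_STEP * nt)


estimates = {nt: round(required_bf(nt), 3) for nt in REDUNDANCY}
for nt, bf in estimates.items():
    print(nt, bf)
```

Under these assumptions the estimates round to the same three-decimal values as the table, which suggests the required B/F falls roughly as 1/NT with a small penalty for redundant work.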
Comparison with Other Studies
• This work achieves very high efficiency
  • Comparable to results on a very high B/F (= 0.5) system
[Figure: efficiency (%) and device B/F for Yashiro et al., Yang et al., Hotta et al., and this work]
Weak Scaling
• Communication is completely hidden by computation
• Thus, even though the actual communication time increases as the number of nodes grows, weak scaling of the performance is very good
[Figure: weak scaling; communication time and total time]
Future Work
• Other schemes
  • HLLD
• Other applications
  • Tsunami (shallow-water equations)
  • Reaction-diffusion systems
• Further performance improvement
Conclusion
• We achieved 4.78 PFlops, 21.5% of peak performance, with a fluid simulation code on a large-scale PEZY-SC2-based system
• We developed an automatic code generation framework for TB, a scheme suited to it, and a backend for the PEZY-SC2 accelerator
• The achieved efficiency is comparable to other work on high-B/F systems
