Subtle Asynchrony by Jeff Hammond

1
Subtle Asynchrony
Jeff Hammond
NVIDIA HPC Group
2
Abstract
I will discuss subtle asynchrony in two contexts.
First, how do we bring asynchronous task parallelism to the Fortran
language, without relying on threads or related concepts?
Second, I will describe how asynchronous task parallelism emerges in
NWChem via overdecomposition, without programmers thinking about
tasks. This example demonstrates that many of the principles of
asynchronous many-task execution can be achieved without specialized
runtime systems or programming abstractions.
3
The Concept of Asynchrony
4
“There is no cloud, just someone else’s computer.”
“There is no asynchrony, just some other processor.”
5
“There is no cloud, just someone else’s computer.”
“There is no asynchrony, just some other processor.”
“There is no asynchrony, just some other thread.”
6
“There is no cloud, just someone else’s computer.”
“There is no asynchrony, just some other processor.”
“There is no asynchrony, just some other thread.”
“There is no asynchrony, just some other context.”
7
“There is no cloud, just someone else’s computer.”
“There is no asynchrony, just some other processor.”
“There is no asynchrony, just some other thread.”
“There is no asynchrony, just some other context.”
Hardware concurrency
Software concurrency
Software concurrency
8
“There is no cloud, just someone else’s computer.”
“There is no asynchrony, just some other processor.”
“There is no asynchrony, just some other thread.”
“There is no asynchrony, just some other context.”
Hardware concurrency
Software concurrency
Software concurrency
Forward progress
Forward progress
Scheduled
9
Examples of asynchrony
#pragma omp parallel num_threads(2)
{
assert( omp_get_num_threads() >= 2 );
switch( omp_get_thread_num() )
{
case 0:
MPI_Ssend(...);
break;
case 1:
MPI_Recv(...);
break;
}
}
10
Examples of asynchrony (or not)
#pragma omp parallel num_threads(2)
#pragma omp master
{
#pragma omp task
{
MPI_Ssend(...);
}
#pragma omp task
{
MPI_Recv(...);
}
}
11
#pragma omp parallel num_threads(2)
#pragma omp master
{
MPI_Request r;
#pragma omp task
{
MPI_Issend(...,&r);
nicewait(&r);
}
#pragma omp task
{
MPI_Irecv(...,&r);
nicewait(&r);
}
}
static inline
void nicewait(MPI_Request * r)
{
int flag=0;
while (!flag) {
MPI_Test(r, &flag, ..);
#pragma omp taskyield
}
}
Totally useless
(or not)
12
Analysis
OpenMP tasks are not guaranteed to make forward progress and are not guaranteed to be scheduled concurrently.
This permits a trivial implementation and compiler optimizations such as task fusion.
Prescriptive parallelism: programmer decides, e.g. OpenMP threads.
Descriptive parallelism: implementation decides, e.g. OpenMP tasks.
https://asc.llnl.gov/sites/asc/files/2020-09/2-20_larkin.pdf
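To make the distinction concrete, here is a minimal Fortran sketch (not from the slides; names and sizes are illustrative): the OpenMP directive prescribes the number of threads, while DO CONCURRENT only describes that the iterations are independent and leaves the decision to the implementation.

program prescriptive_vs_descriptive
  implicit none
  integer, parameter :: n = 1024
  integer :: i
  real :: X(n), Y(n), Z(n)
  X = 1.0; Y = 2.0
  ! prescriptive: the programmer decides to use four threads
  !$omp parallel do num_threads(4)
  do i = 1, n
    Z(i) = X(i) + Y(i)
  end do
  !$omp end parallel do
  ! descriptive: the implementation decides whether (and how) to parallelize
  do concurrent (i = 1:n)
    Z(i) = X(i) + Y(i)
  end do
  print *, Z(1)
end program prescriptive_vs_descriptive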
13
Fortran Tasks and Asynchrony?
14
Parallelism in Fortran 2018
! coarse-grain parallelism
.., codimension[:] :: X, Y, Z
npes = num_images()
n_local = n / npes
do i=1,n_local
Z(i) = X(i) + Y(i)
end do
sync all
! fine-grain parallelism
! explicit
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! implicit
MATMUL
TRANSPOSE
RESHAPE
...
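For completeness, a minimal self-contained sketch of the coarse-grain form above, with the elided declarations filled in (assuming n is divisible by the number of images; names are illustrative):

program coarray_axpy
  implicit none
  integer, parameter :: n = 1024
  integer :: npes, n_local, i
  real, allocatable :: X(:)[:], Y(:)[:], Z(:)[:]   ! coarrays: one local block per image
  npes    = num_images()
  n_local = n / npes
  allocate(X(n_local)[*], Y(n_local)[*], Z(n_local)[*])
  X = 1.0; Y = 2.0
  do i = 1, n_local               ! each image computes its own block
    Z(i) = X(i) + Y(i)
  end do
  sync all                        ! all images finish before any remote access
  if (this_image() == 1) print *, Z(1)
end program coarray_axpy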
15
Three trivial tasks
module numerot
contains
pure real function yksi(X)
real, intent(in) :: X(100)
yksi = norm2(X)
end function yksi
pure real function kaksi(X)
real, intent(in) :: X(100)
kaksi = 2*norm2(X)
end function kaksi
pure real function kolme(X)
real, intent(in) :: X(100)
kolme = 3*norm2(X)
end function kolme
end module numerot
program main
use numerot
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1
B = 1
C = 1
RA = yksi(A)
RB = kaksi(B)
RC = kolme(C)
print*,RA+RB+RC
end program main
16
Tasks with DO CONCURRENT
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1
B = 1
C = 1
do concurrent (k=1:3)
select case (k)
case(1)
RA = yksi(A)
case(2)
RB = kaksi(B)
case(3)
RC = kolme(C)
end select
end do
print*,RA+RB+RC
DO CONCURRENT (DC) is descriptive
parallelism. “concurrent” is only a hint and
does not imply any form of concurrency in
the implementation.
Any code based on DC has
implementation-defined asynchrony.
Only PURE procedures are allowed inside DC,
which further limits this construct
as a mechanism for realizing asynchronous
tasks.
17
Tasks with coarrays
real, dimension(100) :: A
real :: R
A = 1
if (num_images().ne.3) error stop
select case (this_image())
case(1)
R = yksi(A)
case(2)
R = kaksi(A)
case(3)
R = kolme(A)
end select
sync all
call co_sum(R)
if (this_image().eq.1) print*,R
Coarray images are properly concurrent,
usually equivalent to an MPI/OS process.
The number of images is non-increasing
(constant minus failures).
Data is private to images unless explicitly
communicated. Cooperative algorithms
require an MPI-like approach.
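A minimal sketch (names illustrative) of what "explicitly communicated" means: data moves between images only through coindexed references.

program coarray_get
  implicit none
  real :: A(4)[*]                 ! every image has its own copy of A
  A = real(this_image())
  sync all
  if (num_images() >= 2 .and. this_image() == 1) then
    A(:) = A(:)[2]                ! explicit one-sided "get" of image 2's data
    print *, A
  end if
end program coarray_get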
18
An explicit Fortran tasking model
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1; B = 1; C = 1
task(1)
RA = yksi(A)
end task
task(2)
RB = kaksi(B)
end task
task(3)
RC = kolme(C)
end task
task_wait([1,2,3])
print*,RA+RB+RC
Like OpenMP, tasks are descriptive and not
required to be asynchronous, to permit
trivial implementations.
Tasks can share data but only in a limited
way, because Fortran lacks a (shared)
memory consistency model.
Is this sufficient for interesting use cases?
19
Motivation for Tasks
[Figure: fork-join execution on a 4-core CPU (cores 0-3): sequential, fork into parallel region, join, sequential.]
! sequential
call my_input(X,Y)
! parallel
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential
call my_output(Z)
20
Motivation for Tasks
[Figure: fork-join execution on a 4-core CPU (cores 0-3): sequential, fork into parallel region, join, sequential.]
! sequential
call my_input(X,Y)
! parallel
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential
call my_unrelated(A)
21
Motivation for Tasks
[Figure: fork-join execution on CPU+GPU: sequential on the CPU, fork, parallel region on the GPU, join, sequential on the CPU.]
! sequential on CPU
call my_input(X,Y)
! parallel on GPU
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential on CPU
call my_unrelated(A)
22
Motivation for Tasks
[Figure: fork-join execution on CPU+GPU with asynchronous launch: the CPU continues with unrelated work while the GPU runs the parallel region; the overlap is the savings.]
! sequential on CPU
call my_input(X,Y)
! parallel on GPU w/ async
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential on CPU w/ async
call my_unrelated(A)
23
Motivation for Tasks (synthetic)
call sub1(IN=A,OUT=B)
call sub2(IN=C,OUT=D)
call sub3(IN=E,OUT=F)
call sub4(IN1=B,IN2=D,OUT=G)
call sub5(IN1=F,IN2=G,OUT=H)
! 5 steps require only 3 phases
[Figure: dependency graph: steps 1, 2, and 3 produce B, D, and F from A, C, and E independently; step 4 combines B and D into G; step 5 combines F and G into H.]
Fortran compilers may be able to prove
these procedures are independent, but it
is often impossible to prove that executing
them in parallel is profitable.
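The same graph can already be expressed with OpenMP task dependences in Fortran, which is exactly the kind of prior art the proposal can draw on. A self-contained sketch follows; sub1 through sub5 are stand-ins for the slide's hypothetical procedures.

program three_phases
  implicit none
  real :: A, B, C, D, E, F, G, H
  A = 1; C = 2; E = 3
  !$omp parallel
  !$omp single
  ! phase 1: three independent tasks
  !$omp task depend(out:B)
  call sub1(A, B)
  !$omp end task
  !$omp task depend(out:D)
  call sub2(C, D)
  !$omp end task
  !$omp task depend(out:F)
  call sub3(E, F)
  !$omp end task
  ! phase 2: waits only on B and D
  !$omp task depend(in:B,D) depend(out:G)
  call sub4(B, D, G)
  !$omp end task
  ! phase 3: waits on F and G
  !$omp task depend(in:F,G) depend(out:H)
  call sub5(F, G, H)
  !$omp end task
  !$omp end single
  !$omp end parallel
  print *, H
contains
  subroutine sub1(x, y)
    real, intent(in) :: x
    real, intent(out) :: y
    y = x + 1
  end subroutine sub1
  subroutine sub2(x, y)
    real, intent(in) :: x
    real, intent(out) :: y
    y = x + 1
  end subroutine sub2
  subroutine sub3(x, y)
    real, intent(in) :: x
    real, intent(out) :: y
    y = x + 1
  end subroutine sub3
  subroutine sub4(x, y, z)
    real, intent(in) :: x, y
    real, intent(out) :: z
    z = x + y
  end subroutine sub4
  subroutine sub5(x, y, z)
    real, intent(in) :: x, y
    real, intent(out) :: z
    z = x + y
  end subroutine sub5
end program three_phases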
24
Motivation for Tasks (realistic)
https://dl.acm.org/doi/10.1145/2425676.2425687
https://pubs.acs.org/doi/abs/10.1021/ct100584w
25
Describing asynchronous communication
subroutine stuff(A,B,C)
real :: A, B, C
call co_sum(A)
call co_min(B)
call co_max(C)
end subroutine stuff
subroutine stuff(A,B,C)
real :: A, B, C
task co_sum(A)
task co_min(B)
task co_max(C)
task_wait
end subroutine stuff
subroutine stuff(A,B,C)
use mpi_f08
real :: A, B, C
type(MPI_Request) :: R(3)
call MPI_Iallreduce(..A..SUM..R(1))
call MPI_Iallreduce(..B..MIN..R(2))
call MPI_Iallreduce(..C..MAX..R(3))
call MPI_Waitall(3,R,..)
end subroutine stuff
26
Describing asynchronous computation
do i = 1,b
C(i) = MATMUL(A(i),B(i))
end do
do i = 1,b
task
C(i) = MATMUL(A(i),B(i))
end task
end do
task_wait
cudaStreamCreate(s)
cublasCreate(h)
cublasSetStream(h,s)
do i = 1,b
cublasDgemm_v2(h,
cu_op_n,cu_op_n,
n,n,n,
one,A(i),n,B(i),n,
one,C(i),n)
end do
cudaDeviceSynchronize()
27
Describing asynchronous computation
do i = 1,b
C(i) = MATMUL(A(i),B(i))
end do
do i = 1,b
j = mod(i,8) + 1
task j
C(i) = MATMUL(A(i),B(i))
end task
end do
task_wait
do j=1,8
cudaStreamCreate(s(j))
cublasCreate(h(j))
cublasSetStream(h(j), s(j))
end do
do i = 1,b
j = mod(i,8) + 1
cublasDgemm_v2(h(j),
cu_op_n,cu_op_n,
n,n,n,
one,A(i),n,B(i),n,
one,C(i),n)
end do
cudaDeviceSynchronize()
https://github.com/nwchemgit/nwchem/blob/master/src/ccsd/ccsd_trpdrv_openacc.F
28
J3/WG5 papers targeting Fortran 2026
https://j3-fortran.org/doc/year/22/22-169.pdf Fortran asynchronous tasks
https://j3-fortran.org/doc/year/23/23-174.pdf Asynchronous Tasks in Fortran
There is consensus that this is a good feature to add to Fortran, but we have a long way
to go to define syntax and semantics. We will not just copy C++, nor specify threads.
29
Overdecomposition and
Implicit Task Parallelism
30
Classic MPI Domain Decomposition
31
Overdecomposition
32
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
! Static Parallelization
MySet = decompose[ (1:N)^4 ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
33
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End Forall
! Static Parallelization
IJKL = (1:N)^4
MySet = decompose[ NonZero(IJKL) ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
34
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
V = (IJ|KL) ! Variable cost
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End Forall
! Static Parallelization
IJKL = (1:N)^4
MySet = decompose[ Cost(IJKL) ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
35
Quantum Chemistry Algorithms
! Dynamic Parallelization
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End If
End Forall
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
Task(I,J,K,L)
End If
End If
End Forall
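In NWChem this MyTurn() test is implemented with a shared global counter (a NXTVAL-style fetch-and-increment): each process atomically draws the next task index and executes only the tasks it drew. Below is a hypothetical modern-Fortran sketch of the same idea using an atomic fetch-and-add on a coarray; the names my_turn and next_task are illustrative, not NWChem's API.

module dynamic_scheduler
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind
  implicit none
  integer(atomic_int_kind), save :: next_task[*] = 0   ! shared counter; image 1's copy is authoritative
  integer(atomic_int_kind) :: my_task = -1             ! next task index this image has claimed
  integer :: task_seen = -1                            ! tasks encountered so far in the loop
contains
  logical function my_turn()
    ! claim a new task index whenever the loop has caught up with the last claimed one
    task_seen = task_seen + 1
    if (my_task < task_seen) then
      call atomic_fetch_add(next_task[1], 1, my_task)  ! fetch-and-increment the shared counter
    end if
    my_turn = (my_task == task_seen)
  end function my_turn
end module dynamic_scheduler

Every image runs the Forall loop and calls my_turn() once per nonzero quartet; the image whose claimed index matches the current loop count executes that task, so load balance emerges from the shared counter rather than from a static decomposition.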
36
Quantum Chemistry Algorithms
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
Task(I,J,K,L)
End If
End If
End Forall
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
FancySystem(NonZeroSet,Task)
37
Summary
NWChem, GAMESS, and other QC codes distribute irregular computations by
decoupling the work decomposition from the processing elements.
The body of a distributed loop is a task.
Efficient when num_tasks >> num_proc and dynamic scheduling is cheap.
Overdecomposition + Dynamic Scheduling = AMT w/o the system
https://www.mcs.anl.gov/papers/P3056-1112_1.pdf
38
Summary
• Task parallelism, which may be asynchronous, is under consideration for
Fortran standardization.
• Learn from prior art in OpenMP, OpenACC, Ada, etc.
• Descriptive, not prescriptive, behavior, like DO CONCURRENT.
• Successful distributed memory quantum chemistry codes are implicitly using
AMT concepts, but without explicit tasks or a tasking system.
• Irregular workloads and inhomogeneous system performance are handled nicely by AMT
systems, but not all apps are capable of adopting AMT systems.
• Can we find ways to subtly bring AMT concepts into more “old fashioned” apps?