The document discusses two contexts of subtle asynchrony. First, how to bring asynchronous task parallelism to Fortran without relying on threads. Second, it describes how NWChem achieves asynchronous task parallelism through overdecomposition of work, without programmers explicitly using tasks. This demonstrates that asynchronous many-task execution principles can be achieved without specialized runtime systems or programming abstractions. Quantum chemistry algorithms are provided as an example where overdecomposition leads to implicit asynchronous parallelism through dynamic scheduling of irregularly distributed tasks.
2. 2
Abstract
I will discuss subtle asynchrony in two contexts.
First, how do we bring asynchronous task parallelism to the Fortran
language, without relying on threads or related concepts?
Second, I will describe how asynchronous task parallelism emerges in
NWChem via overdecomposition, without programmers thinking about
tasks. This example demonstrates that many of the principles of
asynchronous many task execution can be achieved without specialized
runtime systems or programming abstractions.
8. 8
“There is no cloud, just someone else’s computer.”
“There is no asynchrony, just some other processor.” (hardware concurrency; forward progress)
“There is no asynchrony, just some other thread.” (software concurrency; forward progress; scheduled)
“There is no asynchrony, just some other context.” (software concurrency)
9. 9
Examples of asynchrony
#pragma omp parallel num_threads(2)
{
assert( omp_get_num_threads() >= 2 );
switch( omp_get_thread_num() )
{
case 0:
// the synchronous send cannot complete until the matching receive has
// been posted, so the receiving thread must make progress concurrently
MPI_Ssend(...);
break;
case 1:
MPI_Recv(...);
break;
}
}
12. 12
Analysis
OpenMP tasks are not guaranteed to make forward progress and are not schedulable.
This allows a trivial implementation and compiler optimizations like
task fusion.
Prescriptive parallelism: programmer decides, e.g. OpenMP threads.
Descriptive parallelism: implementation decides, e.g. OpenMP tasks.
https://asc.llnl.gov/sites/asc/files/2020-09/2-20_larkin.pdf
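A minimal sketch (mine, not from the slides) of OpenMP tasks in Fortran illustrates the point: a conforming implementation may run both tasks back-to-back on the encountering thread, which is why the Ssend/Recv pairing above cannot safely be rewritten with tasks.
program omp_tasks_descriptive
  use omp_lib
  implicit none
  !$omp parallel num_threads(2)
  !$omp single
  ! the runtime decides when and where these tasks run;
  ! both may execute sequentially on the same thread
  !$omp task
  print *, 'task 1 ran on thread', omp_get_thread_num()
  !$omp end task
  !$omp task
  print *, 'task 2 ran on thread', omp_get_thread_num()
  !$omp end task
  !$omp taskwait
  !$omp end single
  !$omp end parallel
end program omp_tasks_descriptive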
14. 14
Parallelism in Fortran 2018
! coarse-grain parallelism
.., codimension[:] :: X, Y, Z
npes = num_images()
n_local = n / npes
do i=1,n_local
Z(i) = X(i) + Y(i)
end do
sync all
! fine-grain parallelism
! explicit
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! implicit
MATMUL
TRANSPOSE
RESHAPE
...
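For reference, a minimal complete program assembled from the fragments above; the declarations, the problem size n, and the sample values are my additions, and it assumes the number of images divides n evenly.
program axpy_images
  implicit none
  integer, parameter :: n = 1024
  real, allocatable :: X(:)[:], Y(:)[:], Z(:)[:]
  integer :: i, npes, n_local
  ! coarse-grain parallelism: one block of each vector per image
  npes = num_images()
  n_local = n / npes
  allocate(X(n_local)[*], Y(n_local)[*], Z(n_local)[*])
  X = 1; Y = 2
  ! fine-grain parallelism: iterations within the local block
  do concurrent (i = 1:n_local)
    Z(i) = X(i) + Y(i)
  end do
  sync all
  if (this_image() == 1) print *, 'Z(1) =', Z(1)
end program axpy_images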
15. 15
Three trivial tasks
module numerot
contains
pure real function yksi(X)
real, intent(in) :: X(100)
yksi = norm2(X)
end function yksi
pure real function kaksi(X)
real, intent(in) :: X(100)
kaksi = 2*norm2(X)
end function kaksi
pure real function kolme(X)
real, intent(in) :: X(100)
kolme = 3*norm2(X)
end function kolme
end module numerot
program main
use numerot
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1
B = 1
C = 1
RA = yksi(A)
RB = kaksi(B)
RC = kolme(C)
print*,RA+RB+RC
end program main
16. 16
Tasks with DO CONCURRENT
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1
B = 1
C = 1
do concurrent (k=1:3)
select case (k)
case (1)
RA = yksi(A)
case (2)
RB = kaksi(B)
case (3)
RC = kolme(C)
end select
end do
print*,RA+RB+RC
DO CONCURRENT (DC) is descriptive
parallelism. “concurrent” is only a hint and
does not imply any form of concurrency in
the implementation.
Any code based on DC has
implementation-defined asynchrony.
Only PURE procedures may be referenced inside DC,
which further limits this construct
as a mechanism for realizing asynchronous
tasks.
17. 17
Tasks with coarrays
real, dimension(100) :: A
real :: R
A = 1
if (num_images().ne.3) error stop
select case (this_image())
case (1)
R = yksi(A)
case (2)
R = kaksi(A)
case (3)
R = kolme(A)
end select
sync all
call co_sum(R)
if (this_image().eq.1) print*,R
Coarray images are properly concurrent,
usually equivalent to an MPI/OS process.
The number of images is non-increasing
(constant minus failures).
Data is private to images unless explicitly
communicated. Cooperative algorithms
require an MPI-like approach.
18. 18
An explicit Fortran tasking model
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1; B = 1; C = 1
task(1)
RA = yksi(A)
end task
task(2)
RB = kaksi(B)
end task
task(3)
RC = kolme(C)
end task
task_wait([1,2,3])
print*,RA+RB+RC
Like OpenMP, tasks are descriptive and not
required to be asynchronous, to permit
trivial implementations.
Tasks can share data but only in a limited
way, because Fortran lacks a (shared)
memory consistency model.
Is this sufficient for interesting use cases?
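For comparison, a sketch (mine, not part of the proposal) of the same three tasks written with today's OpenMP tasking in Fortran, reusing the numerot module from the earlier slide; the proposed construct would express the same thing without relying on threads.
program main
  use numerot
  implicit none
  real, dimension(100) :: A, B, C
  real :: RA, RB, RC
  A = 1; B = 1; C = 1
  !$omp parallel
  !$omp single
  !$omp task shared(RA, A)
  RA = yksi(A)
  !$omp end task
  !$omp task shared(RB, B)
  RB = kaksi(B)
  !$omp end task
  !$omp task shared(RC, C)
  RC = kolme(C)
  !$omp end task
  !$omp taskwait
  !$omp end single
  !$omp end parallel
  print*,RA+RB+RC
end program main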
19. 19
Motivation for Tasks
[Diagram: fork-join execution on a 4-core CPU: sequential section, fork, parallel section across cores 0-3, join, sequential section]
! sequential
call my_input(X,Y)
! parallel
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential
call my_output(Z)
20. 20
Motivation for Tasks
[Diagram: fork-join execution on a 4-core CPU: sequential section, fork, parallel section across cores 0-3, join, sequential section]
! sequential
call my_input(X,Y)
! parallel
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential
call my_unrelated(A)
21. 21
Motivation for Tasks
[Diagram: fork-join execution on CPU+GPU: sequential sections run on the CPU, the parallel section is offloaded to the GPU]
! sequential on CPU
call my_input(X,Y)
! parallel on GPU
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential on CPU
call my_unrelated(A)
22. 22
Motivation for Tasks
[Diagram: fork-join execution on CPU+GPU with asynchronous offload: the CPU continues past the fork while the GPU runs the parallel loop]
! sequential on CPU
call my_input(X,Y)
! parallel on GPU w/ async
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential on CPU w/ async
call my_unrelated(A)
Savings: the unrelated CPU work overlaps the asynchronous GPU loop, shortening the total runtime.
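One way to realize the "w/ async" version today is with OpenACC directives; a sketch, assuming an OpenACC compiler, with the unrelated CPU work inlined as a simple array update in place of my_unrelated:
program overlap
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: X(:), Y(:), Z(:), A(:)
  integer :: i
  allocate(X(n), Y(n), Z(n), A(n))
  X = 1; Y = 2; A = 0
  !$acc data copyin(X, Y) copyout(Z)
  ! launch the loop on the GPU without blocking the host
  !$acc parallel loop async(1)
  do i = 1, n
    Z(i) = X(i) + Y(i)
  end do
  ! unrelated CPU work overlaps the GPU kernel: this is the savings
  A = A + 1
  ! block until the asynchronous GPU work is complete
  !$acc wait(1)
  !$acc end data
  print *, Z(1), A(1)
end program overlap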
23. 23
Motivation for Tasks (synthetic)
call sub1(IN=A,OUT=B)
call sub2(IN=C,OUT=D)
call sub3(IN=E,OUT=F)
call sub4(IN1=B,IN2=D,OUT=G)
call sub5(IN1=F,IN2=G,OUT=H)
! 5 steps require only 3 phases
[Diagram: task graph: steps 1, 2, and 3 produce B, D, and F from A, C, and E and can run concurrently; step 4 combines B and D into G; step 5 combines F and G into H]
Fortran compilers may be able to prove
these procedures are independent, but it
is often impossible to prove that executing
them in parallel is profitable.
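A sketch of how this dependence structure can be stated explicitly today with OpenMP task dependences in Fortran; sub1..sub5 and the array names are the slide's placeholders, and the directives are my addition. The runtime is free to discover the 3-phase schedule from the depend clauses.
!$omp parallel
!$omp single
!$omp task depend(in: A) depend(out: B)
call sub1(IN=A, OUT=B)
!$omp end task
!$omp task depend(in: C) depend(out: D)
call sub2(IN=C, OUT=D)
!$omp end task
!$omp task depend(in: E) depend(out: F)
call sub3(IN=E, OUT=F)
!$omp end task
!$omp task depend(in: B, D) depend(out: G)
call sub4(IN1=B, IN2=D, OUT=G)
!$omp end task
!$omp task depend(in: F, G) depend(out: H)
call sub5(IN1=F, IN2=G, OUT=H)
!$omp end task
!$omp end single
!$omp end parallel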
24. 24
Motivation for Tasks (realistic)
https://dl.acm.org/doi/10.1145/2425676.2425687
https://pubs.acs.org/doi/abs/10.1021/ct100584w
25. 25
Describing asynchronous communication
subroutine stuff(A,B,C)
real :: A, B, C
call co_sum(A)
call co_min(B)
call co_max(C)
end subroutine stuff
subroutine stuff(A,B,C)
real :: A, B, C
task co_sum(A)
task co_min(B)
task co_max(C)
task_wait
end subroutine stuff
subroutine stuff(A,B,C)
use mpi_f08
real :: A, B, C
type(MPI_Request) :: R(3)
call MPI_Iallreduce(..A..SUM..R(1))
call MPI_Iallreduce(..B..MIN..R(2))
call MPI_Iallreduce(..C..MAX..R(3))
call MPI_Waitall(3,R,..)
end subroutine stuff
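A filled-in sketch of the MPI variant, assuming each argument is a scalar reduced in place across MPI_COMM_WORLD; the argument lists elided on the slide are my completion.
subroutine stuff(A, B, C)
  use mpi_f08
  implicit none
  real, intent(inout) :: A, B, C
  type(MPI_Request) :: R(3)
  ! start all three reductions; they proceed concurrently
  call MPI_Iallreduce(MPI_IN_PLACE, A, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, R(1))
  call MPI_Iallreduce(MPI_IN_PLACE, B, 1, MPI_REAL, MPI_MIN, MPI_COMM_WORLD, R(2))
  call MPI_Iallreduce(MPI_IN_PLACE, C, 1, MPI_REAL, MPI_MAX, MPI_COMM_WORLD, R(3))
  ! wait for all of them to complete
  call MPI_Waitall(3, R, MPI_STATUSES_IGNORE)
end subroutine stuff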
26. 26
Describing asynchronous computation
do i = 1,b
C(i) = MATMUL(A(i),B(i))
end do
do i = 1,b
task
C(i) = MATMUL(A(i),B(i))
end task
end do
task_wait
cudaStreamCreate(s)
cublasCreate(h)
cublasSetStream(h,s)
do i = 1,b
cublasDgemm_v2(h,
cu_op_n,cu_op_n,
n,n,n,
one,A(i),n,B(i),n,
one,C(i),n)
end do
cudaDeviceSynchronize()
27. 27
Describing asynchronous computation
do i = 1,b
C(i) = MATMUL(A(i),B(i))
end do
do i = 1,b
j = mod(i,8)
task j
C(i) = MATMUL(A(i),B(i))
end task
end do
task_wait
do j=1,8
cudaStreamCreate(s(j))
cublasCreate(h(j))
cublasSetStream(h(j), s(j))
end do
do i = 1,b
j = mod(i,8)
cublasDgemm_v2(h(j),
cu_op_n,cu_op_n,
n,n,n,
one,A(i),n,B(i),n,
one,C(i),n)
end do
cudaDeviceSynchronize()
https://github.com/nwchemgit/nwchem/blob/master/src/ccsd/ccsd_trpdrv_openacc.F
28. 28
J3/WG5 papers targeting Fortran 2026
https://j3-fortran.org/doc/year/22/22-169.pdf Fortran asynchronous tasks
https://j3-fortran.org/doc/year/23/23-174.pdf Asynchronous Tasks in Fortran
There is consensus that this is a good feature to add to Fortran, but we have a long way
to go to define syntax and semantics. We will not just copy C++, nor specify threads.
32. 32
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
! Static Parallelization
MySet = decompose[ (1:N)^4 ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
33. 33
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End Forall
! Static Parallelization
IJKL = (1:N)^4
MySet = decompose[ NonZero(IJKL) ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
34. 34
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
V = (IJ|KL) ! Variable cost
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End Forall
! Static Parallelization
IJKL = (1:N)^4
MySet = decompose[ Cost(IJKL) ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
35. 35
Quantum Chemistry Algorithms
! Dynamic Parallelization
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End If
End Forall
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
Task(I,J,K,L)
End If
End If
End Forall
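MyTurn() is typically implemented as a shared global counter (the NXTVAL idiom from Global Arrays): each process atomically claims the next task number and executes only the iterations it has claimed. A sketch using MPI one-sided operations follows; the module and procedure names are mine, not NWChem's, which uses the Global Arrays runtime rather than MPI windows.
module myturn_mod
  use mpi_f08
  implicit none
  type(MPI_Win) :: win
  integer :: counter = 0   ! the shared counter, hosted on rank 0
  integer :: itask = 0     ! global index of the current candidate task
  integer :: mine = -1     ! task number this process has claimed
contains
  subroutine myturn_init()
    integer(kind=MPI_ADDRESS_KIND) :: winsize
    integer :: me
    call MPI_Comm_rank(MPI_COMM_WORLD, me)
    winsize = merge(storage_size(counter)/8, 0, me == 0)
    call MPI_Win_create(counter, winsize, storage_size(counter)/8, &
                        MPI_INFO_NULL, MPI_COMM_WORLD, win)
    mine = nxtval()
  end subroutine myturn_init
  integer function nxtval()
    ! atomically fetch-and-increment the counter on rank 0
    integer :: one, old
    integer(kind=MPI_ADDRESS_KIND) :: disp
    one = 1; disp = 0
    call MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win)
    call MPI_Fetch_and_op(one, old, MPI_INTEGER, 0, disp, MPI_SUM, win)
    call MPI_Win_unlock(0, win)
    nxtval = old
  end function nxtval
  logical function myturn()
    ! true if this process owns the current task; if so, claim the next one
    myturn = (itask == mine)
    if (myturn) mine = nxtval()
    itask = itask + 1
  end function myturn
end module myturn_mod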
36. 36
Quantum Chemistry Algorithms
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
Task(I,J,K,L)
End If
End If
End Forall
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
FancySystem(NonZeroSet,Task)
37. 37
Summary
NWChem, GAMESS, and other QC codes distribute irregular computations by
decoupling the work decomposition from the processing elements.
The body of a distributed loop is a task.
Efficient when num_tasks >> num_proc and dynamic scheduling is cheap.
Overdecomposition + Dynamic Scheduling = AMT w/o the system
https://www.mcs.anl.gov/papers/P3056-1112_1.pdf
38. 38
Summary
• Task parallelism, which may be asynchronous, is under consideration for
Fortran standardization.
• Learn from prior art in OpenMP, OpenACC, Ada, etc.
• Descriptive, not prescriptive, behavior, like DO CONCURRENT.
• Successful distributed memory quantum chemistry codes are implicitly using
AMT concepts, but without explicit tasks or a tasking system.
• Irregular workloads or inhomogeneous system performance are nicely solved by AMT
systems, but not all apps are capable of adopting AMT systems.
• Can we find ways to subtly bring AMT concepts into more “old fashioned” apps?