The document discusses two contexts of subtle asynchrony. First, how to bring asynchronous task parallelism to Fortran without relying on threads. Second, it describes how NWChem achieves asynchronous task parallelism through overdecomposition of work, without programmers explicitly using tasks. This demonstrates that asynchronous many-task execution principles can be achieved without specialized runtime systems or programming abstractions. Quantum chemistry algorithms are provided as an example where overdecomposition leads to implicit asynchronous parallelism through dynamic scheduling of irregularly distributed tasks.
2. 2
Abstract
I will discuss subtle asynchrony in two contexts.
First, how do we bring asynchronous task parallelism to the Fortran
language, without relying on threads or related concepts?
Second, I will describe how asynchronous task parallelism emerges in
NWChem via overdecomposition, without programmers thinking about
tasks. This example demonstrates that many of the principles of
asynchronous many task execution can be achieved without specialized
runtime systems or programming abstractions.
8. 8
“There is no cloud, just someone else’s computer.”
“There is no asynchrony, just some other processor.” (hardware concurrency; forward progress)
“There is no asynchrony, just some other thread.” (software concurrency; forward progress; scheduled)
“There is no asynchrony, just some other context.” (software concurrency)
9. 9
Examples of asynchrony
#pragma omp parallel num_threads(2)
{
assert( omp_get_num_threads() >= 2 );
switch( omp_get_thread_num() )
{
case 0:
// the synchronous send cannot complete until the matching receive has
// been posted, so the receiving thread must make progress concurrently
MPI_Ssend(...);
break;
case 1:
MPI_Recv(...);
break;
}
}
12. 12
Analysis
OpenMP tasks are not guaranteed to make forward progress and are not schedulable.
This allows a trivial implementation and compiler optimizations like
task fusion.
Prescriptive parallelism: programmer decides, e.g. OpenMP threads.
Descriptive parallelism: implementation decides, e.g. OpenMP tasks.
https://asc.llnl.gov/sites/asc/files/2020-09/2-20_larkin.pdf
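A minimal sketch (mine, not from the slides) of OpenMP tasks in Fortran illustrates the point: a conforming implementation may run both tasks back-to-back on the encountering thread, which is why the Ssend/Recv pairing above cannot safely be rewritten with tasks.
program omp_tasks_descriptive
  use omp_lib
  implicit none
  !$omp parallel num_threads(2)
  !$omp single
  ! the runtime decides when and where these tasks run;
  ! both may execute sequentially on the same thread
  !$omp task
  print *, 'task 1 ran on thread', omp_get_thread_num()
  !$omp end task
  !$omp task
  print *, 'task 2 ran on thread', omp_get_thread_num()
  !$omp end task
  !$omp taskwait
  !$omp end single
  !$omp end parallel
end program omp_tasks_descriptive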
14. 14
Parallelism in Fortran 2018
! coarse-grain parallelism
.., codimension[:] :: X, Y, Z
npes = num_images()
n_local = n / npes
do i=1,n_local
Z(i) = X(i) + Y(i)
end do
sync all
! fine-grain parallelism
! explicit
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! implicit
MATMUL
TRANSPOSE
RESHAPE
...
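For reference, a minimal complete program assembled from the fragments above; the declarations, the problem size n, and the sample values are my additions, and it assumes the number of images divides n evenly.
program axpy_images
  implicit none
  integer, parameter :: n = 1024
  real, allocatable :: X(:)[:], Y(:)[:], Z(:)[:]
  integer :: i, npes, n_local
  ! coarse-grain parallelism: one block of each vector per image
  npes = num_images()
  n_local = n / npes
  allocate(X(n_local)[*], Y(n_local)[*], Z(n_local)[*])
  X = 1; Y = 2
  ! fine-grain parallelism: iterations within the local block
  do concurrent (i = 1:n_local)
    Z(i) = X(i) + Y(i)
  end do
  sync all
  if (this_image() == 1) print *, 'Z(1) =', Z(1)
end program axpy_images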
15. 15
Three trivial tasks
module numerot
contains
pure real function yksi(X)
real, intent(in) :: X(100)
yksi = norm2(X)
end function yksi
pure real function kaksi(X)
real, intent(in) :: X(100)
kaksi = 2*norm2(X)
end function kaksi
pure real function kolme(X)
real, intent(in) :: X(100)
kolme = 3*norm2(X)
end function kolme
end module numerot
program main
use numerot
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1
B = 1
C = 1
RA = yksi(A)
RB = kaksi(B)
RC = kolme(C)
print*,RA+RB+RC
end program main
16. 16
Tasks with DO CONCURRENT
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1
B = 1
C = 1
do concurrent (k=1:3)
select case (k)
case (1)
RA = yksi(A)
case (2)
RB = kaksi(B)
case (3)
RC = kolme(C)
end select
end do
print*,RA+RB+RC
DO CONCURRENT (DC) is descriptive
parallelism. “concurrent” is only a hint and
does not imply any form of concurrency in
the implementation.
Any code based on DC has
implementation-defined asynchrony.
Only PURE procedures may be referenced inside DC,
which further limits this construct
as a mechanism for realizing asynchronous
tasks.
17. 17
Tasks with coarrays
real, dimension(100) :: A
real :: R
A = 1
if (num_images().ne.3) error stop
select case (this_image())
case (1)
R = yksi(A)
case (2)
R = kaksi(A)
case (3)
R = kolme(A)
end select
sync all
call co_sum(R)
if (this_image().eq.1) print*,R
Coarray images are properly concurrent,
usually equivalent to an MPI/OS process.
The number of images is non-increasing
(constant minus failures).
Data is private to images unless explicitly
communicated. Cooperative algorithms
require an MPI-like approach.
18. 18
An explicit Fortran tasking model
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1; B = 1; C = 1
task(1)
RA = yksi(A)
end task
task(2)
RB = kaksi(B)
end task
task(3)
RC = kolme(C)
end task
task_wait([1,2,3])
print*,RA+RB+RC
Like OpenMP, tasks are descriptive and not
required to be asynchronous, to permit
trivial implementations.
Tasks can share data but only in a limited
way, because Fortran lacks a (shared)
memory consistency model.
Is this sufficient for interesting use cases?
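For comparison, a sketch (mine, not part of the proposal) of the same three tasks written with today's OpenMP tasking in Fortran, reusing the numerot module from the earlier slide; the proposed construct would express the same thing without relying on threads.
program main
  use numerot
  implicit none
  real, dimension(100) :: A, B, C
  real :: RA, RB, RC
  A = 1; B = 1; C = 1
  !$omp parallel
  !$omp single
  !$omp task shared(RA, A)
  RA = yksi(A)
  !$omp end task
  !$omp task shared(RB, B)
  RB = kaksi(B)
  !$omp end task
  !$omp task shared(RC, C)
  RC = kolme(C)
  !$omp end task
  !$omp taskwait
  !$omp end single
  !$omp end parallel
  print*,RA+RB+RC
end program main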
19. 19
Motivation for Tasks
[Diagram: fork-join execution on a 4-core CPU: sequential section, fork, parallel section across cores 0-3, join, sequential section]
! sequential
call my_input(X,Y)
! parallel
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential
call my_output(Z)
20. 20
Motivation for Tasks
[Diagram: fork-join execution on a 4-core CPU: sequential section, fork, parallel section across cores 0-3, join, sequential section]
! sequential
call my_input(X,Y)
! parallel
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential
call my_unrelated(A)
21. 21
Motivation for Tasks
[Diagram: fork-join execution on CPU+GPU: sequential sections run on the CPU, the parallel section is offloaded to the GPU]
! sequential on CPU
call my_input(X,Y)
! parallel on GPU
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential on CPU
call my_unrelated(A)
22. 22
Motivation for Tasks
[Diagram: fork-join execution on CPU+GPU with asynchronous offload: the CPU continues past the fork while the GPU runs the parallel loop]
! sequential on CPU
call my_input(X,Y)
! parallel on GPU w/ async
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential on CPU w/ async
call my_unrelated(A)
Savings: the unrelated CPU work overlaps the asynchronous GPU loop, shortening the total runtime.
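One way to realize the "w/ async" version today is with OpenACC directives; a sketch, assuming an OpenACC compiler, with the unrelated CPU work inlined as a simple array update in place of my_unrelated:
program overlap
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: X(:), Y(:), Z(:), A(:)
  integer :: i
  allocate(X(n), Y(n), Z(n), A(n))
  X = 1; Y = 2; A = 0
  !$acc data copyin(X, Y) copyout(Z)
  ! launch the loop on the GPU without blocking the host
  !$acc parallel loop async(1)
  do i = 1, n
    Z(i) = X(i) + Y(i)
  end do
  ! unrelated CPU work overlaps the GPU kernel: this is the savings
  A = A + 1
  ! block until the asynchronous GPU work is complete
  !$acc wait(1)
  !$acc end data
  print *, Z(1), A(1)
end program overlap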
23. 23
Motivation for Tasks (synthetic)
call sub1(IN=A,OUT=B)
call sub2(IN=C,OUT=D)
call sub3(IN=E,OUT=F)
call sub4(IN1=B,IN2=D,OUT=G)
call sub5(IN1=F,IN2=G,OUT=H)
! 5 steps require only 3 phases
[Diagram: task graph: steps 1, 2, and 3 produce B, D, and F from A, C, and E and can run concurrently; step 4 combines B and D into G; step 5 combines F and G into H]
Fortran compilers may be able to prove
these procedures are independent, but it
is often impossible to prove that executing
them in parallel is profitable.
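A sketch of how this dependence structure can be stated explicitly today with OpenMP task dependences in Fortran; sub1..sub5 and the array names are the slide's placeholders, and the directives are my addition. The runtime is free to discover the 3-phase schedule from the depend clauses.
!$omp parallel
!$omp single
!$omp task depend(in: A) depend(out: B)
call sub1(IN=A, OUT=B)
!$omp end task
!$omp task depend(in: C) depend(out: D)
call sub2(IN=C, OUT=D)
!$omp end task
!$omp task depend(in: E) depend(out: F)
call sub3(IN=E, OUT=F)
!$omp end task
!$omp task depend(in: B, D) depend(out: G)
call sub4(IN1=B, IN2=D, OUT=G)
!$omp end task
!$omp task depend(in: F, G) depend(out: H)
call sub5(IN1=F, IN2=G, OUT=H)
!$omp end task
!$omp end single
!$omp end parallel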
24. 24
Motivation for Tasks (realistic)
https://dl.acm.org/doi/10.1145/2425676.2425687
https://pubs.acs.org/doi/abs/10.1021/ct100584w
25. 25
Describing asynchronous communication
subroutine stuff(A,B,C)
real :: A, B, C
call co_sum(A)
call co_min(B)
call co_max(C)
end subroutine stuff
subroutine stuff(A,B,C)
real :: A, B, C
task co_sum(A)
task co_min(B)
task co_max(C)
task_wait
end subroutine stuff
subroutine stuff(A,B,C)
use mpi_f08
real :: A, B, C
type(MPI_Request) :: R(3)
call MPI_Iallreduce(..A..SUM..R(1))
call MPI_Iallreduce(..B..MIN..R(2))
call MPI_Iallreduce(..C..MAX..R(3))
call MPI_Waitall(3,R,..)
end subroutine stuff
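A filled-in sketch of the MPI variant, assuming each argument is a scalar reduced in place across MPI_COMM_WORLD; the argument lists elided on the slide are my completion.
subroutine stuff(A, B, C)
  use mpi_f08
  implicit none
  real, intent(inout) :: A, B, C
  type(MPI_Request) :: R(3)
  ! start all three reductions; they proceed concurrently
  call MPI_Iallreduce(MPI_IN_PLACE, A, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, R(1))
  call MPI_Iallreduce(MPI_IN_PLACE, B, 1, MPI_REAL, MPI_MIN, MPI_COMM_WORLD, R(2))
  call MPI_Iallreduce(MPI_IN_PLACE, C, 1, MPI_REAL, MPI_MAX, MPI_COMM_WORLD, R(3))
  ! wait for all of them to complete
  call MPI_Waitall(3, R, MPI_STATUSES_IGNORE)
end subroutine stuff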
26. 26
Describing asynchronous computation
do i = 1,b
C(i) = MATMUL(A(i),B(i))
end do
do i = 1,b
task
C(i) = MATMUL(A(i),B(i))
end task
end do
task_wait
cudaStreamCreate(s)
cublasCreate(h)
cublasSetStream(h,s)
do i = 1,b
cublasDgemm_v2(h,
cu_op_n,cu_op_n,
n,n,n,
one,A(i),n,B(i),n,
one,C(i),n)
end do
cudaDeviceSynchronize()
27. 27
Describing asynchronous computation
do i = 1,b
C(i) = MATMUL(A(i),B(i))
end do
do i = 1,b
j = mod(i,8)
task j
C(i) = MATMUL(A(i),B(i))
end task
end do
task_wait
do j=1,8
cudaStreamCreate(s(j))
cublasCreate(h(j))
cublasSetStream(h(j), s(j))
end do
do i = 1,b
j = mod(i,8)
cublasDgemm_v2(h(j),
cu_op_n,cu_op_n,
n,n,n,
one,A(i),n,B(i),n,
one,C(i),n)
end do
cudaDeviceSynchronize()
https://github.com/nwchemgit/nwchem/blob/master/src/ccsd/ccsd_trpdrv_openacc.F
28. 28
J3/WG5 papers targeting Fortran 2026
https://j3-fortran.org/doc/year/22/22-169.pdf Fortran asynchronous tasks
https://j3-fortran.org/doc/year/23/23-174.pdf Asynchronous Tasks in Fortran
There is consensus that this is a good feature to add to Fortran, but we have a long way
to go to define syntax and semantics. We will not just copy C++, nor specify threads.
32. 32
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
! Static Parallelization
MySet = decompose[ (1:N)^4 ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
33. 33
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End Forall
! Static Parallelization
IJKL = (1:N)^4
MySet = decompose[ NonZero(IJKL) ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
34. 34
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
V = (IJ|KL) ! Variable cost
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End Forall
! Static Parallelization
IJKL = (1:N)^4
MySet = decompose[ Cost(IJKL) ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
35. 35
Quantum Chemistry Algorithms
! Dynamic Parallelization
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End If
End Forall
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
Task(I,J,K,L)
End If
End If
End Forall
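MyTurn() is typically implemented as a shared global counter (the NXTVAL idiom from Global Arrays): each process atomically claims the next task number and executes only the iterations it has claimed. A sketch using MPI one-sided operations follows; the module and procedure names are mine, not NWChem's, which uses the Global Arrays runtime rather than MPI windows.
module myturn_mod
  use mpi_f08
  implicit none
  type(MPI_Win) :: win
  integer :: counter = 0   ! the shared counter, hosted on rank 0
  integer :: itask = 0     ! global index of the current candidate task
  integer :: mine = -1     ! task number this process has claimed
contains
  subroutine myturn_init()
    integer(kind=MPI_ADDRESS_KIND) :: winsize
    integer :: me
    call MPI_Comm_rank(MPI_COMM_WORLD, me)
    winsize = merge(storage_size(counter)/8, 0, me == 0)
    call MPI_Win_create(counter, winsize, storage_size(counter)/8, &
                        MPI_INFO_NULL, MPI_COMM_WORLD, win)
    mine = nxtval()
  end subroutine myturn_init
  integer function nxtval()
    ! atomically fetch-and-increment the counter on rank 0
    integer :: one, old
    integer(kind=MPI_ADDRESS_KIND) :: disp
    one = 1; disp = 0
    call MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win)
    call MPI_Fetch_and_op(one, old, MPI_INTEGER, 0, disp, MPI_SUM, win)
    call MPI_Win_unlock(0, win)
    nxtval = old
  end function nxtval
  logical function myturn()
    ! true if this process owns the current task; if so, claim the next one
    myturn = (itask == mine)
    if (myturn) mine = nxtval()
    itask = itask + 1
  end function myturn
end module myturn_mod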
36. 36
Quantum Chemistry Algorithms
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
Task(I,J,K,L)
End If
End If
End Forall
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
FancySystem(NonZeroSet,Task)
37. 37
Summary
NWChem, GAMESS, and other QC codes distribute irregular computations by
decoupling the work decomposition from the processing elements.
The body of a distributed loop is a task.
Efficient when num_tasks >> num_proc and dynamic scheduling is cheap.
Overdecomposition + Dynamic Scheduling = AMT w/o the system
https://www.mcs.anl.gov/papers/P3056-1112_1.pdf
38. 38
Summary
• Task parallelism, which may be asynchronous, is under consideration for
Fortran standardization.
• Learn from prior art in OpenMP, OpenACC, Ada, etc.
• Descriptive, not prescriptive, behavior, like DO CONCURRENT.
• Successful distributed memory quantum chemistry codes are implicitly using
AMT concepts, but without explicit tasks or a tasking system.
• Irregular workloads or inhomogeneous system performance are nicely solved by AMT
systems, but not all apps are capable of adopting AMT systems.
• Can we find ways to subtly bring AMT concepts into more “old fashioned” apps?