2. Introduction
Motivating example
Type system
Demonstration
Other applications
Tasks and processes
Task execution
Research
Conclusion
GA Tech | 2014/09/22| 2
4. My aims for a new parallel programming system
1. There are many types of parallelism
⇒ Uniform treatment of parallelism
2. Data movement is more important than computation
⇒ While acknowledging the realities of hardware
3. CS theory seems to ignore HPC-type parallelism
⇒ Strongly theory-based
IMP: Integrative Model for Parallelism
5. Design of a programming system
One needs to distinguish:
Programming model: how does it look in code
Execution model: how is it actually executed
Data model: how is data placed and moved about
Three different vocabularies!
6. Programming model
Sequential semantics
[A]n HPF program may be understood (and debugged)
using sequential semantics, a deterministic world that we are
comfortable with. Once again, as in traditional programming,
the programmer works with a single address space, treating an
array as a single, monolithic object, regardless of how it may
be distributed across the memories of a parallel machine.
(Nikhil 1993)
As opposed to
[H]umans are quickly overwhelmed by concurrency and
find it much more difficult to reason about concurrent than
sequential code. Even careful people miss possible interleavings
among even simple collections of partially ordered operations.
(Sutter and Larus 2005)
8. Programming model
Sequential semantics is close to the mathematics of the problem.
Note: sequential semantics in the programming model does not
mean BSP synchronization in the execution.
Also note: sequential semantics is subtly different from SPMD
(but at least SPMD puts you in the asynchronous mindset)
9. Execution model
Virtual machine: dataflow.
Dataflow expresses the essential dependencies in an algorithm.
Dataflow applies to multiple parallelism models.
But it would be a mistake to program dataflow explicitly.
10. Data model
Distribution: mapping from processors to data.
(note: traditionally the other way around)
Needed (and missing from existing systems such as UPC, HPF):
distributions need to be first-class objects:
⇒ we want an algebra of distributions
algorithms need to be expressed in distributions
12. Integrative Model for Parallelism (IMP)
Theoretical model for describing parallelism
Library (or maybe language) for describing operations on
parallel data
Minimal, yet sufficient, specification of parallelism (as first-class
objects), including messages and task dependencies.
⇒ Specify what, not how
⇒ Improve programmer productivity, code quality, efficiency
and robustness
16. 1D example: 3-pt averaging
Data parallel calculation: y_i = f(x_{i−1}, x_i, x_{i+1})
Each point has a dependency on three points, some on other
processing elements
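The dependency structure of this kernel can be made concrete with a small sketch. The code below is an illustration, not the IMP library: it shows the sequential semantics of three-point averaging, and which indices of x each processing element would need under an assumed block distribution (function names and the block layout are assumptions).

```python
def average3(x):
    """Sequential semantics of the kernel: y_i = (x_{i-1} + x_i + x_{i+1}) / 3,
    clamping at the boundaries."""
    n = len(x)
    return [
        (x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3
        for i in range(n)
    ]

def needed_indices(p, nprocs, n):
    """Indices of x that processor p needs to compute its block of y:
    its own block plus one ghost point on each side."""
    lo, hi = p * n // nprocs, (p + 1) * n // nprocs
    return set(range(max(lo - 1, 0), min(hi + 1, n)))

x = [float(i) for i in range(8)]
y = average3(x)
# With 2 processors, processor 0 owns y[0:4] but needs x[0:5]:
print(needed_indices(0, 2, 8))  # {0, 1, 2, 3, 4}
```

The extra point at each block boundary is exactly the dependency "on other processing elements" mentioned above.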
22. Dataflow
We get a dependency structure:
Interpretation:
Tasks: local task graph
Message passing: messages
Note: this structure follows from the distributions of the algorithm;
it is not programmed.
23. Algorithms in the Integrative Model
Kernel: mapping between two distributed objects
An algorithm consists of Kernels
Each kernel consists of independent operations/tasks
Traditional elements of parallel programming are derived from
the kernel specification.
A kernel defines a function
I_f : N → 2^N,
for instance I_f(i) = {i, i−1, i+1}.
28. Distributions
A distribution is a (non-disjoint, non-unique) mapping from processors
to sets of indices:
d : P → 2^N
Distributed data:
x(d) : p ↦ { x_i : i ∈ d(p) }
Operations on distributions:
g : N → N ⇒ g(d) : p ↦ { g(i) : i ∈ d(p) }
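These definitions can be written down almost verbatim as a toy algebra of distributions. This is a hedged sketch: the class and method names are assumptions for illustration, not the IMP API.

```python
class Distribution:
    """A distribution: a mapping from processors to sets of indices,
    d : P -> 2^N (possibly non-disjoint, non-unique)."""

    def __init__(self, mapping):
        # mapping: processor -> iterable of indices
        self.mapping = {p: set(ix) for p, ix in mapping.items()}

    def __call__(self, p):
        """d(p): the index set assigned to processor p."""
        return self.mapping[p]

    def apply(self, g):
        """g(d) : p |-> { g(i) : i in d(p) }, for g : N -> N."""
        return Distribution({p: {g(i) for i in ix}
                             for p, ix in self.mapping.items()})

    def shift(self, s):
        """The shifted distribution d + s, a common special case."""
        return self.apply(lambda i: i + s)

# Block distribution of 8 indices over 2 processors:
d = Distribution({0: range(0, 4), 1: range(4, 8)})
print(d.shift(-1)(0))  # {-1, 0, 1, 2}: processor 0's block shifted left
```

Because distributions are first-class values here, expressions such as `d.shift(-1)` can appear directly in algorithm descriptions, which is the "algebra of distributions" asked for earlier.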
29. Algorithms in terms of distributions
If d is a distribution, and (funky notation) arithmetic on distributed
objects is taken pointwise,
the motivating example becomes:
y(d) = x(d) + x(d − 1) + x(d + 1)
51. Parts of a dataflow graph can be realized with OMP tasks
or MPI messages.
The total dataflow graph comes from all kernels and
all processes in kernels.
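The claim that the dataflow structure is derived, not programmed, can be illustrated in a few lines. In this hypothetical sketch (names `alpha` and `beta` are assumptions, not IMP terminology as given on these slides), `alpha[q]` is the data processor q owns and `beta[p]` is the data processor p needs; every message then falls out as an intersection.

```python
def derive_messages(alpha, beta, procs):
    """Return {(q, p): indices that q must send to p}, derived purely
    from the owning and needing distributions."""
    msgs = {}
    for p in procs:
        for q in procs:
            if q == p:
                continue
            overlap = alpha[q] & beta[p]
            if overlap:
                msgs[(q, p)] = overlap
    return msgs

# Block-distributed x over 2 processors; three-point averaging makes
# each processor also need its neighbors' edge points:
alpha = {0: set(range(0, 4)), 1: set(range(4, 8))}
beta = {p: {i + s for i in alpha[p] for s in (-1, 0, 1)} for p in alpha}
print(derive_messages(alpha, beta, [0, 1]))
# {(1, 0): {4}, (0, 1): {3}}
```

The same derived structure can be interpreted either as MPI messages or as dependencies between OMP tasks.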
52. To summarize
Distribution language is global with sequential semantics
Leads to dataflow formulation
Can be interpreted in multiple parallelism modes
Execution likely to be efficient
54. Can you code this?
As a library / internal DSL: express distributions in a custom
API, write the local operation in ordinary C/F
⇒ easy integration in existing codes
As a programming language / external DSL: requires compiler
technology:
⇒ prospect for interactions between data movement and local
code.
55. Approach taken
Program expresses the sequential semantics of kernels
Base class to realize the IMP concepts
One derived class that turns IMP into MPI
One derived class that turns IMP into OpenMP+tasks
Total: few thousand lines.
60. (Do I really have to put up performance graphs?)
61. (Do I really have to put up performance graphs?)
[Plots: Gflop under strong scaling of vector averaging, OpenMP vs IMP
(2–16 cores); Gflop under weak scaling of vector averaging, MPI vs IMP
(up to ~1000 processes)]
62. Summary: the motivating example in parallel language
Write the three-point averaging as
y(u) = ( x(u) + x(u − 1) + x(u + 1) ) / 3
Global description, sequential semantics
Execution is driven by dataflow, no synchronization
γ-distribution given by context
β-distribution is u + (u − 1) + (u + 1)
Messages and task dependencies are derived.
67.
Redundant computation is never explicitly mentioned.
(This can be coded; the code is essentially the same as the formulas.)
72. Define a task t′ as a synchronization point if
t′ is an immediate predecessor on another processor:
t ∈ C_p ∧ t′ ≺ t ∧ t′ ∈ C_{p′} ∧ p ≠ p′.
If L ⊂ Task, the base B_L is
B_L = { t ∈ L : pred(t) ⊄ L }.
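The base B_L can be computed directly from a task graph. Below is a small illustrative sketch (the (processor, step) task encoding and the pred function are assumptions, not IMP code).

```python
def base(L, pred):
    """B_L: the tasks in L with at least one predecessor outside L."""
    return {t for t in L if not pred(t) <= L}

def pred(task):
    """Three-point averaging on 4 processors: task (p, k) depends on
    its neighbors (p-1, p, p+1) at step k - 1."""
    p, k = task
    if k == 0:
        return set()
    return {(q, k - 1) for q in (p - 1, p, p + 1) if 0 <= q <= 3}

L = {(p, 1) for p in range(4)}
# Every step-1 task has step-0 predecessors outside L, so B_L = L:
print(base(L, pred) == L)  # True
```

Enlarging L to include the step-0 tasks empties the base, since every predecessor is then inside L.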
73. Local computations
A two-parameter covering {L_{k,p}}_{k,p} of T is called a local
computation if:
1. the p index corresponds to the division in processors:
C_p = ∪_k L_{k,p};
2. the k index corresponds to the partial ordering on tasks: the
sets L_k = ∪_p L_{k,p} satisfy
t ∈ L_k ∧ t′ ≺ t ⇒ t′ ∈ ∪_{ℓ≤k} L_ℓ;
3. the synchronization points synchronize only with previous
levels:
pred(B_{k,p}) ⊂ C_p ∪ ∪_{ℓ<k} L_ℓ.
For a given k, all L_{k,p} can be executed independently.
74. [Figure: three candidate coverings (a), (b), (c)]
Are these local computations? Yes, No, Yes
76. These definitions can be given purely in terms of the task graph.
The programmer decides how `thick' to make the L_{k,p} covering;
communication-avoiding scheduling is formally derived.
77. Co-processors
Distributions can describe data placement.
Our main worry is latency of data movement: in IMP, data can be
sent as early as possible; our communication-avoiding compiler
transforms algorithms to maximize granularity.
79. What is a task?
A task is a finite state automaton with five states; transitions are
triggered by receiving signals from other tasks:
requesting: each task starts out by posting a request for
incoming data to each of its predecessors.
accepting: the requested data is in the process of arriving or
being made available.
exec: the data dependencies are satisfied and the task executes.
In a refinement of this model there
can be a separate exec state for each predecessor.
avail: data that was produced and that serves as origin for
some dependency is published to all successor tasks.
used: all published origin data has been absorbed by the
endpoint of the data dependency, and any temporary
buffers can be released.
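The five-state automaton above can be sketched in a few lines. This is a hedged illustration with assumed method names and signal handling, not the IMP implementation:

```python
class Task:
    """Five-state task automaton: requesting -> accepting -> exec ->
    avail -> used, driven by signals from predecessors/successors."""

    STATES = ["requesting", "accepting", "exec", "avail", "used"]

    def __init__(self, npredecessors, nsuccessors):
        self.state = "requesting"           # requests posted to predecessors
        self.pending_data = npredecessors   # data items still to arrive
        self.pending_acks = nsuccessors     # successors still reading output

    def on_data_incoming(self):
        if self.state == "requesting":
            self.state = "accepting"        # data is arriving

    def on_data_arrived(self):
        self.pending_data -= 1
        if self.pending_data == 0:
            self.state = "exec"             # all dependencies satisfied

    def on_exec_done(self):
        assert self.state == "exec"
        self.state = "avail"                # output published to successors

    def on_successor_done(self):
        self.pending_acks -= 1
        if self.pending_acks == 0:
            self.state = "used"             # buffers can be released

t = Task(npredecessors=2, nsuccessors=1)
t.on_data_incoming(); t.on_data_arrived(); t.on_data_arrived()
t.on_exec_done(); t.on_successor_done()
print(t.state)  # used
```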
83. [Diagram: states of a task on processor p alongside the states of
predecessor processors q and successor processors s, with the control
messages notifyReadyToSend, requestToSend, sendData, and
acknowledgeReceipt driving the requesting → accepting → exec →
avail → used transitions]
84. How does a processor manage tasks?
Theorem: if you get a request-to-send, you can release the send
buffers of your predecessor tasks.
Corollary: we have a functional model that doesn't need garbage
collection.
86. Open questions
Many!
Software is barely in demonstration stage: needs much more
functionality
Theoretical questions: SSA, cost, scheduling, …
Practical questions: interaction with local code, heterogeneity,
interaction with hardware
Applications: this works for traditional HPC, N-body, probably
sorting and graph algorithms. Beyond?
Software-hardware co-design: the IMP model has semantics for
data movement; hardware can be made more efficient using
this.
88. The future's so bright, I gotta wear shades
IMP has the right abstraction level: global expression, yet
natural derivation of practical concepts.
Concept notation looks humanly possible: basis for an
expressive programming system
Global description without talking about processes/processors:
prospect for heterogeneous programming
All concepts are explicit: middleware for scheduling, resilience,
et cetera
Applications to most conceivable scientific …