All-Reduce and Prefix-Sum Operations
• In all-reduce, each node starts with a buffer of size m and the final
results of the operation are identical buffers of size m on each node
that are formed by combining the original p buffers using an
associative operator.
• Semantically identical to an all-to-one reduction followed by a one-to-all
broadcast, but that formulation is not the most efficient. A better approach
uses the pattern of all-to-all broadcast instead; the only difference is that
the message size does not increase here. Time for this operation is
(ts + twm) log p.
• Different from all-to-all reduction, in which p simultaneous all-to-one
reductions take place, each with a different destination for the
result.
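As an illustration (not part of the original slides), here is a minimal Python sketch that simulates this all-to-all-broadcast style all-reduce on p = 2^d nodes: in step j, node i combines its partial result with that of node i XOR 2^j, so the message size stays m and the time is (ts + twm) log p.

    import operator

    def all_reduce(values, op=operator.add):
        # values[i] is node i's local buffer; p must be a power of two
        p = len(values)
        d = p.bit_length() - 1
        buf = list(values)
        for j in range(d):                      # one step per hypercube dimension
            new = buf[:]
            for i in range(p):
                partner = i ^ (1 << j)          # exchange with neighbor in dimension j
                new[i] = op(buf[i], buf[partner])
            buf = new
        return buf                              # identical combined result on every node

    print(all_reduce([1, 2, 3, 4, 5, 6, 7, 8]))   # [36, 36, 36, 36, 36, 36, 36, 36]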
The Prefix-Sum Operation
• Given p numbers n0, n1, …, np-1 (one on each node), the problem is to
compute the sums sk = ∑i=0..k ni for all k between 0 and p-1.
• Initially, nk resides on the node labeled k, and at the end of the
procedure, the same node holds sk.
The Prefix-Sum Operation
Computing prefix sums on an eight-node hypercube. At each node, square brackets
show the local prefix sum accumulated in the result buffer and parentheses enclose
the contents of the outgoing message buffer for the next step.
The Prefix-Sum Operation
• The operation can be implemented using the all-to-all broadcast
kernel.
• We must account for the fact that in prefix sums the node with label
k uses information from only the k-node subset whose labels are less
than or equal to k.
• This is implemented using an additional result buffer. The content of
an incoming message is added to the result buffer only if the
message comes from a node with a smaller label than the recipient
node.
• The contents of the outgoing message (denoted by parentheses in
the figure) are updated with every incoming message.
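A minimal Python sketch of this scheme (a simulation, not the book's pseudocode): msg plays the role of the outgoing message buffer, result the role of the result buffer, and only messages arriving from smaller-labeled partners are added into result.

    def prefix_sums(values):
        # values[k] is the number initially on node k; p must be a power of two
        p = len(values)
        d = p.bit_length() - 1
        result = list(values)                   # [.] result buffer on each node
        msg = list(values)                      # (.) outgoing message buffer on each node
        for j in range(d):
            incoming = [msg[k ^ (1 << j)] for k in range(p)]
            for k in range(p):
                partner = k ^ (1 << j)
                msg[k] += incoming[k]           # message buffer always accumulates
                if partner < k:                 # result accepts only smaller labels
                    result[k] += incoming[k]
        return result

    print(prefix_sums([0, 1, 2, 3, 4, 5, 6, 7]))   # [0, 1, 3, 6, 10, 15, 21, 28]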
The Prefix-Sum Operation
Prefix sums on a d-dimensional hypercube.
Scatter and Gather
• In the scatter operation, a single node sends a unique message of
size m to every other node (also called a one-to-all personalized
communication).
• In the gather operation, a single node collects a unique message
from each node.
• While the scatter operation is fundamentally different from
broadcast, the algorithmic structure is similar, except for differences
in message sizes (messages get smaller in scatter and stay constant in
broadcast).
• The gather operation is exactly the inverse of the scatter operation
and can be executed as such.
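A minimal Python sketch of the hypercube scatter (a simulation; the gather is the same exchange pattern run in reverse): the source ships the half of its messages destined for the other half of the cube, and both the active machine and the data held halve at every step.

    def scatter(messages, root=0):
        # messages[i] is the piece destined for node i; p must be a power of two
        p = len(messages)
        d = p.bit_length() - 1
        held = {root: dict(enumerate(messages))}      # node -> {destination: message}
        for j in reversed(range(d)):                  # highest dimension first
            for node in list(held):
                partner = node ^ (1 << j)
                # ship every piece whose destination lies in the partner's subcube
                ship = {dst: m for dst, m in held[node].items()
                        if ((dst >> j) & 1) != ((node >> j) & 1)}
                for dst in ship:
                    del held[node][dst]
                held[partner] = ship
        return [held[i][i] for i in range(p)]          # node i ends up with messages[i]

    print(scatter(["m0", "m1", "m2", "m3", "m4", "m5", "m6", "m7"]))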
Gather and Scatter Operations
Scatter and gather operations.
Example of the Scatter Operation
The scatter operation on an eight-node hypercube.
Cost of Scatter and Gather
• There are log p steps; in each step, the machine size halves and the
data size halves.
• We have the time for this operation to be:
T = ts log p + twm(p – 1).
• This time holds for a linear array as well as a 2-D mesh.
• These times are asymptotically optimal in message size.
All-to-All Personalized Communication
• Each node has a distinct message of size m for every other node.
• This is unlike all-to-all broadcast, in which each node sends the same
message to all other nodes.
• All-to-all personalized communication is also known as total
exchange.
All-to-All Personalized Communication
All-to-all personalized communication.
All-to-All Personalized Communication:
Example
• Consider the problem of transposing a matrix.
• Each processor contains one full row of the matrix.
• The transpose operation in this case is identical to an all-to-all
personalized communication operation.
All-to-All Personalized Communication:
Example
All-to-all personalized communication in transposing a 4 x 4 matrix using four
processes.
All-to-All Personalized Communication
on a Ring
• Each node sends all pieces of data as one consolidated message of
size m(p – 1) to one of its neighbors.
• Each node extracts the information meant for it from the data
received, and forwards the remaining (p – 2) pieces of size m each to
the next node.
• The algorithm terminates in p – 1 steps.
• The size of the message reduces by m at each step.
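The following Python sketch (a plain simulation, with the message {x, y} represented as a tuple (x, y)) mimics this: every node pushes one consolidated message to its right neighbour each step, keeps the piece addressed to itself, and forwards the rest.

    def total_exchange_ring(p):
        # outbox[i] holds the pieces node i still has to pass on; piece = (source, destination)
        outbox = [[(src, dst) for dst in range(p) if dst != src] for src in range(p)]
        inbox = [[] for _ in range(p)]                 # pieces delivered to each node
        for _ in range(p - 1):                         # p - 1 steps in all
            received = [outbox[(i - 1) % p] for i in range(p)]   # from the left neighbour
            for i in range(p):
                inbox[i] += [piece for piece in received[i] if piece[1] == i]
                outbox[i] = [piece for piece in received[i] if piece[1] != i]  # shrinks by one piece
        return inbox

    for node, pieces in enumerate(total_exchange_ring(6)):
        print(node, sorted(pieces))    # node i ends up with the pieces {x, i} from every other node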
All-to-All Personalized Communication
on a Ring
All-to-all personalized communication on a six-node ring. The label of each message is
of the form {x,y}, where x is the label of the node that originally owned the message,
and y is the label of the node that is the final destination of the message. The label
({x1,y1}, {x2,y2},…, {xn,yn}) indicates a message that is formed by concatenating n
individual messages.
All-to-All Personalized Communication
on a Ring: Cost
• We have p – 1 steps in all.
• In step i, the message size is m(p – i).
• The total time is given by:
T = ∑i=1..p-1 (ts + twm(p – i)) = (ts + twmp/2)(p – 1).
• The tw term in this equation can be reduced by a factor of 2 by
communicating messages in both directions.
All-to-All Personalized Communication
on a Mesh
• Each node first groups its p messages according to the columns of
their destination nodes.
• All-to-all personalized communication is performed independently in
each row with clustered messages of size m√p.
• Messages in each node are sorted again, this time according to the
rows of their destination nodes.
• All-to-all personalized communication is performed independently in
each column with clustered messages of size m√p.
All-to-All Personalized Communication
on a Mesh
The distribution of messages at the beginning of each phase of all-to-all personalized
communication on a 3 x 3 mesh. At the end of the second phase, node i has messages ({0,i},…,
{8,i}), where 0 ≤ i ≤ 8. The groups of nodes communicating together in each phase are enclosed
in dotted boundaries.
All-to-All Personalized Communication
on a Mesh: Cost
• Time for the first phase is identical to that in a ring with √p processors,
i.e., (ts + twmp/2)(√p – 1).
• Time in the second phase is identical to the first phase. Therefore, the total
time is twice this time, i.e.,
T = (2ts + twmp)(√p – 1).
• It can be shown that the time for the local rearrangement of messages is
much less than this communication time.
All-to-All Personalized Communication
on a Hypercube
• Generalize the mesh algorithm to log p steps.
• At any stage in all-to-all personalized communication, every node
holds p packets of size m each.
• While communicating in a particular dimension, every node sends
p/2 of these packets (consolidated as one message).
• A node must rearrange its messages locally before each of the log p
communication steps.
All-to-All Personalized Communication
on a Hypercube
An all-to-all personalized communication algorithm on a three-dimensional hypercube.
All-to-All Personalized Communication
on a Hypercube: Cost
• We have log p iterations and mp/2 words are communicated in each
iteration. Therefore, the cost is:
T = (ts + twmp/2) log p.
• This is not optimal!
All-to-All Personalized Communication
on a Hypercube: Optimal Algorithm
• Each node simply performs p – 1 communication steps, exchanging
m words of data with a different node in every step.
• A node must choose its communication partner in each step so that
the hypercube links do not suffer congestion.
• In the jth communication step, node i exchanges data with node (i XOR j).
• In this schedule, all paths in every communication step are
congestion-free, and none of the bidirectional links carry more than
one message in the same direction.
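A short Python sketch of this schedule (simulation only; Mi,j is represented by the string "Mi,j"): in step j, node i hands M[i][i XOR j] directly to node i XOR j, so every step moves exactly one m-word message per node over congestion-free paths.

    def total_exchange_hypercube(p):
        # M[i][j] is the message that starts on node i and is destined for node j
        M = [[f"M{i},{j}" for j in range(p)] for i in range(p)]
        recv = [[None] * p for _ in range(p)]          # recv[j][i]: what node j got from node i
        for i in range(p):
            recv[i][i] = M[i][i]                       # a node keeps its own piece
        for j in range(1, p):                          # p - 1 communication steps
            for i in range(p):
                partner = i ^ j
                recv[partner][i] = M[i][partner]       # i sends M[i][i XOR j] to its partner
        return recv

    for row in total_exchange_hypercube(8):
        print(row)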
All-to-All Personalized Communication
on a Hypercube: Optimal Algorithm
Seven steps in all-to-all personalized communication on an eight-node hypercube.
All-to-All Personalized Communication
on a Hypercube: Optimal Algorithm
A procedure to perform all-to-all personalized communication on a d-
dimensional hypercube. The message Mi,j initially resides on node i and is
destined for node j.
All-to-All Personalized Communication on a
Hypercube: Cost Analysis of Optimal Algorithm
• There are p – 1 steps, and each step involves a non-congesting message
transfer of m words.
• We have:
T = (ts + twm)(p – 1).
• This is asymptotically optimal in message size.
Dense Matrix Algorithms
Ananth Grama, Anshul Gupta,
George Karypis, and Vipin Kumar
To accompany the text “Introduction to Parallel Computing”,
Addison Wesley, 2003.
Topic Overview
• Matrix-Vector Multiplication
• Matrix-Matrix Multiplication
• Solving a System of Linear Equations
Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations involving
matrices and vectors readily lend themselves to data-decomposition.
• Typical algorithms rely on input, output, or intermediate data
decomposition.
• Most algorithms use one- and two-dimensional block, cyclic, and
block-cyclic partitionings.
Matrix-Vector Multiplication
• We aim to multiply a dense n x n matrix A with an n x 1 vector x to
yield the n x 1 result vector y.
• The serial algorithm requires n² multiplications and additions.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n x n matrix is partitioned among n processors, with each
processor storing one complete row of the matrix.
• The n x 1 vector x is distributed such that each process owns one of
its elements.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Multiplication of an n x n matrix with an n x 1 vector using
rowwise block 1-D partitioning. For the one-row-per-process
case, p = n.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Since each process starts with only one element of x , an all-to-all
broadcast is required to distribute all the elements to all the
processes.
• Process Pi now computes y[i] = ∑j=0..n-1 A[i, j] · x[j].
• The all-to-all broadcast and the computation of y[i] both take time
Θ(n) . Therefore, the parallel time is Θ(n) .
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Consider now the case when p < n and we use block 1-D partitioning.
• Each process initially stores n/p complete rows of the matrix and a
portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes and involves
messages of size n/p.
• This is followed by n/p local dot products.
• Thus, the parallel run time of this procedure is
TP = n²/p + ts log p + twn.
This is cost-optimal.
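A small numpy sketch of this p < n case (the all-to-all broadcast is simulated by simply concatenating the vector pieces):

    import numpy as np

    def mat_vec_rowwise(A, x, p):
        rows = np.array_split(A, p)              # each "process" holds n/p rows of A
        x_parts = np.array_split(x, p)           # and n/p entries of x
        x_full = np.concatenate(x_parts)         # stands in for the all-to-all broadcast
        y_parts = [block @ x_full for block in rows]   # n/p local dot products per process
        return np.concatenate(y_parts)

    n, p = 8, 4
    A = np.arange(n * n, dtype=float).reshape(n, n)
    x = np.ones(n)
    assert np.allclose(mat_vec_rowwise(A, x, p), A @ x)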
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Scalability Analysis:
• We know that T0 = pTP – W; therefore, we have
T0 = tsp log p + twnp.
• For isoefficiency, we have W = KT0, where K = E/(1 – E) for desired
efficiency E.
• From this, we have W = O(p²) (from the tw term).
• There is also a bound on isoefficiency because of concurrency. In this
case, p < n, therefore, W = n² = Ω(p²).
• Overall isoefficiency is W = O(p²).
Matrix-Vector Multiplication:
2-D Partitioning
• The n x n matrix is partitioned among n² processors such that each
processor owns a single element.
• The n x 1 vector x is distributed only in the last column of n
processors.
Matrix-Vector Multiplication: 2-D Partitioning
Matrix-vector multiplication with block 2-D partitioning. For the
one-element-per-process case, p = n² if the matrix size is n x n.
Matrix-Vector Multiplication:
2-D Partitioning
• We must first align the vector with the matrix appropriately.
• The first communication step for the 2-D partitioning aligns the
vector x along the principal diagonal of the matrix.
• The second step copies the vector elements from each diagonal
process to all the processes in the corresponding column using n
simultaneous broadcasts among all processors in the column.
• Finally, the result vector is computed by performing an all-to-one
reduction along each row.
Matrix-Vector Multiplication:
2-D Partitioning
• Three basic communication operations are used in this algorithm:
one-to-one communication to align the vector along the main
diagonal, one-to-all broadcast of each vector element among the n
processes of each column, and all-to-one reduction in each row.
• Each of these operations takes Θ(log n) time and the parallel time is
Θ(log n) .
• The cost (process-time product) is Θ(n² log n); hence, the algorithm
is not cost-optimal.
Matrix-Vector Multiplication:
2-D Partitioning
• When using fewer than n² processors, each process owns an
(n/√p) x (n/√p) block of the matrix.
• The vector is distributed in portions of n/√p elements in the last
process-column only.
• In this case, the message sizes for the alignment, broadcast, and
reduction are all n/√p.
• The computation is a product of an (n/√p) x (n/√p) submatrix with a
vector of length n/√p.
Matrix-Vector Multiplication:
2-D Partitioning
• The first alignment step takes time ts + twn/√p.
• The broadcast and reductions each take time (ts + twn/√p) log(√p).
• Local matrix-vector products take time n²/p.
• Total time is TP ≈ n²/p + ts log p + tw(n/√p) log p.
Matrix-Vector Multiplication:
2-D Partitioning
• Scalability Analysis:
• T0 = pTP – W = tsp log p + twn√p log p.
• Equating T0 with W, term by term, for isoefficiency, we have
W = K²tw²p log²p as the dominant term.
• The isoefficiency due to concurrency is O(p).
• The overall isoefficiency is O(p log²p) (due to the network
bandwidth).
• For cost optimality, we have W = n² = Ω(p log²p). For this, we have
p = O(n²/log²n).
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n x n dense, square matrices A
and B to yield the product matrix C =A x B.
• The serial complexity is O(n³).
• We do not consider better serial algorithms (Strassen's method),
although these can be used as serial kernels in the parallel algorithms.
• A useful concept in this case is called block operations. In this view, an n
x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q)
such that each block is an (n/q) x (n/q) submatrix.
• In this view, we perform q³ matrix multiplications, each involving
(n/q) x (n/q) matrices.
Matrix-Matrix Multiplication
• Consider two n x n matrices A and B partitioned into p blocks Ai,j
and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the
result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for
0 ≤ k < √p.
• All-to-all broadcast blocks of A along rows and B along columns.
• Perform local submatrix multiplication.
Matrix-Matrix Multiplication
• The two broadcasts take time 2(ts log(√p) + tw(n²/p)(√p – 1)).
• The computation requires √p multiplications of (n/√p) x (n/√p)
sized submatrices.
• The parallel run time is approximately
TP = n³/p + ts log p + 2twn²/√p.
• The algorithm is cost optimal and the isoefficiency is O(p^1.5) due to
bandwidth term tw and concurrency.
• Major drawback of the algorithm is that it is not memory optimal.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In this algorithm, we schedule the computations of the √p
processes of the ith row such that, at any given time, each process is
using a different block Ai,k.
• These blocks can be systematically rotated among the processes
after every submatrix multiplication so that every process gets a
fresh Ai,k after each rotation.
Matrix-Matrix Multiplication:
Cannon's Algorithm
Communication steps in Cannon's algorithm on 16 processes.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• Align the blocks of A and B in such a way that each process multiplies
its local submatrices. This is done by shifting all submatrices Ai,j to the
left (with wraparound) by i steps and all submatrices Bi,j up (with
wraparound) by j steps.
• Perform local block multiplication.
• Each block of A moves one step left and each block of B moves one
step up (again with wraparound).
• Perform next block multiplication, add to partial result, repeat until
all blocks have been multiplied.
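A numpy sketch of Cannon's algorithm on a √p x √p grid of blocks (simulated within one process; block (i, j) stands for process Pi,j):

    import numpy as np

    def cannon(A, B, q):                          # q = sqrt(p); n must be divisible by q
        n = A.shape[0]; b = n // q
        Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
        Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
        Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
        # alignment: shift row i of A left by i, column j of B up by j (with wraparound)
        Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
        for _ in range(q):                        # sqrt(p) compute-and-shift steps
            for i in range(q):
                for j in range(q):
                    Cb[i][j] += Ab[i][j] @ Bb[i][j]
            Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]   # A one step left
            Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]   # B one step up
        return np.block(Cb)

    n, q = 6, 3
    A = np.random.rand(n, n); B = np.random.rand(n, n)
    assert np.allclose(cannon(A, B, q), A @ B)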
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the maximum distance over which a
block shifts is √p – 1, the two shift operations require a total of
2(ts + twn²/p) time.
• Each of the √p single-step shifts in the compute-and-shift phase of
the algorithm takes ts + twn²/p time.
• The computation time for multiplying √p matrices of size
(n/√p) x (n/√p) is n³/p.
• The parallel time is approximately:
TP = n³/p + 2√p ts + 2twn²/√p.
• The cost-optimality and isoefficiency of the algorithm are identical to
those of the first algorithm, except that this one is memory optimal.
Matrix-Matrix Multiplication:
DNS Algorithm
• Uses a 3-D partitioning.
• Visualize the matrix multiplication algorithm as a cube: matrices A
and B come in two orthogonal faces and result C comes out the
other orthogonal face.
• Each internal node in the cube represents a single add-multiply
operation (and thus the Θ(n³) complexity).
• DNS algorithm partitions this cube using a 3-D block scheme.
Matrix-Matrix Multiplication:
DNS Algorithm
• Assume an n x n x n mesh of processors.
• Move the columns of A and rows of B and perform broadcast.
• Each processor computes a single add-multiply.
• This is followed by an accumulation along the C dimension.
• Since each add-multiply takes constant time and accumulation and
broadcast takes log n time, the total runtime is log n.
• This is not cost optimal. It can be made cost optimal by using n / log n
processors along the direction of accumulation.
Matrix-Matrix Multiplication:
DNS Algorithm
The communication steps in the DNS algorithm while
multiplying 4 x 4 matrices A and B on 64 processes.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processors.
• Assume that the number of processes p is equal to q³ for some q < n.
• The two matrices are partitioned into blocks of size (n/q) x(n/q).
• Each matrix can thus be regarded as a q x q two-dimensional square
array of blocks.
• The algorithm follows from the previous one, except, in this case, we
operate on blocks rather than on individual elements.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processors.
• The first one-to-one communication step is performed for both A
and B, and takes ts + tw(n/q)² time for each matrix.
• The two one-to-all broadcasts take (ts + tw(n/q)²) log q time for each
matrix.
• The reduction takes time (ts + tw(n/q)²) log q.
• Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time.
• The parallel time is approximated by:
TP ≈ n³/p + ts log p + tw(n²/p^(2/3)) log p.
• The isoefficiency function is Θ(p log³ p).
Solving a System of Linear Equations
• Consider the problem of solving linear equations of the kind:
a0,0x0 + a0,1x1 + … + a0,n-1xn-1 = b0
a1,0x0 + a1,1x1 + … + a1,n-1xn-1 = b1
…
an-1,0x0 + an-1,1x1 + … + an-1,n-1xn-1 = bn-1
• This is written as Ax = b, where A is an n x n matrix with A[i, j] = ai,j,
b is an n x 1 vector [b0, b1, …, bn-1]T, and x is the solution.
Solving a System of Linear Equations
Two steps in the solution are: reduction to triangular form, and
back-substitution. The triangular form is an upper-triangular system
(U[i, j] = 0 for i > j). We write this as: Ux = y.
A commonly used method for transforming a given matrix into an
upper-triangular matrix is Gaussian Elimination.
Gaussian Elimination
Serial Gaussian Elimination
Gaussian Elimination
• The computation has three nested loops - in the kth iteration of the
outer loop, the algorithm performs (n – k)² computations. Summing from
k = 1..n, we have roughly n³/3 multiplications-subtractions.
A typical computation in Gaussian elimination.
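A minimal Python/numpy sketch of this serial elimination loop (no pivoting; it assumes a nonzero A[k, k] at every step) that produces a unit upper-triangular U and the updated right-hand side y:

    import numpy as np

    def gaussian_eliminate(A, b):
        A = A.astype(float).copy(); y = b.astype(float).copy()
        n = A.shape[0]
        for k in range(n):
            pivot = A[k, k]                      # normalization (division) step
            A[k, k:] /= pivot
            y[k] /= pivot
            for i in range(k + 1, n):            # eliminate row k from rows k+1 .. n-1
                factor = A[i, k]
                A[i, k:] -= factor * A[k, k:]
                y[i] -= factor * y[k]
        return A, y                              # A now holds U, with Ux = y

    A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
    b = np.array([1., 2., 3.])
    U, y = gaussian_eliminate(A, b)
    assert np.allclose(A @ np.linalg.solve(U, y), b)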
Parallel Gaussian Elimination
• Assume p = n with each row assigned to a processor.
• The first step of the algorithm normalizes the row. This is a serial
operation and takes time (n – k) in the kth iteration.
• In the second step, the normalized row is broadcast to all the
processors. This takes time (ts + tw(n – k – 1)) log n.
• Each processor can independently eliminate this row from its own. This
requires (n – k – 1) multiplications and subtractions.
• The total parallel time can be computed by summing from k = 1 … n-1
as
TP = (3/2)n(n – 1) + tsn log n + (1/2)twn(n – 1) log n.
• The formulation is not cost optimal because of the tw term.
Parallel Gaussian Elimination
Gaussian elimination steps during the iteration corresponding to k =
3 for an 8 x 8 matrix partitioned rowwise among eight processes.
Parallel Gaussian Elimination:
Pipelined Execution
• In the previous formulation, the (k+1)st iteration starts only after all
the computation and communication for the kth iteration is
complete.
• In the pipelined version, there are three steps - normalization of a
row, communication, and elimination. These steps are performed in
an asynchronous fashion.
• A processor Pk waits to receive and eliminate all rows prior to k.
• Once it has done this, it forwards its own row to processor Pk+1.
Parallel Gaussian Elimination:
Pipelined Execution
Pipelined Gaussian elimination on a 5 x 5 matrix partitioned
with one row per process.
Parallel Gaussian Elimination:
Pipelined Execution
• The total number of steps in the entire pipelined procedure is Θ(n).
• In any step, either O(n) elements are communicated between
directly-connected processes, or a division step is performed on O(n)
elements of a row, or an elimination step is performed on O(n)
elements of a row.
• The parallel time is therefore O(n²).
• This is cost optimal.
Parallel Gaussian Elimination:
Pipelined Execution
The communication in the Gaussian elimination iteration
corresponding to k = 3 for an 8 x 8 matrix distributed among
four processes using block 1-D partitioning.
Parallel Gaussian Elimination:
Block 1D with p < n
• The above algorithm can be easily adapted to the case when p < n.
• In the kth iteration, a processor with all rows belonging to the active part
of the matrix performs (n – k – 1)n/p multiplications and subtractions.
• In the pipelined version, for n > p, computation dominates
communication.
• The parallel time is given approximately by n³/p.
• While the algorithm is cost optimal, the cost of the parallel algorithm is
higher than the sequential run time by a factor of 3/2.
Parallel Gaussian Elimination:
Block 1D with p < n
Computation load on different processes in block and cyclic
1-D partitioning of an 8 x 8 matrix on four processes during the
Gaussian elimination iteration corresponding to k = 3.
Parallel Gaussian Elimination:
Block 1D with p < n
• The load imbalance problem can be alleviated by using a cyclic
mapping.
• In this case, other than processing of the last p rows, there is no load
imbalance.
• This corresponds to a cumulative load imbalance overhead of O(n²p)
(instead of O(n³) in the previous case).
Gaussian Elimination
with Partial Pivoting
• For numerical stability, one generally uses partial pivoting.
• In the kth iteration, we select a column i (called the pivot column)
such that A[k, i] is the largest in magnitude among all A[k, j] with
k ≤ j < n.
• The k th and the i th columns are interchanged.
• Simple to implement with row-partitioning and does not add
overhead since the division step takes the same time as computing
the max.
• Column-partitioning, however, requires a global reduction, adding a
log p term to the overhead.
• Pivoting precludes the use of pipelining.
Gaussian Elimination with Partial
Pivoting: 2-D Partitioning
• Partial pivoting restricts use of pipelining, resulting in performance
loss.
• This loss can be alleviated by restricting pivoting to specific columns.
• Alternately, we can use faster algorithms for broadcast.
Solving a Triangular System:
Back-Substitution
• The upper triangular matrix U undergoes back-substitution to
determine the vector x.
A serial algorithm for back-substitution.
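A minimal Python/numpy sketch of serial back-substitution on Ux = y (roughly n²/2 multiply-subtract pairs):

    import numpy as np

    def back_substitute(U, y):
        n = U.shape[0]
        x = np.zeros(n)
        for k in range(n - 1, -1, -1):                 # solve from the last row upward
            x[k] = (y[k] - U[k, k+1:] @ x[k+1:]) / U[k, k]
        return x

    U = np.array([[2., 1., 1.], [0., 3., 2.], [0., 0., 4.]])
    y = np.array([7., 8., 4.])
    assert np.allclose(U @ back_substitute(U, y), y)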
Solving a Triangular System:
Back-Substitution
• The algorithm performs approximately n²/2 multiplications and
subtractions.
• Since complexity of this part is asymptotically lower, we should optimize
the data distribution for the factorization part.
• Consider a rowwise block 1-D mapping of the n x n matrix U with vector
y distributed uniformly.
• The value of the variable solved at a step can be pipelined back.
• Each step of a pipelined implementation requires a constant amount of
time for communication and Θ(n/p) time for computation.
• The parallel run time of the entire algorithm is Θ(n²/p).
Solving a Triangular System:
Back-Substitution
• If the matrix is partitioned by using 2-D partitioning on a logical
√p x √p mesh of processes, and the elements of the vector are
distributed along one of the columns of the process mesh, then only
the √p processes containing the vector perform any computation.
• Using pipelining to communicate the appropriate elements of U to
the process containing the corresponding elements of y for the
substitution step (line 7), the algorithm can be executed in Θ(n²/√p)
time.
• While this is not cost optimal, since this does not dominate the
overall computation, the cost optimality is determined by the
factorization.
Sorting Algorithms
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
To accompany the text ``Introduction to Parallel Computing'',
Addison Wesley, 2003.
Topic Overview
• Issues in Sorting on Parallel Computers
• Sorting Networks
• Bubble Sort and its Variants
• Quicksort
• Bucket and Sample Sort
• Other Sorting Algorithms
Sorting: Overview
• One of the most commonly used and well-studied kernels.
• Sorting can be comparison-based or noncomparison-based.
• The fundamental operation of comparison-based sorting is compare-
exchange.
• The lower bound on any comparison-based sort of n numbers is
Θ(n log n).
• We focus here on comparison-based sorting algorithms.
Sorting: Basics
What is a parallel sorted sequence? Where are the input and output lists
stored?
• We assume that the input and output lists are distributed.
• The sorted list is partitioned with the property that each partitioned list is
sorted and each element in processor Pi's list is less than every element in
Pj's list if i < j.
Sorting: Parallel Compare Exchange Operation
A parallel compare-exchange operation. Processes Pi and Pj send their
elements to each other. Process Pi keeps min{ai,aj}, and Pj keeps max{ai,
aj}.
Sorting: Basics
What is the parallel counterpart to a sequential comparator?
• If each processor has one element, the compare exchange operation stores
the smaller element at the processor with smaller id. This can be done in ts
+ tw time.
• If we have more than one element per processor, we call this operation a
compare split. Assume each of the two processors has n/p elements.
• After the compare-split operation, the smaller n/p elements are at
processor Pi and the larger n/p elements at Pj, where i < j.
• The time for a compare-split operation is (ts+ twn/p), assuming that the
two partial lists were initially sorted.
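A small Python sketch of compare-split (both blocks assumed already sorted, as in the analysis above):

    from heapq import merge

    def compare_split(block_i, block_j):
        # Pi and Pj exchange blocks, both merge them, and keep opposite halves
        merged = list(merge(block_i, block_j))
        half = len(block_i)
        return merged[:half], merged[half:]       # (smaller half for Pi, larger half for Pj)

    lo, hi = compare_split([1, 6, 8, 11, 13], [2, 7, 9, 10, 12])
    print(lo, hi)    # [1, 2, 6, 7, 8] [9, 10, 11, 12, 13]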
Sorting: Parallel Compare Split Operation
A compare-split operation. Each process sends its block of size n/p to
the other process. Each process merges the received block with its
own block and retains only the appropriate half of the merged block.
In this example, process Pi retains the smaller elements and process Pj
retains the larger elements.
Sorting Networks
• Networks of comparators designed specifically for sorting.
• A comparator is a device with two inputs x and y and two outputs x'
and y'. For an increasing comparator, x' = min{x,y} and y' =
max{x,y}; for a decreasing comparator, the reverse.
• We denote an increasing comparator by ⊕ and a decreasing
comparator by Ө.
• The speed of the network is proportional to its depth.
Sorting Networks: Comparators
A schematic representation of comparators: (a) an increasing comparator,
and (b) a decreasing comparator.
Sorting Networks
A typical sorting network. Every sorting network is made up of a
series of columns, and each column contains a number of
comparators connected in parallel.
Sorting Networks: Bitonic Sort
• A bitonic sorting network sorts n elements in Θ(log² n) time.
• A bitonic sequence has two tones - increasing and decreasing, or vice versa.
Any cyclic rotation of such a sequence is also considered bitonic.
• 〈1,2,4,7,6,0〉 is a bitonic sequence, because it first increases and then
decreases. 〈8,9,2,1,0,4〉 is another bitonic sequence, because it is a cyclic
shift of 〈0,4,8,9,2,1〉.
• The kernel of the network is the rearrangement of a bitonic sequence into a
sorted sequence.
Sorting Networks: Bitonic Sort
• Let s = 〈a0,a1,…,an-1〉 be a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ an/2-1
and an/2 ≥ an/2+1 ≥ ··· ≥ an-1.
• Consider the following subsequences of s:
s1 = 〈min{a0,an/2},min{a1,an/2+1},…,min{an/2-1,an-1}〉
s2 = 〈max{a0,an/2},max{a1,an/2+1},…,max{an/2-1,an-1}〉
(1)
• Note that s1 and s2 are both bitonic and each element of s1 is less than
every element in s2.
• We can apply the procedure recursively on s1 and s2 to get the sorted
sequence.
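A small Python sketch of this recursive bitonic merge (input length a power of two and bitonic; increasing=False yields a decreasing output, matching the Ө comparators):

    def bitonic_merge(s, increasing=True):
        n = len(s)
        if n == 1:
            return s
        half = n // 2
        keep_lo = min if increasing else max
        keep_hi = max if increasing else min
        s1 = [keep_lo(s[i], s[i + half]) for i in range(half)]   # both halves are bitonic,
        s2 = [keep_hi(s[i], s[i + half]) for i in range(half)]   # and s1 <= s2 elementwise
        return bitonic_merge(s1, increasing) + bitonic_merge(s2, increasing)

    print(bitonic_merge([1, 2, 4, 7, 6, 0, -1, -5]))   # [-5, -1, 0, 1, 2, 4, 6, 7]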
Sorting Networks: Bitonic Sort
Merging a 16-element bitonic sequence through a series of log 16
bitonic splits.
Sorting Networks: Bitonic Sort
• We can easily build a sorting network to implement this bitonic merge
algorithm.
• Such a network is called a bitonic merging network.
• The network contains log n columns. Each column contains n/2
comparators and performs one step of the bitonic merge.
• We denote a bitonic merging network with n inputs by ⊕BM[n].
• Replacing the ⊕ comparators by Ө comparators results in a decreasing
output sequence; such a network is denoted by ӨBM[n].
Sorting Networks: Bitonic Sort
A bitonic merging network for n = 16. The input wires are numbered 0,1,…, n
- 1, and the binary representation of these numbers is shown. Each column of
comparators is drawn separately; the entire figure represents a ⊕BM[16]
bitonic merging network. The network takes a bitonic sequence and outputs it
in sorted order.
Sorting Networks: Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
• We must first build a single bitonic sequence from the given sequence.
• A sequence of length 2 is a bitonic sequence.
• A bitonic sequence of length 4 can be built by sorting the first two elements
using ⊕BM[2] and next two, using ӨBM[2].
• This process can be repeated to generate larger bitonic sequences.
Sorting Networks: Bitonic Sort
A schematic representation of a network that converts an input
sequence into a bitonic sequence. In this example, ⊕BM[k] and
ӨBM[k] denote bitonic merging networks of input size k that use ⊕
and Ө comparators, respectively. The last merging network
(⊕BM[16]) sorts the input. In this example, n = 16.
Sorting Networks: Bitonic Sort
The comparator network that transforms an input sequence of 16
unordered numbers into a bitonic sequence.
Sorting Networks: Bitonic Sort
• The depth of the network is Θ(log² n).
• Each stage of the network contains n/2 comparators. A serial
implementation of the network would have complexity Θ(n log² n).
Mapping Bitonic Sort to Hypercubes
• Consider the case of one item per processor. The question becomes one of
how the wires in the bitonic network should be mapped to the hypercube
interconnect.
• Note from our earlier examples that the compare-exchange operation is
performed between two wires only if their labels differ in exactly one bit!
• This implies a direct mapping of wires to processors. All communication is
nearest neighbor!
Mapping Bitonic Sort to Hypercubes
Communication during the last stage of bitonic sort. Each wire is mapped
to a hypercube process; each connection represents a compare-
exchange between processes.
Mapping Bitonic Sort to Hypercubes
Communication characteristics of bitonic sort on a hypercube. During
each stage of the algorithm, processes communicate along the
dimensions shown.
Mapping Bitonic Sort to Hypercubes
Parallel formulation of bitonic sort on a hypercube with n = 2^d
processes.
Mapping Bitonic Sort to Hypercubes
• During each step of the algorithm, every process performs a
compare-exchange operation (single nearest neighbor
communication of one word).
• Since each step takes Θ(1) time, the parallel time is
TP = Θ(log² n) (2)
• This algorithm is cost optimal w.r.t. its serial counterpart, but not
w.r.t. the best sorting algorithm.
Mapping Bitonic Sort to Meshes
• The connectivity of a mesh is lower than that of a hypercube, so we
must expect some overhead in this mapping.
• Consider the row-major shuffled mapping of wires to processors.
Mapping Bitonic Sort to Meshes
Different ways of mapping the input wires of the bitonic sorting network
to a mesh of processes: (a) row-major mapping, (b) row-major snakelike
mapping, and (c) row-major shuffled mapping.
Mapping Bitonic Sort to Meshes
The last stage of the bitonic sort algorithm for n = 16 on a mesh, using
the row-major shuffled mapping. During each step, process pairs
compare-exchange their elements. Arrows indicate the pairs of
processes that perform compare-exchange operations.
Mapping Bitonic Sort to Meshes
• In the row-major shuffled mapping, wires that differ at the ith least-
significant bit are mapped onto mesh processes that are 2^⌊(i-1)/2⌋
communication links away.
• The total amount of communication performed by each process is
∑i=1..log n ∑j=1..i 2^⌊(j-1)/2⌋ ≈ 7√n, or Θ(√n). The total computation
performed by each process is Θ(log² n).
• The parallel runtime is TP = Θ(√n).
• This is not cost optimal.
Block of Elements Per Processor
• Each process is assigned a block of n/p elements.
• The first step is a local sort of the local block.
• Each subsequent compare-exchange operation is replaced by a
compare-split operation.
• We can effectively view the bitonic network as having (1 + log p)
(log p)/2 steps.
Block of Elements Per Processor: Hypercube
• Initially the processes sort their n/p elements (using merge sort) in time
Θ((n/p) log(n/p)) and then perform Θ(log² p) compare-split steps.
• The parallel run time of this formulation is
TP = Θ((n/p) log(n/p)) + Θ((n/p) log² p).
• Comparing to an optimal sort, the algorithm can efficiently use up to
p = Θ(2^√(log n)) processes.
• The isoefficiency function due to both communication and extra work is
Θ(p^(log p) log² p).
Block of Elements Per Processor: Mesh
• The parallel runtime in this case is given by:
TP = Θ((n/p) log(n/p)) + Θ((n/p) log² p) + Θ(n/√p).
• This formulation can efficiently use up to p = Θ(log² n) processes.
• The isoefficiency function is Θ(√p 2^√p).
Performance of Parallel Bitonic Sort
The performance of parallel formulations of bitonic sort for n elements
on p processes.
Bubble Sort and its Variants
The sequential bubble sort algorithm compares and exchanges
adjacent elements in the sequence to be sorted:
Sequential bubble sort algorithm.
Bubble Sort and its Variants
• The complexity of bubble sort is Θ(n²).
• Bubble sort is difficult to parallelize since the algorithm has no
concurrency.
• A simple variant, though, uncovers the concurrency.
Odd-Even Transposition
Sequential odd-even transposition sort algorithm.
Odd-Even Transposition
Sorting n = 8 elements, using the odd-even transposition sort
algorithm. During each phase, n = 8 elements are compared.
Odd-Even Transposition
• After n phases of odd-even exchanges, the sequence is sorted.
• Each phase of the algorithm (either odd or even) requires Θ(n)
comparisons.
• Serial complexity is Θ(n²).
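A minimal Python sketch of the sequential odd-even transposition sort referenced above:

    def odd_even_transposition_sort(a):
        a = list(a)
        n = len(a)
        for phase in range(n):                    # n phases suffice
            start = 0 if phase % 2 == 0 else 1    # even phases pair (0,1),(2,3),...; odd pair (1,2),(3,4),...
            for i in range(start, n - 1, 2):
                if a[i] > a[i + 1]:               # compare-exchange
                    a[i], a[i + 1] = a[i + 1], a[i]
        return a

    print(odd_even_transposition_sort([5, 9, 4, 3, 1, 2, 8, 7]))   # [1, 2, 3, 4, 5, 7, 8, 9]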
Parallel Odd-Even Transposition
• Consider the one item per processor case.
• There are n iterations, in each iteration, each processor does one
compare-exchange.
• The parallel run time of this formulation is Θ(n).
• This is cost optimal with respect to the base serial algorithm but not
the optimal one.
Parallel Odd-Even Transposition
Parallel formulation of odd-even transposition.
Parallel Odd-Even Transposition
• Consider a block of n/p elements per processor.
• The first step is a local sort.
• In each subsequent step, the compare exchange operation is
replaced by the compare split operation.
• The parallel run time of the formulation is
TP = Θ((n/p) log(n/p)) + Θ(n) + Θ(n),
where the three terms are the local sort, the comparisons, and the
communication over the p compare-split phases.
Parallel Odd-Even Transposition
• The parallel formulation is cost-optimal for p = O(log n).
• The isoefficiency function of this parallel formulation is Θ(p 2^p).
Shellsort
• Let n be the number of elements to be sorted and p be the number
of processes.
• During the first phase, processes that are far away from each other in
the array compare-split their elements.
• During the second phase, the algorithm switches to an odd-even
transposition sort.
Parallel Shellsort
• Initially, each process sorts its block of n/p elements internally.
• Each process is now paired with its corresponding process in the reverse
order of the array. That is, process Pi, where i < p/2, is paired with
process Pp-i-1.
• A compare-split operation is performed.
• The processes are split into two groups of size p/2 each and the
process repeated in each group.
Parallel Shellsort
An example of the first phase of parallel shellsort on an eight-process
array.
Parallel Shellsort
• Each process performs d = log p compare-split operations.
• With O(p) bisection width, each communication can be performed in time
Θ(n/p) for a total time of Θ((n log p)/p).
• In the second phase, l odd and even phases are performed, each requiring
time Θ(n/p).
• The parallel run time of the algorithm is:
TP = Θ((n/p) log(n/p)) + Θ((n log p)/p) + Θ(l n/p).
Quicksort
• Quicksort is one of the most common sorting algorithms for
sequential computers because of its simplicity, low overhead, and
optimal average complexity.
• Quicksort selects one of the entries in the sequence to be the pivot
and divides the sequence into two - one with all elements less than
the pivot and the other with all elements greater.
• The process is recursively applied to each of the sublists.
Quicksort
The sequential quicksort algorithm.
Quicksort
Example of the quicksort algorithm sorting a sequence of size n = 8.
Quicksort
• The performance of quicksort depends critically on the quality of the
pivot.
• In the best case, the pivot divides the list in such a way that the
larger of the two lists does not have more than αn elements (for
some constant α).
• In this case, the complexity of quicksort is O(nlog n).
Parallelizing Quicksort
• Lets start with recursive decomposition - the list is partitioned
serially and each of the subproblems is handled by a different
processor.
• The time for this algorithm is lower-bounded by Ω(n)!
• Can we parallelize the partitioning step - in particular, if we can use n
processors to partition a list of length n around a pivot in O(1) time,
we have a winner.
• This is difficult to do on real machines, though.
Parallelizing Quicksort: PRAM Formulation
• We assume a CRCW (concurrent read, concurrent write) PRAM with
concurrent writes resulting in an arbitrary write succeeding.
• The formulation works by creating pools of processors. Every processor is
assigned to the same pool initially and has one element.
• Each processor attempts to write its element to a common location (for the
pool).
• Each processor tries to read back the location. If the value read back is
greater than the processor's value, it assigns itself to the `left' pool, else, it
assigns itself to the `right' pool.
• Each pool performs this operation recursively.
• Note that the algorithm generates a tree of pivots. The depth of the tree
determines the parallel runtime; its expected value is O(log n).
Parallelizing Quicksort: PRAM
Formulation
A binary tree generated by the execution of the quicksort algorithm. Each
level of the tree represents a different array-partitioning iteration. If
pivot selection is optimal, then the height of the tree is Θ(log n), which
is also the number of iterations.
Parallelizing Quicksort: PRAM Formulation
The execution of the PRAM algorithm on the array shown in (a).
Parallelizing Quicksort: Shared Address Space
Formulation
• Consider a list of size n equally divided across p processors.
• A pivot is selected by one of the processors and made known to all
processors.
• Each processor partitions its list into two, say Li and Ui, based on the
selected pivot.
• All of the Li lists are merged and all of the Ui lists are merged
separately.
• The set of processors is partitioned into two (in proportion of the size
of lists L and U). The process is recursively applied to each of the
lists.
Shared Address Space Formulation
Parallelizing Quicksort: Shared Address Space
Formulation
• The only thing we have not described is the global reorganization
(merging) of local lists to form L and U.
• The problem is one of determining the right location for each element in
the merged list.
• Each processor computes the number of elements locally less than and
greater than pivot.
• It computes two sum-scans to determine the starting location for its
elements in the merged L and U lists.
• Once it knows the starting locations, it can write its elements safely.
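A Python sketch of this rearrangement step (one pivot level, simulated sequentially): the exclusive prefix sums of the per-process counts give each process its write offsets into the merged L and U arrays.

    from itertools import accumulate

    def global_rearrange(local_lists, pivot):
        smaller = [[x for x in lst if x <= pivot] for lst in local_lists]   # local Li
        larger = [[x for x in lst if x > pivot] for lst in local_lists]     # local Ui
        s_counts = [len(s) for s in smaller]
        l_counts = [len(u) for u in larger]
        s_offsets = [0] + list(accumulate(s_counts))[:-1]   # exclusive sum-scan
        l_offsets = [0] + list(accumulate(l_counts))[:-1]
        total_s = sum(s_counts)
        out = [None] * (total_s + sum(l_counts))
        for proc, (s, u) in enumerate(zip(smaller, larger)):
            out[s_offsets[proc]:s_offsets[proc] + len(s)] = s                         # write into L
            out[total_s + l_offsets[proc]:total_s + l_offsets[proc] + len(u)] = u     # write into U
        return out, total_s                        # L is out[:total_s], U is out[total_s:]

    lists = [[7, 13, 18, 2], [17, 1, 14, 20], [6, 10, 15, 9], [3, 16, 19, 4]]
    print(global_rearrange(lists, pivot=10))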
Parallelizing Quicksort: Shared Address Space
Formulation
Efficient global rearrangement of the array.
Parallelizing Quicksort: Shared Address Space
Formulation
• The parallel time depends on the split and merge time, and the quality
of the pivot.
• The latter is an issue independent of parallelism, so we focus on the first
aspect, assuming ideal pivot selection.
• The algorithm executes in four steps: (i) determine and broadcast the
pivot; (ii) locally rearrange the array assigned to each process; (iii)
determine the locations in the globally rearranged array that the local
elements will go to; and (iv) perform the global rearrangement.
• The first step takes time Θ(log p), the second, Θ(n/p) , the third,
Θ(log p) , and the fourth, Θ(n/p).
• The overall complexity of splitting an n-element array is Θ(n/p) +
Θ(log p).
Parallelizing Quicksort: Shared Address Space
Formulation
• The process recurses until there are p lists, at which point, the lists are
sorted locally.
• Therefore, the total parallel time is:
TP = Θ((n/p) log(n/p)) + Θ((n/p) log p) + Θ(log² p).
• The corresponding isoefficiency is Θ(p log² p) due to broadcast and scan
operations.
Parallelizing Quicksort: Message Passing Formulation
• A simple message passing formulation is based on the recursive halving
of the machine.
• Assume that each processor in the lower half of a p processor ensemble
is paired with a corresponding processor in the upper half.
• A designated processor selects and broadcasts the pivot.
• Each processor splits its local list into two lists, one with elements less
than the pivot (Li) and the other with elements greater than the pivot (Ui).
• A processor in the low half of the machine sends its list Ui to the paired
processor in the other half. The paired processor sends its list Li.
• It is easy to see that after this step, all elements less than the pivot are in
the low half of the machine and all elements greater than the pivot are
in the high half.
Parallelizing Quicksort: Message Passing Formulation
• The above process is recursed until each processor has its own local list,
which is sorted locally.
• The time for a single reorganization is Θ(log p) for broadcasting the pivot
element, Θ(n/p) for splitting the locally assigned portion of the array,
Θ(n/p) for exchange and local reorganization.
• We note that this time is identical to that of the corresponding shared
address space formulation.
• It is important to remember that the reorganization of elements is a
bandwidth sensitive operation.
Graph Algorithms
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003
Topic Overview
• Definitions and Representation
• Minimum Spanning Tree: Prim's Algorithm
• Single-Source Shortest Paths: Dijkstra's Algorithm
• All-Pairs Shortest Paths
• Transitive Closure
• Connected Components
• Algorithms for Sparse Graphs
Definitions and Representation
• An undirected graph G is a pair (V,E), where V is a finite set of points
called vertices and E is a finite set of edges.
• An edge e ∈ E is an unordered pair (u,v), where u,v ∈ V.
• In a directed graph, the edge e is an ordered pair (u,v). An edge (u,v)
is incident from vertex u and is incident to vertex v.
• A path from a vertex v to a vertex u is a sequence <v0,v1,v2,…,vk> of
vertices where v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1,…, k-1.
• The length of a path is defined as the number of edges in the path.
Definitions and Representation
a) An undirected graph and (b) a directed graph.
Definitions and Representation
• An undirected graph is connected if every pair of vertices is
connected by a path.
• A forest is an acyclic graph, and a tree is a connected acyclic graph.
• A graph that has weights associated with each edge is called a
weighted graph.
Definitions and Representation
• Graphs can be represented by their adjacency matrix or an edge (or
vertex) list.
• Adjacency matrices have a value ai,j = 1 if nodes i and j share an edge;
0 otherwise. In case of a weighted graph, ai,j = wi,j, the weight of the
edge.
• The adjacency list representation of a graph G = (V,E) consists of an
array Adj[1..|V|] of lists. Each list Adj[v] is a list of all vertices
adjacent to v.
• For a graph with n nodes, an adjacency matrix takes Θ(n²) space and
an adjacency list takes Θ(|E|) space.
Definitions and Representation
An undirected graph and its adjacency matrix representation.
An undirected graph and its adjacency list representation.
Minimum Spanning Tree
• A spanning tree of an undirected graph G is a subgraph of G that is a
tree containing all the vertices of G.
• In a weighted graph, the weight of a subgraph is the sum of the
weights of the edges in the subgraph.
• A minimum spanning tree (MST) for a weighted undirected graph is a
spanning tree with minimum weight.
Minimum Spanning Tree
An undirected graph and its minimum spanning tree.
Minimum Spanning Tree: Prim's
Algorithm
• Prim's algorithm for finding an MST is a greedy algorithm.
• Start by selecting an arbitrary vertex, include it into the current MST.
• Grow the current MST by inserting into it the vertex closest to one of
the vertices already in current MST.
Minimum Spanning Tree: Prim's Algorithm
Prim's minimum spanning tree algorithm.
Minimum Spanning Tree: Prim's
Algorithm
Prim's sequential minimum spanning tree algorithm.
Prim's Algorithm: Parallel Formulation
• The algorithm works in n outer iterations - it is hard to execute these
iterations concurrently.
• The inner loop is relatively easy to parallelize. Let p be the number of
processes, and let n be the number of vertices.
• The adjacency matrix is partitioned in a 1-D block fashion, with distance
vector d partitioned accordingly.
• In each step, a processor selects the locally closest node, followed by a
global reduction to select globally closest node.
• This node is inserted into MST, and the choice broadcast to all
processors.
• Each processor updates its part of the d vector locally.
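A compact numpy sketch of the loop being parallelized (run on one process here; with the 1-D partitioning above, the argmin over d becomes a local minimum followed by a global all-to-one reduction and a broadcast, while the d update stays purely local):

    import numpy as np

    def prim_mst_weight(W):                       # W[i, j] = edge weight, np.inf if absent
        n = W.shape[0]
        in_tree = np.zeros(n, dtype=bool)
        d = np.full(n, np.inf)
        d[0] = 0.0                                # start the MST from vertex 0
        total = 0.0
        for _ in range(n):                        # n sequential outer iterations
            u = int(np.argmin(np.where(in_tree, np.inf, d)))   # the reduction step
            total += d[u]
            in_tree[u] = True
            d = np.where(in_tree, d, np.minimum(d, W[u]))      # local update of d
        return total

    INF = np.inf
    W = np.array([[INF, 1.0, 3.0, INF],
                  [1.0, INF, 1.0, 6.0],
                  [3.0, 1.0, INF, 2.0],
                  [INF, 6.0, 2.0, INF]])
    print(prim_mst_weight(W))    # 4.0  (MST edges 0-1, 1-2, 2-3)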
Prim's Algorithm: Parallel Formulation
The partitioning of the distance array d and the adjacency matrix A among p
processes.
Prim's Algorithm: Parallel Formulation
• The cost to select the minimum entry is O(n/p + log p).
• The cost of a broadcast is O(log p).
• The cost of local updation of the d vector is O(n/p).
• The parallel time per iteration is O(n/p + log p).
• The total parallel time is given by O(n²/p + n log p).
• The corresponding isoefficiency is O(p² log² p).
  • 5. The Prefix-Sum Operation Prefix sums on a d-dimensional hypercube.
  • 6. Scatter and Gather • In the scatter operation, a single node sends a unique message of size m to every other node (also called a one-to-all personalized communication). • In the gather operation, a single node collects a unique message from each node. • While the scatter operation is fundamentally different from broadcast, the algorithmic structure is similar, except for differences in message sizes (messages get smaller in scatter and stay constant in broadcast). • The gather operation is exactly the inverse of the scatter operation and can be executed as such.
  • 7. Gather and Scatter Operations Scatter and gather operations.
  • 8. Example of the Scatter Operation The scatter operation on an eight-node hypercube.
  • 9. Cost of Scatter and Gather • There are log p steps, in each step, the machine size halves and the data size halves. • We have the time for this operation to be: • This time holds for a linear array as well as a 2-D mesh. • These times are asymptotically optimal in message size.
  • 10. All-to-All Personalized Communication • Each node has a distinct message of size m for every other node. • This is unlike all-to-all broadcast, in which each node sends the same message to all other nodes. • All-to-all personalized communication is also known as total exchange.
  • 12. All-to-All Personalized Communication: Example • Consider the problem of transposing a matrix. • Each processor contains one full row of the matrix. • The transpose operation in this case is identical to an all-to-all personalized communication operation.
  • 13. All-to-All Personalized Communication: Example All-to-all personalized communication in transposing a 4 x 4 matrix using four processes.
  • 14. All-to-All Personalized Communication on a Ring • Each node sends all pieces of data as one consolidated message of size m(p – 1) to one of its neighbors. • Each node extracts the information meant for it from the data received, and forwards the remaining (p – 2) pieces of size m each to the next node. • The algorithm terminates in p – 1 steps. • The size of the message reduces by m at each step.
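As a concrete illustration (not part of the original slides), the following Python sketch simulates the ring schedule just described: each node forwards one consolidated message to its neighbor, extracts the piece addressed to it, and passes the rest along for p – 1 steps. The function name `ring_total_exchange` and the (src, dst) message tags are assumptions made for this example.

```python
# Minimal sketch: simulate all-to-all personalized communication on a p-node ring.
# Each message is tagged (src, dst); node i starts with one message for every other node.
def ring_total_exchange(p):
    # outgoing[i] holds the consolidated message node i will forward next
    outgoing = [[(i, dst) for dst in range(p) if dst != i] for i in range(p)]
    received = [[] for _ in range(p)]
    for step in range(p - 1):
        incoming = [None] * p
        for i in range(p):
            incoming[(i + 1) % p] = outgoing[i]                   # send to right neighbor
        for i in range(p):
            received[i].extend(m for m in incoming[i] if m[1] == i)   # extract own pieces
            outgoing[i] = [m for m in incoming[i] if m[1] != i]       # forward the rest
    return received

if __name__ == "__main__":
    recv = ring_total_exchange(6)
    # every node should end up with exactly p - 1 pieces, one from each other node
    assert all(sorted(src for src, _ in recv[i]) ==
               [j for j in range(6) if j != i] for i in range(6))
    print("ring total exchange completed in p - 1 = 5 steps")
```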
  • 15. All-to-All Personalized Communication on a Ring All-to-all personalized communication on a six-node ring. The label of each message is of the form {x,y}, where x is the label of the node that originally owned the message, and y is the label of the node that is its final destination. The label ({x1,y1}, {x2,y2},…,{xn,yn}) indicates a message formed by concatenating n individual messages.
  • 16. All-to-All Personalized Communication on a Ring: Cost • We have p – 1 steps in all. • In step i, the message size is m(p – i). • The total time is given by T = ∑i=1..p–1 (ts + twm(p – i)) = (ts + twmp/2)(p – 1). • The tw term in this equation can be reduced by a factor of 2 by communicating messages in both directions.
  • 17. All-to-All Personalized Communication on a Mesh • Each node first groups its p messages according to the columns of their destination nodes. • All-to-all personalized communication is performed independently in each row with clustered messages of size m√p. • Messages in each node are sorted again, this time according to the rows of their destination nodes. • All-to-all personalized communication is performed independently in each column with clustered messages of size m√p.
  • 18. All-to-All Personalized Communication on a Mesh The distribution of messages at the beginning of each phase of all-to-all personalized communication on a 3 x 3 mesh. At the end of the second phase, node i has messages ({0,i},…, {8,i}), where 0 ≤ i ≤ 8. The groups of nodes communicating together in each phase are enclosed in dotted boundaries.
  • 19. All-to-All Personalized Communication on a Mesh: Cost • Time for the first phase is identical to that in a ring with √p processors, i.e., (ts + twmp/2)(√p – 1). • Time in the second phase is identical to the first phase. Therefore, the total time is twice this, i.e., T = (2ts + twmp)(√p – 1). • It can be shown that the time for the local rearrangement of messages is much less than this communication time.
  • 20. All-to-All Personalized Communication on a Hypercube • Generalize the mesh algorithm to log p steps. • At any stage in all-to-all personalized communication, every node holds p packets of size m each. • While communicating in a particular dimension, every node sends p/2 of these packets (consolidated as one message). • A node must rearrange its messages locally before each of the log p communication steps.
  • 21. All-to-All Personalized Communication on a Hypercube An all-to-all personalized communication algorithm on a three-dimensional hypercube.
  • 22. All-to-All Personalized Communication on a Hypercube: Cost • We have log p iterations and mp/2 words are communicated in each iteration. Therefore, the cost is T = (ts + twmp/2) log p. • This is not optimal!
  • 23. All-to-All Personalized Communication on a Hypercube: Optimal Algorithm • Each node simply performs p – 1 communication steps, exchanging m words of data with a different node in every step. • A node must choose its communication partner in each step so that the hypercube links do not suffer congestion. • In the jth communication step, node i exchanges data with node (i XOR j). • In this schedule, all paths in every communication step are congestion-free, and none of the bidirectional links carry more than one message in the same direction.
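To make the pairing rule tangible, here is a short Python check (an illustration added to the transcript, not the deck's own procedure) that in step j every node i is matched with node i XOR j, that each step is a perfect matching, and that every pair of nodes meets exactly once over the p – 1 steps. The function name `xor_schedule` is an assumption for this example.

```python
# Sketch: the XOR pairing schedule for all-to-all personalized communication
# on a hypercube with p = 2^d nodes. In step j (1 <= j <= p-1), node i exchanges
# m words with node i ^ j; every pair of nodes is matched exactly once.
def xor_schedule(d):
    p = 1 << d
    steps = []
    for j in range(1, p):
        pairs = {(min(i, i ^ j), max(i, i ^ j)) for i in range(p)}
        assert len(pairs) == p // 2          # each step is a perfect matching of p/2 pairs
        steps.append(sorted(pairs))
    return steps

if __name__ == "__main__":
    sched = xor_schedule(3)                  # eight-node hypercube, seven steps
    all_pairs = [pr for step in sched for pr in step]
    assert len(all_pairs) == len(set(all_pairs)) == 8 * 7 // 2
    for j, step in enumerate(sched, start=1):
        print(f"step {j}: {step}")
```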
  • 24. All-to-All Personalized Communication on a Hypercube: Optimal Algorithm Seven steps in all-to-all personalized communication on an eight-node hypercube.
  • 25. All-to-All Personalized Communication on a Hypercube: Optimal Algorithm A procedure to perform all-to-all personalized communication on a d- dimensional hypercube. The message Mi,j initially resides on node i and is destined for node j.
  • 26. All-to-All Personalized Communication on a Hypercube: Cost Analysis of Optimal Algorithm • There are p – 1 steps and each step involves a congestion-free transfer of m words. • We have T = (ts + twm)(p – 1). • This is asymptotically optimal in message size.
  • 27. Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text “Introduction to Parallel Computing”, Addison Wesley, 2003.
  • 28. Topic Overview • Matrix-Vector Multiplication • Matrix-Matrix Multiplication • Solving a System of Linear Equations
  • 29. Matrix Algorithms: Introduction • Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data-decomposition. • Typical algorithms rely on input, output, or intermediate data decomposition. • Most algorithms use one- and two-dimensional block, cyclic, and block-cyclic partitionings.
  • 30. Matrix-Vector Multiplication • We aim to multiply a dense n x n matrix A with an n x 1 vector x to yield the n x 1 result vector y. • The serial algorithm requires n2 multiplications and additions.
  • 31. Matrix-Vector Multiplication: Rowwise 1-D Partitioning • The n x n matrix is partitioned among n processors, with each processor storing one complete row of the matrix. • The n x 1 vector x is distributed such that each process owns one of its elements.
  • 32. Matrix-Vector Multiplication: Rowwise 1-D Partitioning Multiplication of an n x n matrix with an n x 1 vector using rowwise block 1-D partitioning. For the one-row-per-process case, p = n.
  • 33. Matrix-Vector Multiplication: Rowwise 1-D Partitioning • Since each process starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processes. • Process Pi now computes y[i] = Σj A[i, j] · x[j]. • The all-to-all broadcast and the computation of y[i] both take time Θ(n). Therefore, the parallel time is Θ(n).
  • 34. Matrix-Vector Multiplication: Rowwise 1-D Partitioning • Consider now the case when p < n and we use block 1-D partitioning. • Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p. • The all-to-all broadcast takes place among p processes and involves messages of size n/p. • This is followed by n/p local dot products of length n each. • Thus, the parallel run time of this procedure is TP = n²/p + ts log p + twn. This is cost-optimal.
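The following NumPy sketch (added for illustration; it simulates the processes sequentially rather than exchanging real messages) mirrors the block 1-D scheme: each "process" holds n/p rows and n/p vector elements, the all-to-all broadcast is modeled by assembling the full vector, and the local dot products produce that process's slice of y.

```python
import numpy as np

# Sketch of rowwise block 1-D matrix-vector multiplication with p processes,
# simulated sequentially: each "process" holds n/p rows of A and n/p elements of x.
def matvec_rowwise_1d(A, x, p):
    n = A.shape[0]
    assert n % p == 0
    b = n // p
    x_blocks = [x[k * b:(k + 1) * b] for k in range(p)]
    # all-to-all broadcast of the vector blocks: every process assembles the full x
    x_full = np.concatenate(x_blocks)
    # each process computes its n/p entries of y with local dot products
    y_blocks = [A[k * b:(k + 1) * b, :] @ x_full for k in range(p)]
    return np.concatenate(y_blocks)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    x = rng.standard_normal(8)
    assert np.allclose(matvec_rowwise_1d(A, x, 4), A @ x)
```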
  • 35. Matrix-Vector Multiplication: Rowwise 1-D Partitioning Scalability Analysis: • We know that T0 = pTP – W; therefore, we have T0 = tsp log p + twnp. • For isoefficiency, we have W = KT0, where K = E/(1 – E) for the desired efficiency E. • From this, we have W = O(p²) (from the tw term). • There is also a bound on isoefficiency because of concurrency. In this case, p < n; therefore, W = n² = Ω(p²). • Overall isoefficiency is W = O(p²).
  • 36. Matrix-Vector Multiplication: 2-D Partitioning • The n x n matrix is partitioned among n2 processors such that each processor owns a single element. • The n x 1 vector x is distributed only in the last column of n processors.
  • 37. Matrix-Vector Multiplication: 2-D Partitioning Matrix-vector multiplication with block 2-D partitioning. For the one-element-per-process case, p = n2 if the matrix size is n x n .
  • 38. Matrix-Vector Multiplication: 2-D Partitioning • We must first align the vector with the matrix appropriately. • The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix. • The second step copies the vector elements from each diagonal process to all the processes in the corresponding column using n simultaneous broadcasts among all processors in the column. • Finally, the result vector is computed by performing an all-to-one reduction along the columns.
  • 39. Matrix-Vector Multiplication: 2-D Partitioning • Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row. • Each of these operations takes Θ(log n) time and the parallel time is Θ(log n) . • The cost (process-time product) is Θ(n2 log n) ; hence, the algorithm is not cost-optimal.
  • 40. Matrix-Vector Multiplication: 2-D Partitioning • When using fewer than n² processors, each process owns an (n/√p) x (n/√p) block of the matrix. • The vector is distributed in portions of n/√p elements in the last process-column only. • In this case, the message sizes for the alignment, broadcast, and reduction are all n/√p. • The computation is a product of an (n/√p) x (n/√p) submatrix with a vector of length n/√p.
  • 41. Matrix-Vector Multiplication: 2-D Partitioning • The first alignment step takes time ts + twn/√p. • The broadcast and reductions take time (ts + twn/√p) log(√p) each. • Local matrix-vector products take time n²/p. • Total time is TP ≈ n²/p + ts log p + tw(n/√p) log p.
  • 42. Matrix-Vector Multiplication: 2-D Partitioning • Scalability Analysis: T0 = pTP – W = tsp log p + twn√p log p. • Equating T0 with W, term by term, for isoefficiency, we have W = K²tw²p log² p as the dominant term. • The isoefficiency due to concurrency is O(p). • The overall isoefficiency is O(p log² p) (due to the network bandwidth). • For cost optimality, we have n² = Ω(p log² p). For this, we have p = O(n²/log² n).
  • 43. Matrix-Matrix Multiplication • Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C =A x B. • The serial complexity is O(n3 ). • We do not consider better serial algorithms (Strassen's method), although, these can be used as serial kernels in the parallel algorithms. • A useful concept in this case is called block operations. In this view, an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix. • In this view, we perform q3 matrix multiplications, each involving (n/q) x (n/q) matrices.
  • 44. Matrix-Matrix Multiplication • Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each. • Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. • Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. • All-to-all broadcast blocks of A along rows and B along columns. • Perform local submatrix multiplication.
  • 45. Matrix-Matrix Multiplication • The two broadcasts take time 2(ts log √p + tw(n²/p)(√p – 1)). • The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices. • The parallel run time is approximately TP = n³/p + ts log p + 2twn²/√p. • The algorithm is cost optimal and the isoefficiency is O(p1.5) due to the bandwidth term tw and concurrency. • The major drawback of the algorithm is that it is not memory optimal.
  • 46. Matrix-Matrix Multiplication: Cannon's Algorithm • In this algorithm, we schedule the computations of the processes of the ith row such that, at any given time, each process is using a different block Ai,k. • These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.
  • 47. Matrix-Matrix Multiplication: Cannon's Algorithm Communication steps in Cannon's algorithm on 16 processes.
  • 48. Matrix-Matrix Multiplication: Cannon's Algorithm • Align the blocks of A and B in such a way that each process multiplies its local submatrices. This is done by shifting all submatrices Ai,j to the left (with wraparound) by i steps and all submatrices Bi,j up (with wraparound) by j steps. • Perform local block multiplication. • Each block of A moves one step left and each block of B moves one step up (again with wraparound). • Perform next block multiplication, add to partial result, repeat until all blocks have been multiplied.
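The sketch below (added for illustration, simulated sequentially with NumPy blocks rather than on a real process grid) follows the steps just listed: an initial alignment that shifts row i of A left by i and column j of B up by j, then q = √p rounds of local block multiply-accumulate followed by single-step wraparound shifts.

```python
import numpy as np

# Sketch of Cannon's algorithm on a q x q process grid (q = sqrt(p)),
# simulated sequentially; block size is (n/q) x (n/q).
def cannon_matmul(A, B, q):
    n = A.shape[0]
    b = n // q
    blk = lambda M, i, j: M[i * b:(i + 1) * b, j * b:(j + 1) * b].copy()
    Ab = [[blk(A, i, j) for j in range(q)] for i in range(q)]
    Bb = [[blk(B, i, j) for j in range(q)] for i in range(q)]
    # initial alignment: shift row i of A left by i, column j of B up by j
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]     # local block multiply-accumulate
        # single-step shifts with wraparound: A left by one, B up by one
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A, B = rng.standard_normal((6, 6)), rng.standard_normal((6, 6))
    assert np.allclose(cannon_matmul(A, B, 3), A @ B)
```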
  • 49. Matrix-Matrix Multiplication: Cannon's Algorithm • In the alignment step, since the maximum distance over which a block shifts is √p – 1, the two shift operations require a total of 2(ts + twn²/p) time. • Each of the single-step shifts in the compute-and-shift phase of the algorithm takes ts + twn²/p time. • The computation time for the √p multiplications of (n/√p) x (n/√p) matrices is n³/p. • The parallel time is approximately TP = n³/p + 2√p ts + 2twn²/√p. • The cost-optimality and isoefficiency of the algorithm are identical to those of the first algorithm, except that this one is memory optimal.
  • 50. Matrix-Matrix Multiplication: DNS Algorithm • Uses a 3-D partitioning. • Visualize the matrix multiplication algorithm as a cube: matrices A and B come in on two orthogonal faces and the result C comes out of the other orthogonal face. • Each internal node in the cube represents a single add-multiply operation (and thus the Θ(n³) complexity). • The DNS algorithm partitions this cube using a 3-D block scheme.
  • 51. Matrix-Matrix Multiplication: DNS Algorithm • Assume an n x n x n mesh of processors. • Move the columns of A and rows of B and perform broadcast. • Each processor computes a single add-multiply. • This is followed by an accumulation along the C dimension. • Since each add-multiply takes constant time and the accumulation and broadcast take log n time, the total runtime is Θ(log n). • This is not cost optimal. It can be made cost optimal by using n/log n processors along the direction of accumulation.
  • 52. Matrix-Matrix Multiplication: DNS Algorithm The communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes.
  • 53. Matrix-Matrix Multiplication: DNS Algorithm Using fewer than n3 processors. • Assume that the number of processes p is equal to q3 for some q < n. • The two matrices are partitioned into blocks of size (n/q) x(n/q). • Each matrix can thus be regarded as a q x q two-dimensional square array of blocks. • The algorithm follows from the previous one, except, in this case, we operate on blocks rather than on individual elements.
  • 54. Matrix-Matrix Multiplication: DNS Algorithm Using fewer than n³ processors. • The first one-to-one communication step is performed for both A and B, and takes ts + tw(n/q)² time for each matrix. • The two one-to-all broadcasts take (ts + tw(n/q)²) log q time for each matrix. • The reduction takes (ts + tw(n/q)²) log q time. • Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time. • The parallel time is approximated by TP = n³/p + (ts + twn²/p^(2/3)) log p. • The isoefficiency function is Θ(p log³ p).
  • 55. Solving a System of Linear Equations • Consider the problem of solving linear equations of the kind: a0,0x0 + a0,1x1 + … + a0,n-1xn-1 = b0, a1,0x0 + a1,1x1 + … + a1,n-1xn-1 = b1, …, an-1,0x0 + an-1,1x1 + … + an-1,n-1xn-1 = bn-1. • This is written as Ax = b, where A is an n x n matrix with A[i, j] = ai,j, b is an n x 1 vector [b0, b1, …, bn-1]T, and x is the vector of unknowns.
  • 56. Solving a System of Linear Equations Two steps in the solution are: reduction to triangular form, and back-substitution. The triangular form is an upper-triangular system with a unit diagonal, i.e., xi + ui,i+1xi+1 + … + ui,n-1xn-1 = yi for 0 ≤ i < n. We write this as Ux = y. A commonly used method for transforming a given matrix into an upper-triangular matrix is Gaussian elimination.
  • 58. Gaussian Elimination • The computation has three nested loops - in the kth iteration of the outer loop, the algorithm performs (n-k)² computations. Summing from k = 1..n, we have roughly n³/3 multiplications-subtractions. A typical computation in Gaussian elimination.
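As a concrete reference point (an added sketch, not the deck's own pseudocode), the serial elimination loop can be written as below: the division step normalizes row k, and the elimination step subtracts multiples of row k from the rows below it, yielding the unit upper-triangular factor U and the transformed right-hand side y.

```python
import numpy as np

# Serial Gaussian elimination sketch (no pivoting), reducing A to the unit
# upper-triangular factor U and transforming b into y, so that Ux = y.
def gaussian_eliminate(A, b):
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = A.shape[0]
    for k in range(n):
        # division (normalization) step: scale row k so that A[k, k] == 1
        b[k] /= A[k, k]
        A[k, k + 1:] /= A[k, k]
        A[k, k] = 1.0
        # elimination step: subtract multiples of row k from the rows below it
        for i in range(k + 1, n):
            b[i] -= A[i, k] * b[k]
            A[i, k + 1:] -= A[i, k] * A[k, k + 1:]
            A[i, k] = 0.0
    return A, b   # U (unit upper triangular) and y

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((5, 5)) + 5 * np.eye(5)   # keeps pivots away from zero
    b = rng.standard_normal(5)
    U, y = gaussian_eliminate(A, b)
    # row operations preserve the solution: Ux = y has the same solution as Ax = b
    assert np.allclose(np.linalg.solve(A, b), np.linalg.solve(U, y))
```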
  • 59. Parallel Gaussian Elimination • Assume p = n with each row assigned to a processor. • The first step of the algorithm normalizes the row. This is a serial operation and takes time (n-k) in the kth iteration. • In the second step, the normalized row is broadcast to all the processors. This takes time (ts + tw(n-k-1)) log n. • Each processor can independently eliminate this row from its own. This requires (n-k-1) multiplications and subtractions. • The total parallel time can be computed by summing from k = 1 … n-1 as TP = 3/2 n(n-1) + tsn log n + 1/2 twn(n-1) log n. • The formulation is not cost optimal because of the tw term.
  • 60. Parallel Gaussian Elimination Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 x 8 matrix partitioned rowwise among eight processes.
  • 61. Parallel Gaussian Elimination: Pipelined Execution • In the previous formulation, the (k+1)st iteration starts only after all the computation and communication for the kth iteration is complete. • In the pipelined version, there are three steps - normalization of a row, communication, and elimination. These steps are performed in an asynchronous fashion. • A processor Pk waits to receive and eliminate all rows prior to k. • Once it has done this, it forwards its own row to processor Pk+1.
  • 62. Parallel Gaussian Elimination: Pipelined Execution Pipelined Gaussian elimination on a 5 x 5 matrix partitioned with one row per process.
  • 63. Parallel Gaussian Elimination: Pipelined Execution • The total number of steps in the entire pipelined procedure is Θ(n). • In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row. • The parallel time is therefore O(n2 ) . • This is cost optimal.
  • 64. Parallel Gaussian Elimination: Pipelined Execution The communication in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix distributed among four processes using block 1-D partitioning.
  • 65. Parallel Gaussian Elimination: Block 1D with p < n • The above algorithm can be easily adapted to the case when p < n. • In the kth iteration, a processor with all rows belonging to the active part of the matrix performs (n – k – 1)n/p multiplications and subtractions. • In the pipelined version, for n > p, computation dominates communication. • The parallel time is approximately n³/p. • While the algorithm is cost optimal, the cost of the parallel algorithm is higher than the sequential run time by a factor of 3/2.
  • 66. Parallel Gaussian Elimination: Block 1D with p < n Computation load on different processes in block and cyclic 1-D partitioning of an 8 x 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3.
  • 67. Parallel Gaussian Elimination: Block 1D with p < n • The load imbalance problem can be alleviated by using a cyclic mapping. • In this case, other than processing of the last p rows, there is no load imbalance. • This corresponds to a cumulative load imbalance overhead of O(n2 p) (instead of O(n3 ) in the previous case).
  • 68. Gaussian Elimination with Partial Pivoting • For numerical stability, one generally uses partial pivoting. • In the kth iteration, we select a column i (called the pivot column) such that A[k, i] is the largest in magnitude among all A[k, j] with k ≤ j < n. • The kth and the ith columns are interchanged. • Simple to implement with row-partitioning and does not add overhead since the division step takes the same time as computing the max. • Column-partitioning, however, requires a global reduction, adding a log p term to the overhead. • Pivoting precludes the use of pipelining.
  • 69. Gaussian Elimination with Partial Pivoting: 2-D Partitioning • Partial pivoting restricts use of pipelining, resulting in performance loss. • This loss can be alleviated by restricting pivoting to specific columns. • Alternately, we can use faster algorithms for broadcast.
  • 70. Solving a Triangular System: Back-Substitution • The upper triangular matrix U undergoes back-substitution to determine the vector x. A serial algorithm for back-substitution.
  • 71. Solving a Triangular System: Back-Substitution • The algorithm performs approximately n2 /2 multiplications and subtractions. • Since complexity of this part is asymptotically lower, we should optimize the data distribution for the factorization part. • Consider a rowwise block 1-D mapping of the n x n matrix U with vector y distributed uniformly. • The value of the variable solved at a step can be pipelined back. • Each step of a pipelined implementation requires a constant amount of time for communication and Θ(n/p) time for computation. • The parallel run time of the entire algorithm is Θ(n2 /p).
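A short added sketch of the serial kernel follows; it pairs with the Gaussian-elimination sketch earlier and assumes the unit-diagonal upper-triangular form Ux = y produced there, solving for x from the last variable upward with roughly n²/2 operations.

```python
import numpy as np

# Serial back-substitution sketch for a unit upper-triangular system Ux = y.
def back_substitute(U, y):
    n = U.shape[0]
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):
        # x[k] = y[k] minus the contribution of the already-solved variables
        x[k] = y[k] - U[k, k + 1:] @ x[k + 1:]
    return x

if __name__ == "__main__":
    U = np.triu(np.random.default_rng(3).standard_normal((4, 4)), k=1) + np.eye(4)
    y = np.arange(4.0)
    assert np.allclose(U @ back_substitute(U, y), y)
```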
  • 72. Solving a Triangular System: Back-Substitution • If the matrix is partitioned by using 2-D partitioning on a logical mesh of processes, and the elements of the vector are distributed along one of the columns of the process mesh, then only the processes containing the vector perform any computation. • Using pipelining to communicate the appropriate elements of U to the process containing the corresponding elements of y for the substitution step (line 7), the algorithm can be executed in Θ(n²/√p) time. • While this is not cost optimal, since this does not dominate the overall computation, the cost optimality is determined by the factorization.
  • 73. Sorting Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.
  • 74. Topic Overview • Issues in Sorting on Parallel Computers • Sorting Networks • Bubble Sort and its Variants • Quicksort • Bucket and Sample Sort • Other Sorting Algorithms
  • 75. Sorting: Overview • One of the most commonly used and well-studied kernels. • Sorting can be comparison-based or noncomparison-based. • The fundamental operation of comparison-based sorting is compare- exchange. • The lower bound on any comparison-based sort of n numbers is Θ(nlog n) . • We focus here on comparison-based sorting algorithms.
  • 76. Sorting: Basics What is a parallel sorted sequence? Where are the input and output lists stored? • We assume that the input and output lists are distributed. • The sorted list is partitioned with the property that each partitioned list is sorted and each element in processor Pi's list is less than that in Pj's list if i < j.
  • 77. Sorting: Parallel Compare Exchange Operation A parallel compare-exchange operation. Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai,aj}, and Pj keeps max{ai, aj}.
  • 78. Sorting: Basics What is the parallel counterpart to a sequential comparator? • If each processor has one element, the compare-exchange operation stores the smaller element at the processor with the smaller id. This can be done in ts + tw time. • If we have more than one element per processor, we call this operation a compare-split. Assume each of two processors has n/p elements. • After the compare-split operation, the smaller n/p elements are at processor Pi and the larger n/p elements at Pj, where i < j. • The time for a compare-split operation is (ts + twn/p), assuming that the two partial lists were initially sorted.
  • 79. Sorting: Parallel Compare Split Operation A compare-split operation. Each process sends its block of size n/p to the other process. Each process merges the received block with its own block and retains only the appropriate half of the merged block. In this example, process Pi retains the smaller elements and process Pj retains the larger elements.
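For concreteness, here is a small added Python sketch of compare-split (the merged result is simply re-sorted here for brevity rather than merged in linear time): the lower-ranked process keeps the smaller half and the higher-ranked process keeps the larger half.

```python
# Sketch of the compare-split operation between processes Pi and Pj (i < j):
# each holds a block of n/p elements; Pi keeps the smaller half, Pj the larger half.
def compare_split(block_i, block_j):
    merged = sorted(block_i + block_j)      # stand-in for merging the two sorted blocks
    half = len(block_i)
    return merged[:half], merged[half:]     # (kept by Pi, kept by Pj)

if __name__ == "__main__":
    lo, hi = compare_split([1, 6, 8, 11], [2, 3, 7, 13])
    print(lo, hi)   # [1, 2, 3, 6] [7, 8, 11, 13]
```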
  • 80. Sorting Networks • Networks of comparators designed specifically for sorting. • A comparator is a device with two inputs x and y and two outputs x' and y'. For an increasing comparator, x' = min{x,y} and y' = max{x,y}; a decreasing comparator does the reverse. • We denote an increasing comparator by ⊕ and a decreasing comparator by Ө. • The speed of the network is proportional to its depth.
  • 81. Sorting Networks: Comparators A schematic representation of comparators: (a) an increasing comparator, and (b) a decreasing comparator.
  • 82. Sorting Networks A typical sorting network. Every sorting network is made up of a series of columns, and each column contains a number of comparators connected in parallel.
  • 83. Sorting Networks: Bitonic Sort • A bitonic sorting network sorts n elements in Θ(log² n) time. • A bitonic sequence has two tones - increasing and decreasing, or vice versa. Any cyclic rotation of such a sequence is also considered bitonic. • 〈1,2,4,7,6,0〉 is a bitonic sequence, because it first increases and then decreases. 〈8,9,2,1,0,4〉 is another bitonic sequence, because it is a cyclic shift of 〈0,4,8,9,2,1〉. • The kernel of the network is the rearrangement of a bitonic sequence into a sorted sequence.
  • 84. Sorting Networks: Bitonic Sort• Let s = 〈a0,a1,…,an-1〉 be a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ an/2-1 and an/2 ≥ an/2+1 ≥ ··· ≥ an-1. • Consider the following subsequences of s: s1 = 〈min{a0,an/2},min{a1,an/2+1},…,min{an/2-1,an-1}〉 s2 = 〈max{a0,an/2},max{a1,an/2+1},…,max{an/2-1,an-1}〉 (1) • Note that s1 and s2 are both bitonic and each element of s1 is less than every element in s2. • We can apply the procedure recursively on s1 and s2 to get the sorted sequence.
  • 85. Sorting Networks: Bitonic Sort Merging a 16-element bitonic sequence through a series of log 16 bitonic splits.
  • 86. Sorting Networks: Bitonic Sort• We can easily build a sorting network to implement this bitonic merge algorithm. • Such a network is called a bitonic merging network. • The network contains log n columns. Each column contains n/2 comparators and performs one step of the bitonic merge. • We denote a bitonic merging network with n inputs by ⊕BM[n]. • Replacing the ⊕ comparators by Ө comparators results in a decreasing output sequence; such a network is denoted by ӨBM[n].
  • 87. Sorting Networks: Bitonic Sort A bitonic merging network for n = 16. The input wires are numbered 0,1,…, n - 1, and the binary representation of these numbers is shown. Each column of comparators is drawn separately; the entire figure represents a ⊕BM[16] bitonic merging network. The network takes a bitonic sequence and outputs it in sorted order.
  • 88. Sorting Networks: Bitonic Sort How do we sort an unsorted sequence using a bitonic merge? • We must first build a single bitonic sequence from the given sequence. • A sequence of length 2 is a bitonic sequence. • A bitonic sequence of length 4 can be built by sorting the first two elements using ⊕BM[2] and the next two using ӨBM[2]. • This process can be repeated to generate larger bitonic sequences.
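To make the construction concrete, the following serial Python sketch (added for illustration; it mirrors the ⊕BM/ӨBM structure but does not model the network or any parallelism) builds bitonic sequences recursively and merges them with bitonic splits. It assumes the input length is a power of two.

```python
# Serial sketch of bitonic sort: recursively build bitonic sequences, then merge
# them with bitonic splits. Assumes len(a) is a power of two.
def bitonic_merge(a, ascending=True):
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    # bitonic split: pairwise min/max between the two halves
    lo = [min(a[i], a[i + half]) if ascending else max(a[i], a[i + half]) for i in range(half)]
    hi = [max(a[i], a[i + half]) if ascending else min(a[i], a[i + half]) for i in range(half)]
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

def bitonic_sort(a, ascending=True):
    n = len(a)
    if n == 1:
        return a
    # sort the first half increasing and the second half decreasing -> bitonic sequence
    first = bitonic_sort(a[:n // 2], True)
    second = bitonic_sort(a[n // 2:], False)
    return bitonic_merge(first + second, ascending)

if __name__ == "__main__":
    data = [10, 20, 5, 9, 3, 8, 12, 14, 90, 0, 60, 40, 23, 35, 95, 18]
    assert bitonic_sort(data) == sorted(data)
```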
  • 89. Sorting Networks: Bitonic Sort A schematic representation of a network that converts an input sequence into a bitonic sequence. In this example, ⊕BM[k] and ӨBM[k] denote bitonic merging networks of input size k that use ⊕ and Ө comparators, respectively. The last merging network (⊕BM[16]) sorts the input. In this example, n = 16.
  • 90. Sorting Networks: Bitonic Sort The comparator network that transforms an input sequence of 16 unordered numbers into a bitonic sequence.
  • 91. Sorting Networks: Bitonic Sort• The depth of the network is Θ(log2 n). • Each stage of the network contains n/2 comparators. A serial implementation of the network would have complexity Θ(nlog2 n).
  • 92. Mapping Bitonic Sort to Hypercubes• Consider the case of one item per processor. The question becomes one of how the wires in the bitonic network should be mapped to the hypercube interconnect. • Note from our earlier examples that the compare-exchange operation is performed between two wires only if their labels differ in exactly one bit! • This implies a direct mapping of wires to processors. All communication is nearest neighbor!
  • 93. Mapping Bitonic Sort to Hypercubes Communication during the last stage of bitonic sort. Each wire is mapped to a hypercube process; each connection represents a compare- exchange between processes.
  • 94. Mapping Bitonic Sort to Hypercubes Communication characteristics of bitonic sort on a hypercube. During each stage of the algorithm, processes communicate along the dimensions shown.
  • 95. Mapping Bitonic Sort to Hypercubes Parallel formulation of bitonic sort on a hypercube with n = 2d processes.
  • 96. Mapping Bitonic Sort to Hypercubes • During each step of the algorithm, every process performs a compare-exchange operation (single nearest neighbor communication of one word). • Since each step takes Θ(1) time, the parallel time is Tp = Θ(log2 n) (2) • This algorithm is cost optimal w.r.t. its serial counterpart, but not w.r.t. the best sorting algorithm.
  • 97. Mapping Bitonic Sort to Meshes • The connectivity of a mesh is lower than that of a hypercube, so we must expect some overhead in this mapping. • Consider the row-major shuffled mapping of wires to processors.
  • 98. Mapping Bitonic Sort to Meshes Different ways of mapping the input wires of the bitonic sorting network to a mesh of processes: (a) row-major mapping, (b) row-major snakelike mapping, and (c) row-major shuffled mapping.
  • 99. Mapping Bitonic Sort to Meshes The last stage of the bitonic sort algorithm for n = 16 on a mesh, using the row-major shuffled mapping. During each step, process pairs compare-exchange their elements. Arrows indicate the pairs of processes that perform compare-exchange operations.
  • 100. Mapping Bitonic Sort to Meshes • In the row-major shuffled mapping, wires that differ at the ith least-significant bit are mapped onto mesh processes that are 2^⌊(i-1)/2⌋ communication links away. • The total amount of communication performed by each process is ∑i=1..log n ∑j=1..i 2^⌊(j-1)/2⌋ ≈ 7√n, or Θ(√n). The total computation performed by each process is Θ(log² n). • The parallel runtime is therefore Θ(√n). • This is not cost optimal.
  • 101. Block of Elements Per Processor • Each process is assigned a block of n/p elements. • The first step is a local sort of the local block. • Each subsequent compare-exchange operation is replaced by a compare-split operation. • We can effectively view the bitonic network as having (1 + log p) (log p)/2 steps.
  • 102. Block of Elements Per Processor: Hypercube • Initially the processes sort their n/p elements (using merge sort) in time Θ((n/p)log(n/p)) and then perform Θ(log² p) compare-split steps. • The parallel run time of this formulation is TP = Θ((n/p)log(n/p)) + Θ((n/p)log² p). • Comparing to an optimal sort, the algorithm can efficiently use up to p = Θ(2^√(log n)) processes. • The isoefficiency function due to both communication and extra work is Θ(p^(log p) log² p).
  • 103. Block of Elements Per Processor: Mesh • The parallel runtime in this case is TP = Θ((n/p)log(n/p)) + Θ((n/p)log² p) + Θ(n/√p), the last term coming from the mesh communication. • This formulation can efficiently use up to p = Θ(log² n) processes. • The isoefficiency function is Θ(√p 2^√p).
  • 104. Performance of Parallel Bitonic Sort The performance of parallel formulations of bitonic sort for n elements on p processes.
  • 105. Bubble Sort and its Variants The sequential bubble sort algorithm compares and exchanges adjacent elements in the sequence to be sorted: Sequential bubble sort algorithm.
  • 106. Bubble Sort and its Variants • The complexity of bubble sort is Θ(n2 ). • Bubble sort is difficult to parallelize since the algorithm has no concurrency. • A simple variant, though, uncovers the concurrency.
  • 107. Odd-Even Transposition Sequential odd-even transposition sort algorithm.
  • 108. Odd-Even Transposition Sorting n = 8 elements, using the odd-even transposition sort algorithm. During each phase, n = 8 elements are compared.
  • 109. Odd-Even Transposition • After n phases of odd-even exchanges, the sequence is sorted. • Each phase of the algorithm (either odd or even) requires Θ(n) comparisons. • Serial complexity is Θ(n2 ).
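A minimal serial sketch of odd-even transposition sort (added for illustration; the parallel formulation discussed next replaces each comparison with a compare-exchange or compare-split between neighboring processes):

```python
# Serial odd-even transposition sort: n phases alternating between
# compare-exchanges of (even, odd) and (odd, even) index pairs.
def odd_even_transposition_sort(a):
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1      # even phase, then odd phase
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

if __name__ == "__main__":
    data = [3, 2, 3, 8, 5, 6, 4, 1]
    assert odd_even_transposition_sort(data) == sorted(data)
```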
  • 110. Parallel Odd-Even Transposition • Consider the one item per processor case. • There are n iterations, in each iteration, each processor does one compare-exchange. • The parallel run time of this formulation is Θ(n). • This is cost optimal with respect to the base serial algorithm but not the optimal one.
  • 111. Parallel Odd-Even Transposition Parallel formulation of odd-even transposition.
  • 112. Parallel Odd-Even Transposition • Consider a block of n/p elements per processor. • The first step is a local sort. • In each subsequent step, the compare-exchange operation is replaced by the compare-split operation. • The parallel run time of the formulation is TP = Θ((n/p)log(n/p)) + Θ(n), since the p compare-split phases each take Θ(n/p) computation and communication time.
  • 113. Parallel Odd-Even Transposition • The parallel formulation is cost-optimal for p = O(log n). • The isoefficiency function of this parallel formulation is Θ(p 2^p).
  • 114. Shellsort • Let n be the number of elements to be sorted and p be the number of processes. • During the first phase, processes that are far away from each other in the array compare-split their elements. • During the second phase, the algorithm switches to an odd-even transposition sort.
  • 115. Parallel Shellsort • Initially, each process sorts its block of n/p elements internally. • Each process is now paired with its corresponding process in the reverse order of the array. That is, process Pi, where i < p/2, is paired with process Pp-i-1. • A compare-split operation is performed. • The processes are split into two groups of size p/2 each and the process repeated in each group.
  • 116. Parallel Shellsort An example of the first phase of parallel shellsort on an eight-process array.
  • 117. Parallel Shellsort • Each process performs d = log p compare-split operations. • With O(p) bisection width, each communication can be performed in time Θ(n/p) for a total time of Θ((n log p)/p). • In the second phase, l odd and even phases are performed, each requiring time Θ(n/p). • The parallel run time of the algorithm is TP = Θ((n/p)log(n/p)) + Θ((n log p)/p) + Θ(ln/p).
  • 118. Quicksort • Quicksort is one of the most common sorting algorithms for sequential computers because of its simplicity, low overhead, and optimal average complexity. • Quicksort selects one of the entries in the sequence to be the pivot and divides the sequence into two - one with all elements less than the pivot and other greater. • The process is recursively applied to each of the sublists.
  • 120. Quicksort Example of the quicksort algorithm sorting a sequence of size n = 8.
  • 121. Quicksort • The performance of quicksort depends critically on the quality of the pivot. • In the best case, the pivot divides the list in such a way that the larger of the two lists does not have more than αn elements (for some constant α). • In this case, the complexity of quicksort is O(nlog n).
  • 122. Parallelizing Quicksort • Let's start with recursive decomposition - the list is partitioned serially and each of the subproblems is handled by a different processor. • The time for this algorithm is lower-bounded by Ω(n)! • Can we parallelize the partitioning step - in particular, if we can use n processors to partition a list of length n around a pivot in O(1) time, we have a winner. • This is difficult to do on real machines, though.
  • 123. Parallelizing Quicksort: PRAM Formulation• We assume a CRCW (concurrent read, concurrent write) PRAM with concurrent writes resulting in an arbitrary write succeeding. • The formulation works by creating pools of processors. Every processor is assigned to the same pool initially and has one element. • Each processor attempts to write its element to a common location (for the pool). • Each processor tries to read back the location. If the value read back is greater than the processor's value, it assigns itself to the `left' pool, else, it assigns itself to the `right' pool. • Each pool performs this operation recursively. • Note that the algorithm generates a tree of pivots. The depth of the tree is the expected parallel runtime. The average value is O(log n).
  • 124. Parallelizing Quicksort: PRAM Formulation A binary tree generated by the execution of the quicksort algorithm. Each level of the tree represents a different array-partitioning iteration. If pivot selection is optimal, then the height of the tree is Θ(log n), which is also the number of iterations.
  • 125. Parallelizing Quicksort: PRAM Formulation The execution of the PRAM algorithm on the array shown in (a).
  • 126. Parallelizing Quicksort: Shared Address Space Formulation • Consider a list of size n equally divided across p processors. • A pivot is selected by one of the processors and made known to all processors. • Each processor partitions its list into two, say Li and Ui, based on the selected pivot. • All of the Li lists are merged and all of the Ui lists are merged separately. • The set of processors is partitioned into two (in proportion of the size of lists L and U). The process is recursively applied to each of the lists.
  • 127. Shared Address Space Formulation
  • 128. Parallelizing Quicksort: Shared Address Space Formulation • The only thing we have not described is the global reorganization (merging) of local lists to form L and U. • The problem is one of determining the right location for each element in the merged list. • Each processor computes the number of elements locally less than and greater than pivot. • It computes two sum-scans to determine the starting location for its elements in the merged L and U lists. • Once it knows the starting locations, it can write its elements safely.
  • 129. Parallelizing Quicksort: Shared Address Space Formulation Efficient global rearrangement of the array.
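The added sketch below simulates this rearrangement step sequentially: each "process" partitions its local block around the pivot, and exclusive prefix sums (sum-scans) over the per-process counts give the offsets at which each process writes into the merged L and U arrays. The helper name `global_rearrange` is an assumption for this example.

```python
from itertools import accumulate

# Sketch of the scan-based global rearrangement used in the shared-address-space
# quicksort formulation, simulated sequentially over a list of per-process blocks.
def global_rearrange(blocks, pivot):
    L_parts = [[x for x in b if x <= pivot] for b in blocks]
    U_parts = [[x for x in b if x > pivot] for b in blocks]
    l_counts = [len(part) for part in L_parts]
    u_counts = [len(part) for part in U_parts]
    # exclusive prefix sums (sum-scans) of the counts give each process's write offsets
    l_offsets = [0] + list(accumulate(l_counts))[:-1]
    u_offsets = [0] + list(accumulate(u_counts))[:-1]
    L = [None] * sum(l_counts)
    U = [None] * sum(u_counts)
    for part, off in zip(L_parts, l_offsets):
        L[off:off + len(part)] = part           # each process writes independently
    for part, off in zip(U_parts, u_offsets):
        U[off:off + len(part)] = part
    return L, U

if __name__ == "__main__":
    blocks = [[7, 13, 18, 2], [17, 1, 14, 20], [6, 10, 15, 9], [3, 4, 22, 11]]
    L, U = global_rearrange(blocks, pivot=10)
    assert sorted(L + U) == sorted(sum(blocks, [])) and max(L) <= 10 < min(U)
```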
  • 130. Parallelizing Quicksort: Shared Address Space Formulation • The parallel time depends on the split and merge time, and the quality of the pivot. • The latter is an issue independent of parallelism, so we focus on the first aspect, assuming ideal pivot selection. • The algorithm executes in four steps: (i) determine and broadcast the pivot; (ii) locally rearrange the array assigned to each process; (iii) determine the locations in the globally rearranged array that the local elements will go to; and (iv) perform the global rearrangement. • The first step takes time Θ(log p), the second, Θ(n/p) , the third, Θ(log p) , and the fourth, Θ(n/p). • The overall complexity of splitting an n-element array is Θ(n/p) + Θ(log p).
  • 131. Parallelizing Quicksort: Shared Address Space Formulation • The process recurses until there are p lists, at which point the lists are sorted locally. • Therefore, the total parallel time is TP = Θ((n/p)log(n/p)) + Θ((n/p)log p) + Θ(log² p). • The corresponding isoefficiency is Θ(p log² p) due to the broadcast and scan operations.
  • 132. Parallelizing Quicksort: Message Passing Formulation • A simple message passing formulation is based on the recursive halving of the machine. • Assume that each processor in the lower half of a p processor ensemble is paired with a corresponding processor in the upper half. • A designated processor selects and broadcasts the pivot. • Each processor splits its local list into two lists, one less (Li), and other greater (Ui) than the pivot. • A processor in the low half of the machine sends its list Ui to the paired processor in the other half. The paired processor sends its list Li. • It is easy to see that after this step, all elements less than the pivot are in the low half of the machine and all elements greater than the pivot are in the high half.
  • 133. Parallelizing Quicksort: Message Passing Formulation • The above process is recursed until each processor has its own local list, which is sorted locally. • The time for a single reorganization is Θ(log p) for broadcasting the pivot element, Θ(n/p) for splitting the locally assigned portion of the array, Θ(n/p) for exchange and local reorganization. • We note that this time is identical to that of the corresponding shared address space formulation. • It is important to remember that the reorganization of elements is a bandwidth sensitive operation.
  • 134. Graph Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003
  • 135. Topic Overview • Definitions and Representation • Minimum Spanning Tree: Prim's Algorithm • Single-Source Shortest Paths: Dijkstra's Algorithm • All-Pairs Shortest Paths • Transitive Closure • Connected Components • Algorithms for Sparse Graphs
  • 136. Definitions and Representation • An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite set of edges. • An edge e ∈ E is an unordered pair (u,v), where u,v ∈ V. • In a directed graph, the edge e is an ordered pair (u,v). An edge (u,v) is incident from vertex u and is incident to vertex v. • A path from a vertex v to a vertex u is a sequence <v0,v1,v2,…,vk> of vertices where v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1,…, k-1. • The length of a path is defined as the number of edges in the path.
  • 137. Definitions and Representation a) An undirected graph and (b) a directed graph.
  • 138. Definitions and Representation • An undirected graph is connected if every pair of vertices is connected by a path. • A forest is an acyclic graph, and a tree is a connected acyclic graph. • A graph that has weights associated with each edge is called a weighted graph.
  • 139. Definitions and Representation • Graphs can be represented by their adjacency matrix or an edge (or vertex) list. • Adjacency matrices have a value ai,j = 1 if nodes i and j share an edge; 0 otherwise. In case of a weighted graph, ai,j = wi,j, the weight of the edge. • The adjacency list representation of a graph G = (V,E) consists of an array Adj[1..|V|] of lists. Each list Adj[v] is a list of all vertices adjacent to v. • For a graph with n nodes, adjacency matrices take Θ(n²) space and adjacency lists take Θ(|E|) space.
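A tiny added example contrasting the two representations on a made-up four-vertex graph (the edge set here is purely illustrative):

```python
# Adjacency-matrix and adjacency-list representations of the same small
# undirected graph with edge set {(0,1), (0,2), (1,2), (2,3)}.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

adj_matrix = [[0] * n for _ in range(n)]     # Theta(n^2) space
adj_list = {v: [] for v in range(n)}         # Theta(|E|) space
for u, v in edges:
    adj_matrix[u][v] = adj_matrix[v][u] = 1
    adj_list[u].append(v)
    adj_list[v].append(u)

print(adj_matrix)
print(adj_list)
```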
  • 140. Definitions and Representation An undirected graph and its adjacency matrix representation. An undirected graph and its adjacency list representation.
  • 141. Minimum Spanning Tree • A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G. • In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph. • A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight.
  • 142. Minimum Spanning Tree An undirected graph and its minimum spanning tree.
  • 143. Minimum Spanning Tree: Prim's Algorithm • Prim's algorithm for finding an MST is a greedy algorithm. • Start by selecting an arbitrary vertex, include it into the current MST. • Grow the current MST by inserting into it the vertex closest to one of the vertices already in current MST.
  • 144. Minimum Spanning Tree: Prim's Algorithm Prim's minimum spanning tree algorithm.
  • 145. Minimum Spanning Tree: Prim's Algorithm Prim's sequential minimum spanning tree algorithm.
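For reference, here is an added serial sketch of Prim's algorithm (not the book's pseudocode); it assumes an adjacency-matrix input with math.inf marking absent edges, and maintains the distance array d that the parallel formulation partitions across processes.

```python
import math

# Serial Prim's algorithm on an adjacency matrix W (math.inf where no edge exists).
# Returns the MST weight and the parent of each vertex.
def prim_mst(W, root=0):
    n = len(W)
    in_mst = [False] * n
    d = W[root][:]                 # d[v] = lightest edge connecting v to the current MST
    parent = [root] * n
    in_mst[root] = True
    total = 0
    for _ in range(n - 1):
        # pick the vertex outside the MST with the minimum d value
        u = min((v for v in range(n) if not in_mst[v]), key=lambda v: d[v])
        in_mst[u] = True
        total += d[u]
        # update d for the remaining vertices using the edges incident on u
        for v in range(n):
            if not in_mst[v] and W[u][v] < d[v]:
                d[v], parent[v] = W[u][v], u
    return total, parent

if __name__ == "__main__":
    inf = math.inf
    W = [[0, 1, 3, inf],
         [1, 0, 1, 4],
         [3, 1, 0, 2],
         [inf, 4, 2, 0]]
    weight, parent = prim_mst(W)
    print(weight, parent)   # MST weight 4 with edges (0,1), (1,2), (2,3)
```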
  • 146. Prim's Algorithm: Parallel Formulation • The algorithm works in n outer iterations - it is hard to execute these iterations concurrently. • The inner loop is relatively easy to parallelize. Let p be the number of processes, and let n be the number of vertices. • The adjacency matrix is partitioned in a 1-D block fashion, with distance vector d partitioned accordingly. • In each step, a processor selects the locally closest node, followed by a global reduction to select globally closest node. • This node is inserted into MST, and the choice broadcast to all processors. • Each processor updates its part of the d vector locally.
  • 147. Prim's Algorithm: Parallel Formulation The partitioning of the distance array d and the adjacency matrix A among p processes.
  • 148. Prim's Algorithm: Parallel Formulation • The cost to select the minimum entry is O(n/p + log p). • The cost of a broadcast is O(log p). • The cost of the local update of the d vector is O(n/p). • The parallel time per iteration is O(n/p + log p). • The total parallel time is given by O(n²/p + n log p). • The corresponding isoefficiency is O(p² log² p).