All-Reduce and Prefix-Sum Operations
• In all-reduce, each node starts with a buffer of size m and the final
results of the operation are identical buffers of size m on each node
that are formed by combining the original p buffers using an
associative operator.
• Semantically identical to an all-to-one reduction followed by a one-to-all
broadcast, but that formulation is not the most efficient. A better approach
uses the pattern of all-to-all broadcast instead; the only difference is that
the message size does not increase here. Time for this operation is
(ts + twm) log p.
• Different from all-to-all reduction, in which p simultaneous all-to-one
reductions take place, each with a different destination for the
result.
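As an illustration (not part of the original slides), here is a minimal Python sketch that simulates this all-to-all-broadcast style all-reduce on p = 2^d nodes: in step j, node i combines its partial result with that of node i XOR 2^j, so the message size stays m and the time is (ts + twm) log p.

    import operator

    def all_reduce(values, op=operator.add):
        # values[i] is node i's local buffer; p must be a power of two
        p = len(values)
        d = p.bit_length() - 1
        buf = list(values)
        for j in range(d):                      # one step per hypercube dimension
            new = buf[:]
            for i in range(p):
                partner = i ^ (1 << j)          # exchange with neighbor in dimension j
                new[i] = op(buf[i], buf[partner])
            buf = new
        return buf                              # identical combined result on every node

    print(all_reduce([1, 2, 3, 4, 5, 6, 7, 8]))   # [36, 36, 36, 36, 36, 36, 36, 36]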
The Prefix-Sum Operation
• Given p numbers n0, n1, …, np-1 (one on each node), the problem is to
compute the sums sk = ∑i=0..k ni for all k between 0 and p-1.
• Initially, nk resides on the node labeled k, and at the end of the
procedure, the same node holds sk.
The Prefix-Sum Operation
Computing prefix sums on an eight-node hypercube. At each node, square brackets
show the local prefix sum accumulated in the result buffer and parentheses enclose
the contents of the outgoing message buffer for the next step.
The Prefix-Sum Operation
• The operation can be implemented using the all-to-all broadcast
kernel.
• We must account for the fact that in prefix sums the node with label
k uses information from only the k-node subset whose labels are less
than or equal to k.
• This is implemented using an additional result buffer. The content of
an incoming message is added to the result buffer only if the
message comes from a node with a smaller label than the recipient
node.
• The contents of the outgoing message (denoted by parentheses in
the figure) are updated with every incoming message.
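A minimal Python sketch of this scheme (a simulation, not the book's pseudocode): msg plays the role of the outgoing message buffer, result the role of the result buffer, and only messages arriving from smaller-labeled partners are added into result.

    def prefix_sums(values):
        # values[k] is the number initially on node k; p must be a power of two
        p = len(values)
        d = p.bit_length() - 1
        result = list(values)                   # [.] result buffer on each node
        msg = list(values)                      # (.) outgoing message buffer on each node
        for j in range(d):
            incoming = [msg[k ^ (1 << j)] for k in range(p)]
            for k in range(p):
                partner = k ^ (1 << j)
                msg[k] += incoming[k]           # message buffer always accumulates
                if partner < k:                 # result accepts only smaller labels
                    result[k] += incoming[k]
        return result

    print(prefix_sums([0, 1, 2, 3, 4, 5, 6, 7]))   # [0, 1, 3, 6, 10, 15, 21, 28]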
The Prefix-Sum Operation
Prefix sums on a d-dimensional hypercube.
Scatter and Gather
• In the scatter operation, a single node sends a unique message of
size m to every other node (also called a one-to-all personalized
communication).
• In the gather operation, a single node collects a unique message
from each node.
• While the scatter operation is fundamentally different from
broadcast, the algorithmic structure is similar, except for differences
in message sizes (messages get smaller in scatter and stay constant in
broadcast).
• The gather operation is exactly the inverse of the scatter operation
and can be executed as such.
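A minimal Python sketch of the hypercube scatter (a simulation; the gather is the same exchange pattern run in reverse): the source ships the half of its messages destined for the other half of the cube, and both the active machine and the data held halve at every step.

    def scatter(messages, root=0):
        # messages[i] is the piece destined for node i; p must be a power of two
        p = len(messages)
        d = p.bit_length() - 1
        held = {root: dict(enumerate(messages))}      # node -> {destination: message}
        for j in reversed(range(d)):                  # highest dimension first
            for node in list(held):
                partner = node ^ (1 << j)
                # ship every piece whose destination lies in the partner's subcube
                ship = {dst: m for dst, m in held[node].items()
                        if ((dst >> j) & 1) != ((node >> j) & 1)}
                for dst in ship:
                    del held[node][dst]
                held[partner] = ship
        return [held[i][i] for i in range(p)]          # node i ends up with messages[i]

    print(scatter(["m0", "m1", "m2", "m3", "m4", "m5", "m6", "m7"]))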
Gather and Scatter Operations
Scatter and gather operations.
Example of the Scatter Operation
The scatter operation on an eight-node hypercube.
Cost of Scatter and Gather
• There are log p steps; in each step, the machine size halves and the
data size halves.
• We have the time for this operation to be:
T = ts log p + twm(p – 1).
• This time holds for a linear array as well as a 2-D mesh.
• These times are asymptotically optimal in message size.
All-to-All Personalized Communication
• Each node has a distinct message of size m for every other node.
• This is unlike all-to-all broadcast, in which each node sends the same
message to all other nodes.
• All-to-all personalized communication is also known as total
exchange.
All-to-All Personalized Communication
All-to-all personalized communication.
All-to-All Personalized Communication:
Example
• Consider the problem of transposing a matrix.
• Each processor contains one full row of the matrix.
• The transpose operation in this case is identical to an all-to-all
personalized communication operation.
All-to-All Personalized Communication:
Example
All-to-all personalized communication in transposing a 4 x 4 matrix using four
processes.
All-to-All Personalized Communication
on a Ring
• Each node sends all pieces of data as one consolidated message of
size m(p – 1) to one of its neighbors.
• Each node extracts the information meant for it from the data
received, and forwards the remaining (p – 2) pieces of size m each to
the next node.
• The algorithm terminates in p – 1 steps.
• The size of the message reduces by m at each step.
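The following Python sketch (a plain simulation, with the message {x, y} represented as a tuple (x, y)) mimics this: every node pushes one consolidated message to its right neighbour each step, keeps the piece addressed to itself, and forwards the rest.

    def total_exchange_ring(p):
        # outbox[i] holds the pieces node i still has to pass on; piece = (source, destination)
        outbox = [[(src, dst) for dst in range(p) if dst != src] for src in range(p)]
        inbox = [[] for _ in range(p)]                 # pieces delivered to each node
        for _ in range(p - 1):                         # p - 1 steps in all
            received = [outbox[(i - 1) % p] for i in range(p)]   # from the left neighbour
            for i in range(p):
                inbox[i] += [piece for piece in received[i] if piece[1] == i]
                outbox[i] = [piece for piece in received[i] if piece[1] != i]  # shrinks by one piece
        return inbox

    for node, pieces in enumerate(total_exchange_ring(6)):
        print(node, sorted(pieces))    # node i ends up with the pieces {x, i} from every other node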
All-to-All Personalized Communication
on a Ring
All-to-all personalized communication on a six-node ring. The label of each message is
of the form {x,y}, where x is the label of the node that originally owned the message,
and y is the label of the node that is the final destination of the message. The label
({x1,y1}, {x2,y2},…, {xn,yn}) indicates a message that is formed by concatenating n
individual messages.
All-to-All Personalized Communication
on a Ring: Cost
• We have p – 1 steps in all.
• In step i, the message size is m(p – i).
• The total time is given by:
T = ∑i=1..p-1 (ts + twm(p – i)) = (ts + twmp/2)(p – 1).
• The tw term in this equation can be reduced by a factor of 2 by
communicating messages in both directions.
All-to-All Personalized Communication
on a Mesh
• Each node first groups its p messages according to the columns of
their destination nodes.
• All-to-all personalized communication is performed independently in
each row with clustered messages of size m√p.
• Messages in each node are sorted again, this time according to the
rows of their destination nodes.
• All-to-all personalized communication is performed independently in
each column with clustered messages of size m√p.
All-to-All Personalized Communication
on a Mesh
The distribution of messages at the beginning of each phase of all-to-all personalized
communication on a 3 x 3 mesh. At the end of the second phase, node i has messages ({0,i},…,
{8,i}), where 0 ≤ i ≤ 8. The groups of nodes communicating together in each phase are enclosed
in dotted boundaries.
All-to-All Personalized Communication
on a Mesh: Cost
• Time for the first phase is identical to that in a ring with √p processors,
i.e., (ts + twmp/2)(√p – 1).
• Time in the second phase is identical to the first phase. Therefore, the total
time is twice this time, i.e.,
T = (2ts + twmp)(√p – 1).
• It can be shown that the time for the local rearrangement of messages is
much less than this communication time.
All-to-All Personalized Communication
on a Hypercube
• Generalize the mesh algorithm to log p steps.
• At any stage in all-to-all personalized communication, every node
holds p packets of size m each.
• While communicating in a particular dimension, every node sends
p/2 of these packets (consolidated as one message).
• A node must rearrange its messages locally before each of the log p
communication steps.
All-to-All Personalized Communication
on a Hypercube
An all-to-all personalized communication algorithm on a three-dimensional hypercube.
All-to-All Personalized Communication
on a Hypercube: Cost
• We have log p iterations and mp/2 words are communicated in each
iteration. Therefore, the cost is:
T = (ts + twmp/2) log p.
• This is not optimal!
All-to-All Personalized Communication
on a Hypercube: Optimal Algorithm
• Each node simply performs p – 1 communication steps, exchanging
m words of data with a different node in every step.
• A node must choose its communication partner in each step so that
the hypercube links do not suffer congestion.
• In the jth communication step, node i exchanges data with node (i XOR j).
• In this schedule, all paths in every communication step are
congestion-free, and none of the bidirectional links carry more than
one message in the same direction.
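A short Python sketch of this schedule (simulation only; Mi,j is represented by the string "Mi,j"): in step j, node i hands M[i][i XOR j] directly to node i XOR j, so every step moves exactly one m-word message per node over congestion-free paths.

    def total_exchange_hypercube(p):
        # M[i][j] is the message that starts on node i and is destined for node j
        M = [[f"M{i},{j}" for j in range(p)] for i in range(p)]
        recv = [[None] * p for _ in range(p)]          # recv[j][i]: what node j got from node i
        for i in range(p):
            recv[i][i] = M[i][i]                       # a node keeps its own piece
        for j in range(1, p):                          # p - 1 communication steps
            for i in range(p):
                partner = i ^ j
                recv[partner][i] = M[i][partner]       # i sends M[i][i XOR j] to its partner
        return recv

    for row in total_exchange_hypercube(8):
        print(row)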
All-to-All Personalized Communication
on a Hypercube: Optimal Algorithm
Seven steps in all-to-all personalized communication on an eight-node hypercube.
All-to-All Personalized Communication
on a Hypercube: Optimal Algorithm
A procedure to perform all-to-all personalized communication on a d-
dimensional hypercube. The message Mi,j initially resides on node i and is
destined for node j.
All-to-All Personalized Communication on a
Hypercube: Cost Analysis of Optimal Algorithm
• There are p – 1 steps, and each step involves a non-congesting message
transfer of m words.
• We have:
T = (ts + twm)(p – 1).
• This is asymptotically optimal in message size.
Dense Matrix Algorithms
Ananth Grama, Anshul Gupta,
George Karypis, and Vipin Kumar
To accompany the text “Introduction to Parallel Computing”,
Addison Wesley, 2003.
Topic Overview
• Matrix-Vector Multiplication
• Matrix-Matrix Multiplication
• Solving a System of Linear Equations
Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations involving
matrices and vectors readily lend themselves to data-decomposition.
• Typical algorithms rely on input, output, or intermediate data
decomposition.
• Most algorithms use one- and two-dimensional block, cyclic, and
block-cyclic partitionings.
Matrix-Vector Multiplication
• We aim to multiply a dense n x n matrix A with an n x 1 vector x to
yield the n x 1 result vector y.
• The serial algorithm requires n² multiplications and additions.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n x n matrix is partitioned among n processors, with each
processor storing one complete row of the matrix.
• The n x 1 vector x is distributed such that each process owns one of
its elements.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Multiplication of an n x n matrix with an n x 1 vector using
rowwise block 1-D partitioning. For the one-row-per-process
case, p = n.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Since each process starts with only one element of x , an all-to-all
broadcast is required to distribute all the elements to all the
processes.
• Process Pi now computes y[i] = ∑j=0..n-1 A[i, j] · x[j].
• The all-to-all broadcast and the computation of y[i] both take time
Θ(n) . Therefore, the parallel time is Θ(n) .
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Consider now the case when p < n and we use block 1-D partitioning.
• Each process initially stores n/p complete rows of the matrix and a
portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes and involves
messages of size n/p.
• This is followed by n/p local dot products.
• Thus, the parallel run time of this procedure is
TP = n²/p + ts log p + twn.
This is cost-optimal.
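A small numpy sketch of this p < n case (the all-to-all broadcast is simulated by simply concatenating the vector pieces):

    import numpy as np

    def mat_vec_rowwise(A, x, p):
        rows = np.array_split(A, p)              # each "process" holds n/p rows of A
        x_parts = np.array_split(x, p)           # and n/p entries of x
        x_full = np.concatenate(x_parts)         # stands in for the all-to-all broadcast
        y_parts = [block @ x_full for block in rows]   # n/p local dot products per process
        return np.concatenate(y_parts)

    n, p = 8, 4
    A = np.arange(n * n, dtype=float).reshape(n, n)
    x = np.ones(n)
    assert np.allclose(mat_vec_rowwise(A, x, p), A @ x)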
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Scalability Analysis:
• We know that T0 = pTP – W; therefore, we have
T0 = tsp log p + twnp.
• For isoefficiency, we have W = KT0, where K = E/(1 – E) for desired
efficiency E.
• From this, we have W = O(p²) (from the tw term).
• There is also a bound on isoefficiency because of concurrency. In this
case, p < n, therefore, W = n² = Ω(p²).
• Overall isoefficiency is W = O(p²).
Matrix-Vector Multiplication:
2-D Partitioning
• The n x n matrix is partitioned among n² processors such that each
processor owns a single element.
• The n x 1 vector x is distributed only in the last column of n
processors.
Matrix-Vector Multiplication: 2-D Partitioning
Matrix-vector multiplication with block 2-D partitioning. For the
one-element-per-process case, p = n² if the matrix size is n x n.
Matrix-Vector Multiplication:
2-D Partitioning
• We must first align the vector with the matrix appropriately.
• The first communication step for the 2-D partitioning aligns the
vector x along the principal diagonal of the matrix.
• The second step copies the vector elements from each diagonal
process to all the processes in the corresponding column using n
simultaneous broadcasts among all processors in the column.
• Finally, the result vector is computed by performing an all-to-one
reduction along each row.
Matrix-Vector Multiplication:
2-D Partitioning
• Three basic communication operations are used in this algorithm:
one-to-one communication to align the vector along the main
diagonal, one-to-all broadcast of each vector element among the n
processes of each column, and all-to-one reduction in each row.
• Each of these operations takes Θ(log n) time and the parallel time is
Θ(log n) .
• The cost (process-time product) is Θ(n² log n); hence, the algorithm
is not cost-optimal.
Matrix-Vector Multiplication:
2-D Partitioning
• When using fewer than n² processors, each process owns an
(n/√p) x (n/√p) block of the matrix.
• The vector is distributed in portions of n/√p elements in the last
process-column only.
• In this case, the message sizes for the alignment, broadcast, and
reduction are all n/√p.
• The computation is a product of an (n/√p) x (n/√p) submatrix with a
vector of length n/√p.
Matrix-Vector Multiplication:
2-D Partitioning
• The first alignment step takes time ts + twn/√p.
• The broadcast and reductions each take time (ts + twn/√p) log(√p).
• Local matrix-vector products take time n²/p.
• Total time is TP ≈ n²/p + ts log p + tw(n/√p) log p.
Matrix-Vector Multiplication:
2-D Partitioning
• Scalability Analysis:
• T0 = pTP – W = tsp log p + twn√p log p.
• Equating T0 with W, term by term, for isoefficiency, we have
W = K²tw²p log²p as the dominant term.
• The isoefficiency due to concurrency is O(p).
• The overall isoefficiency is O(p log²p) (due to the network
bandwidth).
• For cost optimality, we have W = n² = Ω(p log²p). For this, we have
p = O(n²/log²n).
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n x n dense, square matrices A
and B to yield the product matrix C =A x B.
• The serial complexity is O(n³).
• We do not consider better serial algorithms (Strassen's method),
although these can be used as serial kernels in the parallel algorithms.
• A useful concept in this case is called block operations. In this view, an n
x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q)
such that each block is an (n/q) x (n/q) submatrix.
• In this view, we perform q³ matrix multiplications, each involving
(n/q) x (n/q) matrices.
Matrix-Matrix Multiplication
• Consider two n x n matrices A and B partitioned into p blocks Ai,j
and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the
result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for
0 ≤ k < √p.
• All-to-all broadcast blocks of A along rows and B along columns.
• Perform local submatrix multiplication.
Matrix-Matrix Multiplication
• The two broadcasts take time 2(ts log(√p) + tw(n²/p)(√p – 1)).
• The computation requires √p multiplications of (n/√p) x (n/√p)
sized submatrices.
• The parallel run time is approximately
TP = n³/p + ts log p + 2twn²/√p.
• The algorithm is cost optimal and the isoefficiency is O(p^1.5) due to
bandwidth term tw and concurrency.
• Major drawback of the algorithm is that it is not memory optimal.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In this algorithm, we schedule the computations of the √p
processes of the ith row such that, at any given time, each process is
using a different block Ai,k.
• These blocks can be systematically rotated among the processes
after every submatrix multiplication so that every process gets a
fresh Ai,k after each rotation.
Matrix-Matrix Multiplication:
Cannon's Algorithm
Communication steps in Cannon's algorithm on 16 processes.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• Align the blocks of A and B in such a way that each process multiplies
its local submatrices. This is done by shifting all submatrices Ai,j to the
left (with wraparound) by i steps and all submatrices Bi,j up (with
wraparound) by j steps.
• Perform local block multiplication.
• Each block of A moves one step left and each block of B moves one
step up (again with wraparound).
• Perform next block multiplication, add to partial result, repeat until
all blocks have been multiplied.
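A numpy sketch of Cannon's algorithm on a √p x √p grid of blocks (simulated within one process; block (i, j) stands for process Pi,j):

    import numpy as np

    def cannon(A, B, q):                          # q = sqrt(p); n must be divisible by q
        n = A.shape[0]; b = n // q
        Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
        Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
        Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
        # alignment: shift row i of A left by i, column j of B up by j (with wraparound)
        Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
        for _ in range(q):                        # sqrt(p) compute-and-shift steps
            for i in range(q):
                for j in range(q):
                    Cb[i][j] += Ab[i][j] @ Bb[i][j]
            Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]   # A one step left
            Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]   # B one step up
        return np.block(Cb)

    n, q = 6, 3
    A = np.random.rand(n, n); B = np.random.rand(n, n)
    assert np.allclose(cannon(A, B, q), A @ B)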
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the maximum distance over which a
block shifts is √p – 1, the two shift operations require a total of
2(ts + twn²/p) time.
• Each of the √p single-step shifts in the compute-and-shift phase of
the algorithm takes ts + twn²/p time.
• The computation time for multiplying √p matrices of size
(n/√p) x (n/√p) is n³/p.
• The parallel time is approximately:
TP = n³/p + 2√p ts + 2twn²/√p.
• The cost-optimality and isoefficiency of the algorithm are identical to
those of the first algorithm, except that this one is memory optimal.
Matrix-Matrix Multiplication:
DNS Algorithm
• Uses a 3-D partitioning.
• Visualize the matrix multiplication algorithm as a cube: matrices A
and B come in two orthogonal faces and result C comes out the
other orthogonal face.
• Each internal node in the cube represents a single add-multiply
operation (and thus the Θ(n³) complexity).
• DNS algorithm partitions this cube using a 3-D block scheme.
Matrix-Matrix Multiplication:
DNS Algorithm
• Assume an n x n x n mesh of processors.
• Move the columns of A and rows of B and perform broadcast.
• Each processor computes a single add-multiply.
• This is followed by an accumulation along the C dimension.
• Since each add-multiply takes constant time and accumulation and
broadcast takes log n time, the total runtime is log n.
• This is not cost optimal. It can be made cost optimal by using n / log n
processors along the direction of accumulation.
Matrix-Matrix Multiplication:
DNS Algorithm
The communication steps in the DNS algorithm while
multiplying 4 x 4 matrices A and B on 64 processes.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processors.
• Assume that the number of processes p is equal to q³ for some q < n.
• The two matrices are partitioned into blocks of size (n/q) x(n/q).
• Each matrix can thus be regarded as a q x q two-dimensional square
array of blocks.
• The algorithm follows from the previous one, except, in this case, we
operate on blocks rather than on individual elements.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processors.
• The first one-to-one communication step is performed for both A
and B, and takes ts + tw(n/q)² time for each matrix.
• The two one-to-all broadcasts take (ts + tw(n/q)²) log q time for each
matrix.
• The reduction takes time (ts + tw(n/q)²) log q.
• Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time.
• The parallel time is approximated by:
TP ≈ n³/p + ts log p + tw(n²/p^(2/3)) log p.
• The isoefficiency function is Θ(p log³ p).
Solving a System of Linear Equations
• Consider the problem of solving linear equations of the kind:
a0,0x0 + a0,1x1 + … + a0,n-1xn-1 = b0
a1,0x0 + a1,1x1 + … + a1,n-1xn-1 = b1
…
an-1,0x0 + an-1,1x1 + … + an-1,n-1xn-1 = bn-1
• This is written as Ax = b, where A is an n x n matrix with A[i, j] = ai,j,
b is an n x 1 vector [b0, b1, …, bn-1]T, and x is the solution.
Solving a System of Linear Equations
Two steps in the solution are: reduction to triangular form, and
back-substitution. The triangular form is an upper-triangular system
(U[i, j] = 0 for i > j). We write this as: Ux = y.
A commonly used method for transforming a given matrix into an
upper-triangular matrix is Gaussian Elimination.
Gaussian Elimination
Serial Gaussian Elimination
Gaussian Elimination
• The computation has three nested loops - in the kth iteration of the
outer loop, the algorithm performs (n – k)² computations. Summing from
k = 1..n, we have roughly n³/3 multiplications-subtractions.
A typical computation in Gaussian elimination.
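A minimal Python/numpy sketch of this serial elimination loop (no pivoting; it assumes a nonzero A[k, k] at every step) that produces a unit upper-triangular U and the updated right-hand side y:

    import numpy as np

    def gaussian_eliminate(A, b):
        A = A.astype(float).copy(); y = b.astype(float).copy()
        n = A.shape[0]
        for k in range(n):
            pivot = A[k, k]                      # normalization (division) step
            A[k, k:] /= pivot
            y[k] /= pivot
            for i in range(k + 1, n):            # eliminate row k from rows k+1 .. n-1
                factor = A[i, k]
                A[i, k:] -= factor * A[k, k:]
                y[i] -= factor * y[k]
        return A, y                              # A now holds U, with Ux = y

    A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
    b = np.array([1., 2., 3.])
    U, y = gaussian_eliminate(A, b)
    assert np.allclose(A @ np.linalg.solve(U, y), b)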
Parallel Gaussian Elimination
• Assume p = n with each row assigned to a processor.
• The first step of the algorithm normalizes the row. This is a serial
operation and takes time (n – k) in the kth iteration.
• In the second step, the normalized row is broadcast to all the
processors. This takes time (ts + tw(n – k – 1)) log n.
• Each processor can independently eliminate this row from its own. This
requires (n – k – 1) multiplications and subtractions.
• The total parallel time can be computed by summing from k = 1 … n-1
as
TP = (3/2)n(n – 1) + tsn log n + (1/2)twn(n – 1) log n.
• The formulation is not cost optimal because of the tw term.
Parallel Gaussian Elimination
Gaussian elimination steps during the iteration corresponding to k =
3 for an 8 x 8 matrix partitioned rowwise among eight processes.
Parallel Gaussian Elimination:
Pipelined Execution
• In the previous formulation, the (k+1)st iteration starts only after all
the computation and communication for the kth iteration is
complete.
• In the pipelined version, there are three steps - normalization of a
row, communication, and elimination. These steps are performed in
an asynchronous fashion.
• A processor Pk waits to receive and eliminate all rows prior to k.
• Once it has done this, it forwards its own row to processor Pk+1.
Parallel Gaussian Elimination:
Pipelined Execution
Pipelined Gaussian elimination on a 5 x 5 matrix partitioned
with one row per process.
Parallel Gaussian Elimination:
Pipelined Execution
• The total number of steps in the entire pipelined procedure is Θ(n).
• In any step, either O(n) elements are communicated between
directly-connected processes, or a division step is performed on O(n)
elements of a row, or an elimination step is performed on O(n)
elements of a row.
• The parallel time is therefore O(n²).
• This is cost optimal.
Parallel Gaussian Elimination:
Pipelined Execution
The communication in the Gaussian elimination iteration
corresponding to k = 3 for an 8 x 8 matrix distributed among
four processes using block 1-D partitioning.
Parallel Gaussian Elimination:
Block 1D with p < n
• The above algorithm can be easily adapted to the case when p < n.
• In the kth iteration, a processor with all rows belonging to the active part
of the matrix performs (n – k – 1)n/p multiplications and subtractions.
• In the pipelined version, for n > p, computation dominates
communication.
• The parallel time is given approximately by n³/p.
• While the algorithm is cost optimal, the cost of the parallel algorithm is
higher than the sequential run time by a factor of 3/2.
Parallel Gaussian Elimination:
Block 1D with p < n
Computation load on different processes in block and cyclic
1-D partitioning of an 8 x 8 matrix on four processes during the
Gaussian elimination iteration corresponding to k = 3.
Parallel Gaussian Elimination:
Block 1D with p < n
• The load imbalance problem can be alleviated by using a cyclic
mapping.
• In this case, other than processing of the last p rows, there is no load
imbalance.
• This corresponds to a cumulative load imbalance overhead of O(n²p)
(instead of O(n³) in the previous case).
Gaussian Elimination
with Partial Pivoting
• For numerical stability, one generally uses partial pivoting.
• In the kth iteration, we select a column i (called the pivot column)
such that A[k, i] is the largest in magnitude among all A[k, j] with
k ≤ j < n.
• The k th and the i th columns are interchanged.
• Simple to implement with row-partitioning and does not add
overhead since the division step takes the same time as computing
the max.
• Column-partitioning, however, requires a global reduction, adding a
log p term to the overhead.
• Pivoting precludes the use of pipelining.
Gaussian Elimination with Partial
Pivoting: 2-D Partitioning
• Partial pivoting restricts use of pipelining, resulting in performance
loss.
• This loss can be alleviated by restricting pivoting to specific columns.
• Alternately, we can use faster algorithms for broadcast.
Solving a Triangular System:
Back-Substitution
• The upper triangular matrix U undergoes back-substitution to
determine the vector x.
A serial algorithm for back-substitution.
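A minimal Python/numpy sketch of serial back-substitution on Ux = y (roughly n²/2 multiply-subtract pairs):

    import numpy as np

    def back_substitute(U, y):
        n = U.shape[0]
        x = np.zeros(n)
        for k in range(n - 1, -1, -1):                 # solve from the last row upward
            x[k] = (y[k] - U[k, k+1:] @ x[k+1:]) / U[k, k]
        return x

    U = np.array([[2., 1., 1.], [0., 3., 2.], [0., 0., 4.]])
    y = np.array([7., 8., 4.])
    assert np.allclose(U @ back_substitute(U, y), y)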
Solving a Triangular System:
Back-Substitution
• The algorithm performs approximately n²/2 multiplications and
subtractions.
• Since complexity of this part is asymptotically lower, we should optimize
the data distribution for the factorization part.
• Consider a rowwise block 1-D mapping of the n x n matrix U with vector
y distributed uniformly.
• The value of the variable solved at a step can be pipelined back.
• Each step of a pipelined implementation requires a constant amount of
time for communication and Θ(n/p) time for computation.
• The parallel run time of the entire algorithm is Θ(n²/p).
Solving a Triangular System:
Back-Substitution
• If the matrix is partitioned by using 2-D partitioning on a logical
√p x √p mesh of processes, and the elements of the vector are
distributed along one of the columns of the process mesh, then only
the √p processes containing the vector perform any computation.
• Using pipelining to communicate the appropriate elements of U to
the process containing the corresponding elements of y for the
substitution step (line 7), the algorithm can be executed in Θ(n²/√p)
time.
• While this is not cost optimal, since this does not dominate the
overall computation, the cost optimality is determined by the
factorization.
Sorting Algorithms
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
To accompany the text ``Introduction to Parallel Computing'',
Addison Wesley, 2003.
Topic Overview
• Issues in Sorting on Parallel Computers
• Sorting Networks
• Bubble Sort and its Variants
• Quicksort
• Bucket and Sample Sort
• Other Sorting Algorithms
Sorting: Overview
• One of the most commonly used and well-studied kernels.
• Sorting can be comparison-based or noncomparison-based.
• The fundamental operation of comparison-based sorting is compare-
exchange.
• The lower bound on any comparison-based sort of n numbers is
Θ(n log n).
• We focus here on comparison-based sorting algorithms.
Sorting: Basics
What is a parallel sorted sequence? Where are the input and output lists
stored?
• We assume that the input and output lists are distributed.
• The sorted list is partitioned with the property that each partitioned list is
sorted and each element in processor Pi's list is less than every element in
Pj's list if i < j.
Sorting: Parallel Compare Exchange Operation
A parallel compare-exchange operation. Processes Pi and Pj send their
elements to each other. Process Pi keeps min{ai,aj}, and Pj keeps max{ai,
aj}.
Sorting: Basics
What is the parallel counterpart to a sequential comparator?
• If each processor has one element, the compare exchange operation stores
the smaller element at the processor with smaller id. This can be done in ts
+ tw time.
• If we have more than one element per processor, we call this operation a
compare split. Assume each of the two processors has n/p elements.
• After the compare-split operation, the smaller n/p elements are at
processor Pi and the larger n/p elements at Pj, where i < j.
• The time for a compare-split operation is (ts+ twn/p), assuming that the
two partial lists were initially sorted.
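A small Python sketch of compare-split (both blocks assumed already sorted, as in the analysis above):

    from heapq import merge

    def compare_split(block_i, block_j):
        # Pi and Pj exchange blocks, both merge them, and keep opposite halves
        merged = list(merge(block_i, block_j))
        half = len(block_i)
        return merged[:half], merged[half:]       # (smaller half for Pi, larger half for Pj)

    lo, hi = compare_split([1, 6, 8, 11, 13], [2, 7, 9, 10, 12])
    print(lo, hi)    # [1, 2, 6, 7, 8] [9, 10, 11, 12, 13]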
Sorting: Parallel Compare Split Operation
A compare-split operation. Each process sends its block of size n/p to
the other process. Each process merges the received block with its
own block and retains only the appropriate half of the merged block.
In this example, process Pi retains the smaller elements and process Pj
retains the larger elements.
Sorting Networks
• Networks of comparators designed specifically for sorting.
• A comparator is a device with two inputs x and y and two outputs x'
and y'. For an increasing comparator, x' = min{x,y} and y' =
max{x,y}; for a decreasing comparator, the reverse.
• We denote an increasing comparator by ⊕ and a decreasing
comparator by Ө.
• The speed of the network is proportional to its depth.
Sorting Networks: Comparators
A schematic representation of comparators: (a) an increasing comparator,
and (b) a decreasing comparator.
Sorting Networks
A typical sorting network. Every sorting network is made up of a
series of columns, and each column contains a number of
comparators connected in parallel.
Sorting Networks: Bitonic Sort
• A bitonic sorting network sorts n elements in Θ(log² n) time.
• A bitonic sequence has two tones - increasing and decreasing, or vice versa.
Any cyclic rotation of such a sequence is also considered bitonic.
• 〈1,2,4,7,6,0〉 is a bitonic sequence, because it first increases and then
decreases. 〈8,9,2,1,0,4〉 is another bitonic sequence, because it is a cyclic
shift of 〈0,4,8,9,2,1〉.
• The kernel of the network is the rearrangement of a bitonic sequence into a
sorted sequence.
Sorting Networks: Bitonic Sort
• Let s = 〈a0,a1,…,an-1〉 be a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ an/2-1
and an/2 ≥ an/2+1 ≥ ··· ≥ an-1.
• Consider the following subsequences of s:
s1 = 〈min{a0,an/2},min{a1,an/2+1},…,min{an/2-1,an-1}〉
s2 = 〈max{a0,an/2},max{a1,an/2+1},…,max{an/2-1,an-1}〉
(1)
• Note that s1 and s2 are both bitonic and each element of s1 is less than
every element in s2.
• We can apply the procedure recursively on s1 and s2 to get the sorted
sequence.
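A small Python sketch of this recursive bitonic merge (input length a power of two and bitonic; increasing=False yields a decreasing output, matching the Ө comparators):

    def bitonic_merge(s, increasing=True):
        n = len(s)
        if n == 1:
            return s
        half = n // 2
        keep_lo = min if increasing else max
        keep_hi = max if increasing else min
        s1 = [keep_lo(s[i], s[i + half]) for i in range(half)]   # both halves are bitonic,
        s2 = [keep_hi(s[i], s[i + half]) for i in range(half)]   # and s1 <= s2 elementwise
        return bitonic_merge(s1, increasing) + bitonic_merge(s2, increasing)

    print(bitonic_merge([1, 2, 4, 7, 6, 0, -1, -5]))   # [-5, -1, 0, 1, 2, 4, 6, 7]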
Sorting Networks: Bitonic Sort
Merging a 16-element bitonic sequence through a series of log 16
bitonic splits.
Sorting Networks: Bitonic Sort
• We can easily build a sorting network to implement this bitonic merge
algorithm.
• Such a network is called a bitonic merging network.
• The network contains log n columns. Each column contains n/2
comparators and performs one step of the bitonic merge.
• We denote a bitonic merging network with n inputs by ⊕BM[n].
• Replacing the ⊕ comparators by Ө comparators results in a decreasing
output sequence; such a network is denoted by ӨBM[n].
Sorting Networks: Bitonic Sort
A bitonic merging network for n = 16. The input wires are numbered 0,1,…, n
- 1, and the binary representation of these numbers is shown. Each column of
comparators is drawn separately; the entire figure represents a ⊕BM[16]
bitonic merging network. The network takes a bitonic sequence and outputs it
in sorted order.
Sorting Networks: Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
• We must first build a single bitonic sequence from the given sequence.
• A sequence of length 2 is a bitonic sequence.
• A bitonic sequence of length 4 can be built by sorting the first two elements
using ⊕BM[2] and next two, using ӨBM[2].
• This process can be repeated to generate larger bitonic sequences.
Sorting Networks: Bitonic Sort
A schematic representation of a network that converts an input
sequence into a bitonic sequence. In this example, ⊕BM[k] and
ӨBM[k] denote bitonic merging networks of input size k that use ⊕
and Ө comparators, respectively. The last merging network
(⊕BM[16]) sorts the input. In this example, n = 16.
Sorting Networks: Bitonic Sort
The comparator network that transforms an input sequence of 16
unordered numbers into a bitonic sequence.
Sorting Networks: Bitonic Sort
• The depth of the network is Θ(log² n).
• Each stage of the network contains n/2 comparators. A serial
implementation of the network would have complexity Θ(n log² n).
Mapping Bitonic Sort to Hypercubes
• Consider the case of one item per processor. The question becomes one of
how the wires in the bitonic network should be mapped to the hypercube
interconnect.
• Note from our earlier examples that the compare-exchange operation is
performed between two wires only if their labels differ in exactly one bit!
• This implies a direct mapping of wires to processors. All communication is
nearest neighbor!
Mapping Bitonic Sort to Hypercubes
Communication during the last stage of bitonic sort. Each wire is mapped
to a hypercube process; each connection represents a compare-
exchange between processes.
Mapping Bitonic Sort to Hypercubes
Communication characteristics of bitonic sort on a hypercube. During
each stage of the algorithm, processes communicate along the
dimensions shown.
Mapping Bitonic Sort to Hypercubes
Parallel formulation of bitonic sort on a hypercube with n = 2^d
processes.
Mapping Bitonic Sort to Hypercubes
• During each step of the algorithm, every process performs a
compare-exchange operation (single nearest neighbor
communication of one word).
• Since each step takes Θ(1) time, the parallel time is
TP = Θ(log² n) (2)
• This algorithm is cost optimal w.r.t. its serial counterpart, but not
w.r.t. the best sorting algorithm.
Mapping Bitonic Sort to Meshes
• The connectivity of a mesh is lower than that of a hypercube, so we
must expect some overhead in this mapping.
• Consider the row-major shuffled mapping of wires to processors.
Mapping Bitonic Sort to Meshes
Different ways of mapping the input wires of the bitonic sorting network
to a mesh of processes: (a) row-major mapping, (b) row-major snakelike
mapping, and (c) row-major shuffled mapping.
Mapping Bitonic Sort to Meshes
The last stage of the bitonic sort algorithm for n = 16 on a mesh, using
the row-major shuffled mapping. During each step, process pairs
compare-exchange their elements. Arrows indicate the pairs of
processes that perform compare-exchange operations.
Mapping Bitonic Sort to Meshes
• In the row-major shuffled mapping, wires that differ at the ith least-
significant bit are mapped onto mesh processes that are 2^⌊(i-1)/2⌋
communication links away.
• The total amount of communication performed by each process is
∑i=1..log n ∑j=1..i 2^⌊(j-1)/2⌋ ≈ 7√n, or Θ(√n). The total computation
performed by each process is Θ(log² n).
• The parallel runtime is TP = Θ(√n).
• This is not cost optimal.
Block of Elements Per Processor
• Each process is assigned a block of n/p elements.
• The first step is a local sort of the local block.
• Each subsequent compare-exchange operation is replaced by a
compare-split operation.
• We can effectively view the bitonic network as having (1 + log p)
(log p)/2 steps.
Block of Elements Per Processor: Hypercube
• Initially the processes sort their n/p elements (using merge sort) in time
Θ((n/p) log(n/p)) and then perform Θ(log² p) compare-split steps.
• The parallel run time of this formulation is
TP = Θ((n/p) log(n/p)) + Θ((n/p) log² p).
• Comparing to an optimal sort, the algorithm can efficiently use up to
p = Θ(2^√(log n)) processes.
• The isoefficiency function due to both communication and extra work is
Θ(p^(log p) log² p).
Block of Elements Per Processor: Mesh
• The parallel runtime in this case is given by:
TP = Θ((n/p) log(n/p)) + Θ((n/p) log² p) + Θ(n/√p).
• This formulation can efficiently use up to p = Θ(log² n) processes.
• The isoefficiency function is Θ(√p 2^√p).
Performance of Parallel Bitonic Sort
The performance of parallel formulations of bitonic sort for n elements
on p processes.
Bubble Sort and its Variants
The sequential bubble sort algorithm compares and exchanges
adjacent elements in the sequence to be sorted:
Sequential bubble sort algorithm.
Bubble Sort and its Variants
• The complexity of bubble sort is Θ(n²).
• Bubble sort is difficult to parallelize since the algorithm has no
concurrency.
• A simple variant, though, uncovers the concurrency.
Odd-Even Transposition
Sequential odd-even transposition sort algorithm.
Odd-Even Transposition
Sorting n = 8 elements, using the odd-even transposition sort
algorithm. During each phase, n = 8 elements are compared.
Odd-Even Transposition
• After n phases of odd-even exchanges, the sequence is sorted.
• Each phase of the algorithm (either odd or even) requires Θ(n)
comparisons.
• Serial complexity is Θ(n²).
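A minimal Python sketch of the sequential odd-even transposition sort referenced above:

    def odd_even_transposition_sort(a):
        a = list(a)
        n = len(a)
        for phase in range(n):                    # n phases suffice
            start = 0 if phase % 2 == 0 else 1    # even phases pair (0,1),(2,3),...; odd pair (1,2),(3,4),...
            for i in range(start, n - 1, 2):
                if a[i] > a[i + 1]:               # compare-exchange
                    a[i], a[i + 1] = a[i + 1], a[i]
        return a

    print(odd_even_transposition_sort([5, 9, 4, 3, 1, 2, 8, 7]))   # [1, 2, 3, 4, 5, 7, 8, 9]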
Parallel Odd-Even Transposition
• Consider the one item per processor case.
• There are n iterations, in each iteration, each processor does one
compare-exchange.
• The parallel run time of this formulation is Θ(n).
• This is cost optimal with respect to the base serial algorithm but not
the optimal one.
Parallel Odd-Even Transposition
Parallel formulation of odd-even transposition.
Parallel Odd-Even Transposition
• Consider a block of n/p elements per processor.
• The first step is a local sort.
• In each subsequent step, the compare exchange operation is
replaced by the compare split operation.
• The parallel run time of the formulation is
TP = Θ((n/p) log(n/p)) + Θ(n) + Θ(n),
where the three terms are the local sort, the comparisons, and the
communication over the p compare-split phases.
Parallel Odd-Even Transposition
• The parallel formulation is cost-optimal for p = O(log n).
• The isoefficiency function of this parallel formulation is Θ(p 2^p).
Shellsort
• Let n be the number of elements to be sorted and p be the number
of processes.
• During the first phase, processes that are far away from each other in
the array compare-split their elements.
• During the second phase, the algorithm switches to an odd-even
transposition sort.
Parallel Shellsort
• Initially, each process sorts its block of n/p elements internally.
• Each process is now paired with its corresponding process in the reverse
order of the array. That is, process Pi, where i < p/2, is paired with
process Pp-i-1.
• A compare-split operation is performed.
• The processes are split into two groups of size p/2 each and the
process repeated in each group.
Parallel Shellsort
An example of the first phase of parallel shellsort on an eight-process
array.
Parallel Shellsort
• Each process performs d = log p compare-split operations.
• With O(p) bisection width, each communication can be performed in time
Θ(n/p) for a total time of Θ((n log p)/p).
• In the second phase, l odd and even phases are performed, each requiring
time Θ(n/p).
• The parallel run time of the algorithm is:
TP = Θ((n/p) log(n/p)) + Θ((n log p)/p) + Θ(l n/p).
Quicksort
• Quicksort is one of the most common sorting algorithms for
sequential computers because of its simplicity, low overhead, and
optimal average complexity.
• Quicksort selects one of the entries in the sequence to be the pivot
and divides the sequence into two - one with all elements less than
the pivot and the other with all elements greater.
• The process is recursively applied to each of the sublists.
Quicksort
The sequential quicksort algorithm.
Quicksort
Example of the quicksort algorithm sorting a sequence of size n = 8.
Quicksort
• The performance of quicksort depends critically on the quality of the
pivot.
• In the best case, the pivot divides the list in such a way that the
larger of the two lists does not have more than αn elements (for
some constant α).
• In this case, the complexity of quicksort is O(nlog n).
Parallelizing Quicksort
• Lets start with recursive decomposition - the list is partitioned
serially and each of the subproblems is handled by a different
processor.
• The time for this algorithm is lower-bounded by Ω(n)!
• Can we parallelize the partitioning step - in particular, if we can use n
processors to partition a list of length n around a pivot in O(1) time,
we have a winner.
• This is difficult to do on real machines, though.
Parallelizing Quicksort: PRAM Formulation
• We assume a CRCW (concurrent read, concurrent write) PRAM with
concurrent writes resulting in an arbitrary write succeeding.
• The formulation works by creating pools of processors. Every processor is
assigned to the same pool initially and has one element.
• Each processor attempts to write its element to a common location (for the
pool).
• Each processor tries to read back the location. If the value read back is
greater than the processor's value, it assigns itself to the `left' pool, else, it
assigns itself to the `right' pool.
• Each pool performs this operation recursively.
• Note that the algorithm generates a tree of pivots. The depth of the tree
determines the parallel runtime; its expected value is O(log n).
Parallelizing Quicksort: PRAM
Formulation
A binary tree generated by the execution of the quicksort algorithm. Each
level of the tree represents a different array-partitioning iteration. If
pivot selection is optimal, then the height of the tree is Θ(log n), which
is also the number of iterations.
Parallelizing Quicksort: PRAM Formulation
The execution of the PRAM algorithm on the array shown in (a).
Parallelizing Quicksort: Shared Address Space
Formulation
• Consider a list of size n equally divided across p processors.
• A pivot is selected by one of the processors and made known to all
processors.
• Each processor partitions its list into two, say Li and Ui, based on the
selected pivot.
• All of the Li lists are merged and all of the Ui lists are merged
separately.
• The set of processors is partitioned into two (in proportion of the size
of lists L and U). The process is recursively applied to each of the
lists.
Shared Address Space Formulation
Parallelizing Quicksort: Shared Address Space
Formulation
• The only thing we have not described is the global reorganization
(merging) of local lists to form L and U.
• The problem is one of determining the right location for each element in
the merged list.
• Each processor computes the number of elements locally less than and
greater than pivot.
• It computes two sum-scans to determine the starting location for its
elements in the merged L and U lists.
• Once it knows the starting locations, it can write its elements safely.
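A Python sketch of this rearrangement step (one pivot level, simulated sequentially): the exclusive prefix sums of the per-process counts give each process its write offsets into the merged L and U arrays.

    from itertools import accumulate

    def global_rearrange(local_lists, pivot):
        smaller = [[x for x in lst if x <= pivot] for lst in local_lists]   # local Li
        larger = [[x for x in lst if x > pivot] for lst in local_lists]     # local Ui
        s_counts = [len(s) for s in smaller]
        l_counts = [len(u) for u in larger]
        s_offsets = [0] + list(accumulate(s_counts))[:-1]   # exclusive sum-scan
        l_offsets = [0] + list(accumulate(l_counts))[:-1]
        total_s = sum(s_counts)
        out = [None] * (total_s + sum(l_counts))
        for proc, (s, u) in enumerate(zip(smaller, larger)):
            out[s_offsets[proc]:s_offsets[proc] + len(s)] = s                         # write into L
            out[total_s + l_offsets[proc]:total_s + l_offsets[proc] + len(u)] = u     # write into U
        return out, total_s                        # L is out[:total_s], U is out[total_s:]

    lists = [[7, 13, 18, 2], [17, 1, 14, 20], [6, 10, 15, 9], [3, 16, 19, 4]]
    print(global_rearrange(lists, pivot=10))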
Parallelizing Quicksort: Shared Address Space
Formulation
Efficient global rearrangement of the array.
Parallelizing Quicksort: Shared Address Space
Formulation
• The parallel time depends on the split and merge time, and the quality
of the pivot.
• The latter is an issue independent of parallelism, so we focus on the first
aspect, assuming ideal pivot selection.
• The algorithm executes in four steps: (i) determine and broadcast the
pivot; (ii) locally rearrange the array assigned to each process; (iii)
determine the locations in the globally rearranged array that the local
elements will go to; and (iv) perform the global rearrangement.
• The first step takes time Θ(log p), the second, Θ(n/p) , the third,
Θ(log p) , and the fourth, Θ(n/p).
• The overall complexity of splitting an n-element array is Θ(n/p) +
Θ(log p).
Parallelizing Quicksort: Shared Address Space
Formulation
• The process recurses until there are p lists, at which point, the lists are
sorted locally.
• Therefore, the total parallel time is:
TP = Θ((n/p) log(n/p)) + Θ((n/p) log p) + Θ(log² p).
• The corresponding isoefficiency is Θ(p log² p) due to broadcast and scan
operations.
Parallelizing Quicksort: Message Passing Formulation
• A simple message passing formulation is based on the recursive halving
of the machine.
• Assume that each processor in the lower half of a p processor ensemble
is paired with a corresponding processor in the upper half.
• A designated processor selects and broadcasts the pivot.
• Each processor splits its local list into two lists, one with elements less
than the pivot (Li) and the other with elements greater than the pivot (Ui).
• A processor in the low half of the machine sends its list Ui to the paired
processor in the other half. The paired processor sends its list Li.
• It is easy to see that after this step, all elements less than the pivot are in
the low half of the machine and all elements greater than the pivot are
in the high half.
Parallelizing Quicksort: Message Passing Formulation
• The above process is recursed until each processor has its own local list,
which is sorted locally.
• The time for a single reorganization is Θ(log p) for broadcasting the pivot
element, Θ(n/p) for splitting the locally assigned portion of the array,
Θ(n/p) for exchange and local reorganization.
• We note that this time is identical to that of the corresponding shared
address space formulation.
• It is important to remember that the reorganization of elements is a
bandwidth sensitive operation.
Graph Algorithms
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003
Topic Overview
• Definitions and Representation
• Minimum Spanning Tree: Prim's Algorithm
• Single-Source Shortest Paths: Dijkstra's Algorithm
• All-Pairs Shortest Paths
• Transitive Closure
• Connected Components
• Algorithms for Sparse Graphs
Definitions and Representation
• An undirected graph G is a pair (V,E), where V is a finite set of points
called vertices and E is a finite set of edges.
• An edge e ∈ E is an unordered pair (u,v), where u,v ∈ V.
• In a directed graph, the edge e is an ordered pair (u,v). An edge (u,v)
is incident from vertex u and is incident to vertex v.
• A path from a vertex v to a vertex u is a sequence <v0,v1,v2,…,vk> of
vertices where v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1,…, k-1.
• The length of a path is defined as the number of edges in the path.
Definitions and Representation
a) An undirected graph and (b) a directed graph.
Definitions and Representation
• An undirected graph is connected if every pair of vertices is
connected by a path.
• A forest is an acyclic graph, and a tree is a connected acyclic graph.
• A graph that has weights associated with each edge is called a
weighted graph.
Definitions and Representation
• Graphs can be represented by their adjacency matrix or an edge (or
vertex) list.
• Adjacency matrices have a value ai,j = 1 if nodes i and j share an edge;
0 otherwise. In case of a weighted graph, ai,j = wi,j, the weight of the
edge.
• The adjacency list representation of a graph G = (V,E) consists of an
array Adj[1..|V|] of lists. Each list Adj[v] is a list of all vertices
adjacent to v.
• For a graph with n nodes, an adjacency matrix takes Θ(n²) space and
an adjacency list takes Θ(|E|) space.
Definitions and Representation
An undirected graph and its adjacency matrix representation.
An undirected graph and its adjacency list representation.
Minimum Spanning Tree
• A spanning tree of an undirected graph G is a subgraph of G that is a
tree containing all the vertices of G.
• In a weighted graph, the weight of a subgraph is the sum of the
weights of the edges in the subgraph.
• A minimum spanning tree (MST) for a weighted undirected graph is a
spanning tree with minimum weight.
Minimum Spanning Tree
An undirected graph and its minimum spanning tree.
Minimum Spanning Tree: Prim's
Algorithm
• Prim's algorithm for finding an MST is a greedy algorithm.
• Start by selecting an arbitrary vertex, include it into the current MST.
• Grow the current MST by inserting into it the vertex closest to one of
the vertices already in current MST.
Minimum Spanning Tree: Prim's Algorithm
Prim's minimum spanning tree algorithm.
Minimum Spanning Tree: Prim's
Algorithm
Prim's sequential minimum spanning tree algorithm.
Prim's Algorithm: Parallel Formulation
• The algorithm works in n outer iterations - it is hard to execute these
iterations concurrently.
• The inner loop is relatively easy to parallelize. Let p be the number of
processes, and let n be the number of vertices.
• The adjacency matrix is partitioned in a 1-D block fashion, with distance
vector d partitioned accordingly.
• In each step, a processor selects the locally closest node, followed by a
global reduction to select globally closest node.
• This node is inserted into MST, and the choice broadcast to all
processors.
• Each processor updates its part of the d vector locally.
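A compact numpy sketch of the loop being parallelized (run on one process here; with the 1-D partitioning above, the argmin over d becomes a local minimum followed by a global all-to-one reduction and a broadcast, while the d update stays purely local):

    import numpy as np

    def prim_mst_weight(W):                       # W[i, j] = edge weight, np.inf if absent
        n = W.shape[0]
        in_tree = np.zeros(n, dtype=bool)
        d = np.full(n, np.inf)
        d[0] = 0.0                                # start the MST from vertex 0
        total = 0.0
        for _ in range(n):                        # n sequential outer iterations
            u = int(np.argmin(np.where(in_tree, np.inf, d)))   # the reduction step
            total += d[u]
            in_tree[u] = True
            d = np.where(in_tree, d, np.minimum(d, W[u]))      # local update of d
        return total

    INF = np.inf
    W = np.array([[INF, 1.0, 3.0, INF],
                  [1.0, INF, 1.0, 6.0],
                  [3.0, 1.0, INF, 2.0],
                  [INF, 6.0, 2.0, INF]])
    print(prim_mst_weight(W))    # 4.0  (MST edges 0-1, 1-2, 2-3)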
Prim's Algorithm: Parallel Formulation
The partitioning of the distance array d and the adjacency matrix A among p
processes.
Prim's Algorithm: Parallel Formulation
• The cost to select the minimum entry is O(n/p + log p).
• The cost of a broadcast is O(log p).
• The cost of local updation of the d vector is O(n/p).
• The parallel time per iteration is O(n/p + log p).
• The total parallel time is given by O(n²/p + n log p).
• The corresponding isoefficiency is O(p² log² p).
  • 5. The Prefix-Sum Operation Prefix sums on a d-dimensional hypercube.
  • 6. Scatter and Gather • In the scatter operation, a single node sends a unique message of size m to every other node (also called a one-to-all personalized communication). • In the gather operation, a single node collects a unique message from each node. • While the scatter operation is fundamentally different from broadcast, the algorithmic structure is similar, except for differences in message sizes (messages get smaller in scatter and stay constant in broadcast). • The gather operation is exactly the inverse of the scatter operation and can be executed as such.
  • 7. Gather and Scatter Operations Scatter and gather operations.
  • 8. Example of the Scatter Operation The scatter operation on an eight-node hypercube.
  • 9. Cost of Scatter and Gather • There are log p steps, in each step, the machine size halves and the data size halves. • We have the time for this operation to be: • This time holds for a linear array as well as a 2-D mesh. • These times are asymptotically optimal in message size.
  • 10. All-to-All Personalized Communication • Each node has a distinct message of size m for every other node. • This is unlike all-to-all broadcast, in which each node sends the same message to all other nodes. • All-to-all personalized communication is also known as total exchange.
  • 12. All-to-All Personalized Communication: Example • Consider the problem of transposing a matrix. • Each processor contains one full row of the matrix. • The transpose operation in this case is identical to an all-to-all personalized communication operation.
  • 13. All-to-All Personalized Communication: Example All-to-all personalized communication in transposing a 4 x 4 matrix using four processes.
  • 14. All-to-All Personalized Communication on a Ring • Each node sends all pieces of data as one consolidated message of size m(p – 1) to one of its neighbors. • Each node extracts the information meant for it from the data received, and forwards the remaining (p – 2) pieces of size m each to the next node. • The algorithm terminates in p – 1 steps. • The size of the message reduces by m at each step.
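As a concrete illustration (not part of the original slides), the following Python sketch simulates the ring schedule just described: each node forwards one consolidated message to its neighbor, extracts the piece addressed to it, and passes the rest along for p – 1 steps. The function name `ring_total_exchange` and the (src, dst) message tags are assumptions made for this example.

```python
# Minimal sketch: simulate all-to-all personalized communication on a p-node ring.
# Each message is tagged (src, dst); node i starts with one message for every other node.
def ring_total_exchange(p):
    # outgoing[i] holds the consolidated message node i will forward next
    outgoing = [[(i, dst) for dst in range(p) if dst != i] for i in range(p)]
    received = [[] for _ in range(p)]
    for step in range(p - 1):
        incoming = [None] * p
        for i in range(p):
            incoming[(i + 1) % p] = outgoing[i]                   # send to right neighbor
        for i in range(p):
            received[i].extend(m for m in incoming[i] if m[1] == i)   # extract own pieces
            outgoing[i] = [m for m in incoming[i] if m[1] != i]       # forward the rest
    return received

if __name__ == "__main__":
    recv = ring_total_exchange(6)
    # every node should end up with exactly p - 1 pieces, one from each other node
    assert all(sorted(src for src, _ in recv[i]) ==
               [j for j in range(6) if j != i] for i in range(6))
    print("ring total exchange completed in p - 1 = 5 steps")
```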
  • 15. All-to-All Personalized Communication on a Ring All-to-all personalized communication on a six-node ring. The label of each message is of the form {x,y}, where x is the label of the node that originally owned the message, and y is the label of the node that is its final destination. The label ({x1,y1}, {x2,y2},…,{xn,yn}) indicates a message formed by concatenating n individual messages.
  • 16. All-to-All Personalized Communication on a Ring: Cost • We have p – 1 steps in all. • In step i, the message size is m(p – i). • The total time is given by T = ∑i=1..p–1 (ts + twm(p – i)) = (ts + twmp/2)(p – 1). • The tw term in this equation can be reduced by a factor of 2 by communicating messages in both directions.
  • 17. All-to-All Personalized Communication on a Mesh • Each node first groups its p messages according to the columns of their destination nodes. • All-to-all personalized communication is performed independently in each row with clustered messages of size m√p. • Messages in each node are sorted again, this time according to the rows of their destination nodes. • All-to-all personalized communication is performed independently in each column with clustered messages of size m√p.
  • 18. All-to-All Personalized Communication on a Mesh The distribution of messages at the beginning of each phase of all-to-all personalized communication on a 3 x 3 mesh. At the end of the second phase, node i has messages ({0,i},…, {8,i}), where 0 ≤ i ≤ 8. The groups of nodes communicating together in each phase are enclosed in dotted boundaries.
  • 19. All-to-All Personalized Communication on a Mesh: Cost • Time for the first phase is identical to that in a ring with √p processors, i.e., (ts + twmp/2)(√p – 1). • Time in the second phase is identical to the first phase. Therefore, the total time is twice this, i.e., T = (2ts + twmp)(√p – 1). • It can be shown that the time for the local rearrangement of messages is much less than this communication time.
  • 20. All-to-All Personalized Communication on a Hypercube • Generalize the mesh algorithm to log p steps. • At any stage in all-to-all personalized communication, every node holds p packets of size m each. • While communicating in a particular dimension, every node sends p/2 of these packets (consolidated as one message). • A node must rearrange its messages locally before each of the log p communication steps.
  • 21. All-to-All Personalized Communication on a Hypercube An all-to-all personalized communication algorithm on a three-dimensional hypercube.
  • 22. All-to-All Personalized Communication on a Hypercube: Cost • We have log p iterations and mp/2 words are communicated in each iteration. Therefore, the cost is T = (ts + twmp/2) log p. • This is not optimal!
  • 23. All-to-All Personalized Communication on a Hypercube: Optimal Algorithm • Each node simply performs p – 1 communication steps, exchanging m words of data with a different node in every step. • A node must choose its communication partner in each step so that the hypercube links do not suffer congestion. • In the jth communication step, node i exchanges data with node (i XOR j). • In this schedule, all paths in every communication step are congestion-free, and none of the bidirectional links carry more than one message in the same direction.
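To make the pairing rule tangible, here is a short Python check (an illustration added to the transcript, not the deck's own procedure) that in step j every node i is matched with node i XOR j, that each step is a perfect matching, and that every pair of nodes meets exactly once over the p – 1 steps. The function name `xor_schedule` is an assumption for this example.

```python
# Sketch: the XOR pairing schedule for all-to-all personalized communication
# on a hypercube with p = 2^d nodes. In step j (1 <= j <= p-1), node i exchanges
# m words with node i ^ j; every pair of nodes is matched exactly once.
def xor_schedule(d):
    p = 1 << d
    steps = []
    for j in range(1, p):
        pairs = {(min(i, i ^ j), max(i, i ^ j)) for i in range(p)}
        assert len(pairs) == p // 2          # each step is a perfect matching of p/2 pairs
        steps.append(sorted(pairs))
    return steps

if __name__ == "__main__":
    sched = xor_schedule(3)                  # eight-node hypercube, seven steps
    all_pairs = [pr for step in sched for pr in step]
    assert len(all_pairs) == len(set(all_pairs)) == 8 * 7 // 2
    for j, step in enumerate(sched, start=1):
        print(f"step {j}: {step}")
```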
  • 24. All-to-All Personalized Communication on a Hypercube: Optimal Algorithm Seven steps in all-to-all personalized communication on an eight-node hypercube.
  • 25. All-to-All Personalized Communication on a Hypercube: Optimal Algorithm A procedure to perform all-to-all personalized communication on a d- dimensional hypercube. The message Mi,j initially resides on node i and is destined for node j.
  • 26. All-to-All Personalized Communication on a Hypercube: Cost Analysis of Optimal Algorithm • There are p – 1 steps and each step involves a congestion-free transfer of m words. • We have T = (ts + twm)(p – 1). • This is asymptotically optimal in message size.
  • 27. Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text “Introduction to Parallel Computing”, Addison Wesley, 2003.
  • 28. Topic Overview • Matrix-Vector Multiplication • Matrix-Matrix Multiplication • Solving a System of Linear Equations
  • 29. Matrix Algorithms: Introduction • Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data-decomposition. • Typical algorithms rely on input, output, or intermediate data decomposition. • Most algorithms use one- and two-dimensional block, cyclic, and block-cyclic partitionings.
  • 30. Matrix-Vector Multiplication • We aim to multiply a dense n x n matrix A with an n x 1 vector x to yield the n x 1 result vector y. • The serial algorithm requires n2 multiplications and additions.
  • 31. Matrix-Vector Multiplication: Rowwise 1-D Partitioning • The n x n matrix is partitioned among n processors, with each processor storing one complete row of the matrix. • The n x 1 vector x is distributed such that each process owns one of its elements.
  • 32. Matrix-Vector Multiplication: Rowwise 1-D Partitioning Multiplication of an n x n matrix with an n x 1 vector using rowwise block 1-D partitioning. For the one-row-per-process case, p = n.
  • 33. Matrix-Vector Multiplication: Rowwise 1-D Partitioning • Since each process starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processes. • Process Pi now computes y[i] = Σj A[i, j] · x[j]. • The all-to-all broadcast and the computation of y[i] both take time Θ(n). Therefore, the parallel time is Θ(n).
  • 34. Matrix-Vector Multiplication: Rowwise 1-D Partitioning • Consider now the case when p < n and we use block 1-D partitioning. • Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p. • The all-to-all broadcast takes place among p processes and involves messages of size n/p. • This is followed by n/p local dot products of length n each. • Thus, the parallel run time of this procedure is TP = n²/p + ts log p + twn. This is cost-optimal.
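The following NumPy sketch (added for illustration; it simulates the processes sequentially rather than exchanging real messages) mirrors the block 1-D scheme: each "process" holds n/p rows and n/p vector elements, the all-to-all broadcast is modeled by assembling the full vector, and the local dot products produce that process's slice of y.

```python
import numpy as np

# Sketch of rowwise block 1-D matrix-vector multiplication with p processes,
# simulated sequentially: each "process" holds n/p rows of A and n/p elements of x.
def matvec_rowwise_1d(A, x, p):
    n = A.shape[0]
    assert n % p == 0
    b = n // p
    x_blocks = [x[k * b:(k + 1) * b] for k in range(p)]
    # all-to-all broadcast of the vector blocks: every process assembles the full x
    x_full = np.concatenate(x_blocks)
    # each process computes its n/p entries of y with local dot products
    y_blocks = [A[k * b:(k + 1) * b, :] @ x_full for k in range(p)]
    return np.concatenate(y_blocks)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    x = rng.standard_normal(8)
    assert np.allclose(matvec_rowwise_1d(A, x, 4), A @ x)
```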
  • 35. Matrix-Vector Multiplication: Rowwise 1-D Partitioning Scalability Analysis: • We know that T0 = pTP – W; therefore, we have T0 = tsp log p + twnp. • For isoefficiency, we have W = KT0, where K = E/(1 – E) for the desired efficiency E. • From this, we have W = O(p²) (from the tw term). • There is also a bound on isoefficiency because of concurrency. In this case, p < n; therefore, W = n² = Ω(p²). • Overall isoefficiency is W = O(p²).
  • 36. Matrix-Vector Multiplication: 2-D Partitioning • The n x n matrix is partitioned among n2 processors such that each processor owns a single element. • The n x 1 vector x is distributed only in the last column of n processors.
  • 37. Matrix-Vector Multiplication: 2-D Partitioning Matrix-vector multiplication with block 2-D partitioning. For the one-element-per-process case, p = n2 if the matrix size is n x n .
  • 38. Matrix-Vector Multiplication: 2-D Partitioning • We must first align the vector with the matrix appropriately. • The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix. • The second step copies the vector elements from each diagonal process to all the processes in the corresponding column using n simultaneous broadcasts among all processors in the column. • Finally, the result vector is computed by performing an all-to-one reduction along the columns.
  • 39. Matrix-Vector Multiplication: 2-D Partitioning • Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row. • Each of these operations takes Θ(log n) time and the parallel time is Θ(log n) . • The cost (process-time product) is Θ(n2 log n) ; hence, the algorithm is not cost-optimal.
  • 40. Matrix-Vector Multiplication: 2-D Partitioning • When using fewer than n² processors, each process owns an (n/√p) x (n/√p) block of the matrix. • The vector is distributed in portions of n/√p elements in the last process-column only. • In this case, the message sizes for the alignment, broadcast, and reduction are all n/√p. • The computation is a product of an (n/√p) x (n/√p) submatrix with a vector of length n/√p.
  • 41. Matrix-Vector Multiplication: 2-D Partitioning • The first alignment step takes time ts + twn/√p. • The broadcast and reductions take time (ts + twn/√p) log(√p) each. • Local matrix-vector products take time n²/p. • Total time is TP ≈ n²/p + ts log p + tw(n/√p) log p.
  • 42. Matrix-Vector Multiplication: 2-D Partitioning • Scalability Analysis: T0 = pTP – W = tsp log p + twn√p log p. • Equating T0 with W, term by term, for isoefficiency, we have W = K²tw²p log² p as the dominant term. • The isoefficiency due to concurrency is O(p). • The overall isoefficiency is O(p log² p) (due to the network bandwidth). • For cost optimality, we have n² = Ω(p log² p). For this, we have p = O(n²/log² n).
  • 43. Matrix-Matrix Multiplication • Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C =A x B. • The serial complexity is O(n3 ). • We do not consider better serial algorithms (Strassen's method), although, these can be used as serial kernels in the parallel algorithms. • A useful concept in this case is called block operations. In this view, an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix. • In this view, we perform q3 matrix multiplications, each involving (n/q) x (n/q) matrices.
  • 44. Matrix-Matrix Multiplication • Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each. • Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. • Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. • All-to-all broadcast blocks of A along rows and B along columns. • Perform local submatrix multiplication.
  • 45. Matrix-Matrix Multiplication • The two broadcasts take time 2(ts log √p + tw(n²/p)(√p – 1)). • The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices. • The parallel run time is approximately TP = n³/p + ts log p + 2twn²/√p. • The algorithm is cost optimal and the isoefficiency is O(p1.5) due to the bandwidth term tw and concurrency. • The major drawback of the algorithm is that it is not memory optimal.
  • 46. Matrix-Matrix Multiplication: Cannon's Algorithm • In this algorithm, we schedule the computations of the processes of the ith row such that, at any given time, each process is using a different block Ai,k. • These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.
  • 47. Matrix-Matrix Multiplication: Cannon's Algorithm Communication steps in Cannon's algorithm on 16 processes.
  • 48. Matrix-Matrix Multiplication: Cannon's Algorithm • Align the blocks of A and B in such a way that each process multiplies its local submatrices. This is done by shifting all submatrices Ai,j to the left (with wraparound) by i steps and all submatrices Bi,j up (with wraparound) by j steps. • Perform local block multiplication. • Each block of A moves one step left and each block of B moves one step up (again with wraparound). • Perform next block multiplication, add to partial result, repeat until all blocks have been multiplied.
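The sketch below (added for illustration, simulated sequentially with NumPy blocks rather than on a real process grid) follows the steps just listed: an initial alignment that shifts row i of A left by i and column j of B up by j, then q = √p rounds of local block multiply-accumulate followed by single-step wraparound shifts.

```python
import numpy as np

# Sketch of Cannon's algorithm on a q x q process grid (q = sqrt(p)),
# simulated sequentially; block size is (n/q) x (n/q).
def cannon_matmul(A, B, q):
    n = A.shape[0]
    b = n // q
    blk = lambda M, i, j: M[i * b:(i + 1) * b, j * b:(j + 1) * b].copy()
    Ab = [[blk(A, i, j) for j in range(q)] for i in range(q)]
    Bb = [[blk(B, i, j) for j in range(q)] for i in range(q)]
    # initial alignment: shift row i of A left by i, column j of B up by j
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]     # local block multiply-accumulate
        # single-step shifts with wraparound: A left by one, B up by one
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A, B = rng.standard_normal((6, 6)), rng.standard_normal((6, 6))
    assert np.allclose(cannon_matmul(A, B, 3), A @ B)
```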
  • 49. Matrix-Matrix Multiplication: Cannon's Algorithm • In the alignment step, since the maximum distance over which a block shifts is √p – 1, the two shift operations require a total of 2(ts + twn²/p) time. • Each of the single-step shifts in the compute-and-shift phase of the algorithm takes ts + twn²/p time. • The computation time for the √p multiplications of (n/√p) x (n/√p) matrices is n³/p. • The parallel time is approximately TP = n³/p + 2√p ts + 2twn²/√p. • The cost-optimality and isoefficiency of the algorithm are identical to those of the first algorithm, except that this one is memory optimal.
  • 50. Matrix-Matrix Multiplication: DNS Algorithm • Uses a 3-D partitioning. • Visualize the matrix multiplication algorithm as a cube: matrices A and B come in on two orthogonal faces and the result C comes out of the other orthogonal face. • Each internal node in the cube represents a single add-multiply operation (and thus the Θ(n³) complexity). • The DNS algorithm partitions this cube using a 3-D block scheme.
  • 51. Matrix-Matrix Multiplication: DNS Algorithm • Assume an n x n x n mesh of processors. • Move the columns of A and rows of B and perform broadcast. • Each processor computes a single add-multiply. • This is followed by an accumulation along the C dimension. • Since each add-multiply takes constant time and the accumulation and broadcast take log n time, the total runtime is Θ(log n). • This is not cost optimal. It can be made cost optimal by using n/log n processors along the direction of accumulation.
  • 52. Matrix-Matrix Multiplication: DNS Algorithm The communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes.
  • 53. Matrix-Matrix Multiplication: DNS Algorithm Using fewer than n3 processors. • Assume that the number of processes p is equal to q3 for some q < n. • The two matrices are partitioned into blocks of size (n/q) x(n/q). • Each matrix can thus be regarded as a q x q two-dimensional square array of blocks. • The algorithm follows from the previous one, except, in this case, we operate on blocks rather than on individual elements.
  • 54. Matrix-Matrix Multiplication: DNS Algorithm Using fewer than n³ processors. • The first one-to-one communication step is performed for both A and B, and takes ts + tw(n/q)² time for each matrix. • The two one-to-all broadcasts take (ts + tw(n/q)²) log q time for each matrix. • The reduction takes (ts + tw(n/q)²) log q time. • Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time. • The parallel time is approximated by TP = n³/p + (ts + twn²/p^(2/3)) log p. • The isoefficiency function is Θ(p log³ p).
  • 55. Solving a System of Linear Equations • Consider the problem of solving linear equations of the kind: a0,0x0 + a0,1x1 + … + a0,n-1xn-1 = b0, a1,0x0 + a1,1x1 + … + a1,n-1xn-1 = b1, …, an-1,0x0 + an-1,1x1 + … + an-1,n-1xn-1 = bn-1. • This is written as Ax = b, where A is an n x n matrix with A[i, j] = ai,j, b is an n x 1 vector [b0, b1, …, bn-1]T, and x is the vector of unknowns.
  • 56. Solving a System of Linear Equations Two steps in the solution are: reduction to triangular form, and back-substitution. The triangular form is an upper-triangular system with a unit diagonal, i.e., xi + ui,i+1xi+1 + … + ui,n-1xn-1 = yi for 0 ≤ i < n. We write this as Ux = y. A commonly used method for transforming a given matrix into an upper-triangular matrix is Gaussian elimination.
  • 58. Gaussian Elimination • The computation has three nested loops - in the kth iteration of the outer loop, the algorithm performs (n-k)² computations. Summing from k = 1..n, we have roughly n³/3 multiplications-subtractions. A typical computation in Gaussian elimination.
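As a concrete reference point (an added sketch, not the deck's own pseudocode), the serial elimination loop can be written as below: the division step normalizes row k, and the elimination step subtracts multiples of row k from the rows below it, yielding the unit upper-triangular factor U and the transformed right-hand side y.

```python
import numpy as np

# Serial Gaussian elimination sketch (no pivoting), reducing A to the unit
# upper-triangular factor U and transforming b into y, so that Ux = y.
def gaussian_eliminate(A, b):
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = A.shape[0]
    for k in range(n):
        # division (normalization) step: scale row k so that A[k, k] == 1
        b[k] /= A[k, k]
        A[k, k + 1:] /= A[k, k]
        A[k, k] = 1.0
        # elimination step: subtract multiples of row k from the rows below it
        for i in range(k + 1, n):
            b[i] -= A[i, k] * b[k]
            A[i, k + 1:] -= A[i, k] * A[k, k + 1:]
            A[i, k] = 0.0
    return A, b   # U (unit upper triangular) and y

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((5, 5)) + 5 * np.eye(5)   # keeps pivots away from zero
    b = rng.standard_normal(5)
    U, y = gaussian_eliminate(A, b)
    # row operations preserve the solution: Ux = y has the same solution as Ax = b
    assert np.allclose(np.linalg.solve(A, b), np.linalg.solve(U, y))
```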
  • 59. Parallel Gaussian Elimination • Assume p = n with each row assigned to a processor. • The first step of the algorithm normalizes the row. This is a serial operation and takes time (n-k) in the kth iteration. • In the second step, the normalized row is broadcast to all the processors. This takes time (ts + tw(n-k-1)) log n. • Each processor can independently eliminate this row from its own. This requires (n-k-1) multiplications and subtractions. • The total parallel time can be computed by summing from k = 1 … n-1 as TP = 3/2 n(n-1) + tsn log n + 1/2 twn(n-1) log n. • The formulation is not cost optimal because of the tw term.
  • 60. Parallel Gaussian Elimination Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 x 8 matrix partitioned rowwise among eight processes.
  • 61. Parallel Gaussian Elimination: Pipelined Execution • In the previous formulation, the (k+1)st iteration starts only after all the computation and communication for the kth iteration is complete. • In the pipelined version, there are three steps - normalization of a row, communication, and elimination. These steps are performed in an asynchronous fashion. • A processor Pk waits to receive and eliminate all rows prior to k. • Once it has done this, it forwards its own row to processor Pk+1.
  • 62. Parallel Gaussian Elimination: Pipelined Execution Pipelined Gaussian elimination on a 5 x 5 matrix partitioned with one row per process.
  • 63. Parallel Gaussian Elimination: Pipelined Execution • The total number of steps in the entire pipelined procedure is Θ(n). • In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row. • The parallel time is therefore O(n2 ) . • This is cost optimal.
  • 64. Parallel Gaussian Elimination: Pipelined Execution The communication in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix distributed among four processes using block 1-D partitioning.
  • 65. Parallel Gaussian Elimination: Block 1D with p < n • The above algorithm can be easily adapted to the case when p < n. • In the kth iteration, a processor with all rows belonging to the active part of the matrix performs (n – k – 1)n/p multiplications and subtractions. • In the pipelined version, for n > p, computation dominates communication. • The parallel time is approximately n³/p. • While the algorithm is cost optimal, the cost of the parallel algorithm is higher than the sequential run time by a factor of 3/2.
  • 66. Parallel Gaussian Elimination: Block 1D with p < n Computation load on different processes in block and cyclic 1-D partitioning of an 8 x 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3.
  • 67. Parallel Gaussian Elimination: Block 1D with p < n • The load imbalance problem can be alleviated by using a cyclic mapping. • In this case, other than processing of the last p rows, there is no load imbalance. • This corresponds to a cumulative load imbalance overhead of O(n2 p) (instead of O(n3 ) in the previous case).
  • 68. Gaussian Elimination with Partial Pivoting • For numerical stability, one generally uses partial pivoting. • In the kth iteration, we select a column i (called the pivot column) such that A[k, i] is the largest in magnitude among all A[k, j] with k ≤ j < n. • The kth and the ith columns are interchanged. • Simple to implement with row-partitioning and does not add overhead since the division step takes the same time as computing the max. • Column-partitioning, however, requires a global reduction, adding a log p term to the overhead. • Pivoting precludes the use of pipelining.
  • 69. Gaussian Elimination with Partial Pivoting: 2-D Partitioning • Partial pivoting restricts use of pipelining, resulting in performance loss. • This loss can be alleviated by restricting pivoting to specific columns. • Alternately, we can use faster algorithms for broadcast.
  • 70. Solving a Triangular System: Back-Substitution • The upper triangular matrix U undergoes back-substitution to determine the vector x. A serial algorithm for back-substitution.
  • 71. Solving a Triangular System: Back-Substitution • The algorithm performs approximately n2 /2 multiplications and subtractions. • Since complexity of this part is asymptotically lower, we should optimize the data distribution for the factorization part. • Consider a rowwise block 1-D mapping of the n x n matrix U with vector y distributed uniformly. • The value of the variable solved at a step can be pipelined back. • Each step of a pipelined implementation requires a constant amount of time for communication and Θ(n/p) time for computation. • The parallel run time of the entire algorithm is Θ(n2 /p).
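A short added sketch of the serial kernel follows; it pairs with the Gaussian-elimination sketch earlier and assumes the unit-diagonal upper-triangular form Ux = y produced there, solving for x from the last variable upward with roughly n²/2 operations.

```python
import numpy as np

# Serial back-substitution sketch for a unit upper-triangular system Ux = y.
def back_substitute(U, y):
    n = U.shape[0]
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):
        # x[k] = y[k] minus the contribution of the already-solved variables
        x[k] = y[k] - U[k, k + 1:] @ x[k + 1:]
    return x

if __name__ == "__main__":
    U = np.triu(np.random.default_rng(3).standard_normal((4, 4)), k=1) + np.eye(4)
    y = np.arange(4.0)
    assert np.allclose(U @ back_substitute(U, y), y)
```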
  • 72. Solving a Triangular System: Back-Substitution • If the matrix is partitioned by using 2-D partitioning on a logical mesh of processes, and the elements of the vector are distributed along one of the columns of the process mesh, then only the processes containing the vector perform any computation. • Using pipelining to communicate the appropriate elements of U to the process containing the corresponding elements of y for the substitution step (line 7), the algorithm can be executed in Θ(n²/√p) time. • While this is not cost optimal, since this does not dominate the overall computation, the cost optimality is determined by the factorization.
  • 73. Sorting Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.
  • 74. Topic Overview • Issues in Sorting on Parallel Computers • Sorting Networks • Bubble Sort and its Variants • Quicksort • Bucket and Sample Sort • Other Sorting Algorithms
  • 75. Sorting: Overview • One of the most commonly used and well-studied kernels. • Sorting can be comparison-based or noncomparison-based. • The fundamental operation of comparison-based sorting is compare- exchange. • The lower bound on any comparison-based sort of n numbers is Θ(nlog n) . • We focus here on comparison-based sorting algorithms.
  • 76. Sorting: Basics What is a parallel sorted sequence? Where are the input and output lists stored? • We assume that the input and output lists are distributed. • The sorted list is partitioned with the property that each partitioned list is sorted and each element in processor Pi's list is less than that in Pj's list if i < j.
  • 77. Sorting: Parallel Compare Exchange Operation A parallel compare-exchange operation. Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai,aj}, and Pj keeps max{ai, aj}.
  • 78. Sorting: Basics What is the parallel counterpart to a sequential comparator? • If each processor has one element, the compare-exchange operation stores the smaller element at the processor with the smaller id. This can be done in ts + tw time. • If we have more than one element per processor, we call this operation a compare-split. Assume each of two processors has n/p elements. • After the compare-split operation, the smaller n/p elements are at processor Pi and the larger n/p elements at Pj, where i < j. • The time for a compare-split operation is (ts + twn/p), assuming that the two partial lists were initially sorted.
  • 79. Sorting: Parallel Compare Split Operation A compare-split operation. Each process sends its block of size n/p to the other process. Each process merges the received block with its own block and retains only the appropriate half of the merged block. In this example, process Pi retains the smaller elements and process Pj retains the larger elements.
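For concreteness, here is a small added Python sketch of compare-split (the merged result is simply re-sorted here for brevity rather than merged in linear time): the lower-ranked process keeps the smaller half and the higher-ranked process keeps the larger half.

```python
# Sketch of the compare-split operation between processes Pi and Pj (i < j):
# each holds a block of n/p elements; Pi keeps the smaller half, Pj the larger half.
def compare_split(block_i, block_j):
    merged = sorted(block_i + block_j)      # stand-in for merging the two sorted blocks
    half = len(block_i)
    return merged[:half], merged[half:]     # (kept by Pi, kept by Pj)

if __name__ == "__main__":
    lo, hi = compare_split([1, 6, 8, 11], [2, 3, 7, 13])
    print(lo, hi)   # [1, 2, 3, 6] [7, 8, 11, 13]
```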
  • 80. Sorting Networks • Networks of comparators designed specifically for sorting. • A comparator is a device with two inputs x and y and two outputs x' and y'. For an increasing comparator, x' = min{x,y} and y' = max{x,y}; a decreasing comparator does the reverse. • We denote an increasing comparator by ⊕ and a decreasing comparator by Ө. • The speed of the network is proportional to its depth.
  • 81. Sorting Networks: Comparators A schematic representation of comparators: (a) an increasing comparator, and (b) a decreasing comparator.
  • 82. Sorting Networks A typical sorting network. Every sorting network is made up of a series of columns, and each column contains a number of comparators connected in parallel.
  • 83. Sorting Networks: Bitonic Sort • A bitonic sorting network sorts n elements in Θ(log² n) time. • A bitonic sequence has two tones - increasing and decreasing, or vice versa. Any cyclic rotation of such a sequence is also considered bitonic. • 〈1,2,4,7,6,0〉 is a bitonic sequence, because it first increases and then decreases. 〈8,9,2,1,0,4〉 is another bitonic sequence, because it is a cyclic shift of 〈0,4,8,9,2,1〉. • The kernel of the network is the rearrangement of a bitonic sequence into a sorted sequence.
  • 84. Sorting Networks: Bitonic Sort• Let s = 〈a0,a1,…,an-1〉 be a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ an/2-1 and an/2 ≥ an/2+1 ≥ ··· ≥ an-1. • Consider the following subsequences of s: s1 = 〈min{a0,an/2},min{a1,an/2+1},…,min{an/2-1,an-1}〉 s2 = 〈max{a0,an/2},max{a1,an/2+1},…,max{an/2-1,an-1}〉 (1) • Note that s1 and s2 are both bitonic and each element of s1 is less than every element in s2. • We can apply the procedure recursively on s1 and s2 to get the sorted sequence.
  • 85. Sorting Networks: Bitonic Sort Merging a 16-element bitonic sequence through a series of log 16 bitonic splits.
  • 86. Sorting Networks: Bitonic Sort• We can easily build a sorting network to implement this bitonic merge algorithm. • Such a network is called a bitonic merging network. • The network contains log n columns. Each column contains n/2 comparators and performs one step of the bitonic merge. • We denote a bitonic merging network with n inputs by ⊕BM[n]. • Replacing the ⊕ comparators by Ө comparators results in a decreasing output sequence; such a network is denoted by ӨBM[n].
  • 87. Sorting Networks: Bitonic Sort A bitonic merging network for n = 16. The input wires are numbered 0,1,…, n - 1, and the binary representation of these numbers is shown. Each column of comparators is drawn separately; the entire figure represents a ⊕BM[16] bitonic merging network. The network takes a bitonic sequence and outputs it in sorted order.
  • 88. Sorting Networks: Bitonic Sort How do we sort an unsorted sequence using a bitonic merge? • We must first build a single bitonic sequence from the given sequence. • A sequence of length 2 is a bitonic sequence. • A bitonic sequence of length 4 can be built by sorting the first two elements using ⊕BM[2] and the next two using ӨBM[2]. • This process can be repeated to generate larger bitonic sequences.
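To make the construction concrete, the following serial Python sketch (added for illustration; it mirrors the ⊕BM/ӨBM structure but does not model the network or any parallelism) builds bitonic sequences recursively and merges them with bitonic splits. It assumes the input length is a power of two.

```python
# Serial sketch of bitonic sort: recursively build bitonic sequences, then merge
# them with bitonic splits. Assumes len(a) is a power of two.
def bitonic_merge(a, ascending=True):
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    # bitonic split: pairwise min/max between the two halves
    lo = [min(a[i], a[i + half]) if ascending else max(a[i], a[i + half]) for i in range(half)]
    hi = [max(a[i], a[i + half]) if ascending else min(a[i], a[i + half]) for i in range(half)]
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

def bitonic_sort(a, ascending=True):
    n = len(a)
    if n == 1:
        return a
    # sort the first half increasing and the second half decreasing -> bitonic sequence
    first = bitonic_sort(a[:n // 2], True)
    second = bitonic_sort(a[n // 2:], False)
    return bitonic_merge(first + second, ascending)

if __name__ == "__main__":
    data = [10, 20, 5, 9, 3, 8, 12, 14, 90, 0, 60, 40, 23, 35, 95, 18]
    assert bitonic_sort(data) == sorted(data)
```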
  • 89. Sorting Networks: Bitonic Sort A schematic representation of a network that converts an input sequence into a bitonic sequence. In this example, ⊕BM[k] and ӨBM[k] denote bitonic merging networks of input size k that use ⊕ and Ө comparators, respectively. The last merging network (⊕BM[16]) sorts the input. In this example, n = 16.
  • 90. Sorting Networks: Bitonic Sort The comparator network that transforms an input sequence of 16 unordered numbers into a bitonic sequence.
  • 91. Sorting Networks: Bitonic Sort• The depth of the network is Θ(log2 n). • Each stage of the network contains n/2 comparators. A serial implementation of the network would have complexity Θ(nlog2 n).
  • 92. Mapping Bitonic Sort to Hypercubes• Consider the case of one item per processor. The question becomes one of how the wires in the bitonic network should be mapped to the hypercube interconnect. • Note from our earlier examples that the compare-exchange operation is performed between two wires only if their labels differ in exactly one bit! • This implies a direct mapping of wires to processors. All communication is nearest neighbor!
  • 93. Mapping Bitonic Sort to Hypercubes Communication during the last stage of bitonic sort. Each wire is mapped to a hypercube process; each connection represents a compare- exchange between processes.
  • 94. Mapping Bitonic Sort to Hypercubes Communication characteristics of bitonic sort on a hypercube. During each stage of the algorithm, processes communicate along the dimensions shown.
  • 95. Mapping Bitonic Sort to Hypercubes Parallel formulation of bitonic sort on a hypercube with n = 2d processes.
  • 96. Mapping Bitonic Sort to Hypercubes • During each step of the algorithm, every process performs a compare-exchange operation (single nearest neighbor communication of one word). • Since each step takes Θ(1) time, the parallel time is Tp = Θ(log2 n) (2) • This algorithm is cost optimal w.r.t. its serial counterpart, but not w.r.t. the best sorting algorithm.
  • 97. Mapping Bitonic Sort to Meshes • The connectivity of a mesh is lower than that of a hypercube, so we must expect some overhead in this mapping. • Consider the row-major shuffled mapping of wires to processors.
  • 98. Mapping Bitonic Sort to Meshes Different ways of mapping the input wires of the bitonic sorting network to a mesh of processes: (a) row-major mapping, (b) row-major snakelike mapping, and (c) row-major shuffled mapping.
  • 99. Mapping Bitonic Sort to Meshes The last stage of the bitonic sort algorithm for n = 16 on a mesh, using the row-major shuffled mapping. During each step, process pairs compare-exchange their elements. Arrows indicate the pairs of processes that perform compare-exchange operations.
  • 100. Mapping Bitonic Sort to Meshes • In the row-major shuffled mapping, wires that differ at the ith least-significant bit are mapped onto mesh processes that are 2^⌊(i-1)/2⌋ communication links away. • The total amount of communication performed by each process is ∑i=1..log n ∑j=1..i 2^⌊(j-1)/2⌋ ≈ 7√n, or Θ(√n). The total computation performed by each process is Θ(log² n). • The parallel runtime is therefore Θ(√n). • This is not cost optimal.
  • 101. Block of Elements Per Processor • Each process is assigned a block of n/p elements. • The first step is a local sort of the local block. • Each subsequent compare-exchange operation is replaced by a compare-split operation. • We can effectively view the bitonic network as having (1 + log p) (log p)/2 steps.
  • 102. Block of Elements Per Processor: Hypercube • Initially the processes sort their n/p elements (using merge sort) in time Θ((n/p)log(n/p)) and then perform Θ(log² p) compare-split steps. • The parallel run time of this formulation is TP = Θ((n/p)log(n/p)) + Θ((n/p)log² p). • Comparing to an optimal sort, the algorithm can efficiently use up to p = Θ(2^√(log n)) processes. • The isoefficiency function due to both communication and extra work is Θ(p^(log p) log² p).
  • 103. Block of Elements Per Processor: Mesh • The parallel runtime in this case is TP = Θ((n/p)log(n/p)) + Θ((n/p)log² p) + Θ(n/√p), the last term coming from the mesh communication. • This formulation can efficiently use up to p = Θ(log² n) processes. • The isoefficiency function is Θ(√p 2^√p).
  • 104. Performance of Parallel Bitonic Sort The performance of parallel formulations of bitonic sort for n elements on p processes.
  • 105. Bubble Sort and its Variants The sequential bubble sort algorithm compares and exchanges adjacent elements in the sequence to be sorted: Sequential bubble sort algorithm.
  • 106. Bubble Sort and its Variants • The complexity of bubble sort is Θ(n2 ). • Bubble sort is difficult to parallelize since the algorithm has no concurrency. • A simple variant, though, uncovers the concurrency.
  • 107. Odd-Even Transposition Sequential odd-even transposition sort algorithm.
  • 108. Odd-Even Transposition Sorting n = 8 elements, using the odd-even transposition sort algorithm. During each phase, n = 8 elements are compared.
  • 109. Odd-Even Transposition • After n phases of odd-even exchanges, the sequence is sorted. • Each phase of the algorithm (either odd or even) requires Θ(n) comparisons. • Serial complexity is Θ(n2 ).
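A minimal serial sketch of odd-even transposition sort (added for illustration; the parallel formulation discussed next replaces each comparison with a compare-exchange or compare-split between neighboring processes):

```python
# Serial odd-even transposition sort: n phases alternating between
# compare-exchanges of (even, odd) and (odd, even) index pairs.
def odd_even_transposition_sort(a):
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1      # even phase, then odd phase
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

if __name__ == "__main__":
    data = [3, 2, 3, 8, 5, 6, 4, 1]
    assert odd_even_transposition_sort(data) == sorted(data)
```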
  • 110. Parallel Odd-Even Transposition • Consider the one item per processor case. • There are n iterations, in each iteration, each processor does one compare-exchange. • The parallel run time of this formulation is Θ(n). • This is cost optimal with respect to the base serial algorithm but not the optimal one.
  • 111. Parallel Odd-Even Transposition Parallel formulation of odd-even transposition.
  • 112. Parallel Odd-Even Transposition • Consider a block of n/p elements per processor. • The first step is a local sort. • In each subsequent step, the compare-exchange operation is replaced by the compare-split operation. • The parallel run time of the formulation is TP = Θ((n/p)log(n/p)) + Θ(n), since the p compare-split phases each take Θ(n/p) computation and communication time.
  • 113. Parallel Odd-Even Transposition • The parallel formulation is cost-optimal for p = O(log n). • The isoefficiency function of this parallel formulation is Θ(p 2^p).
  • 114. Shellsort • Let n be the number of elements to be sorted and p be the number of processes. • During the first phase, processes that are far away from each other in the array compare-split their elements. • During the second phase, the algorithm switches to an odd-even transposition sort.
  • 115. Parallel Shellsort • Initially, each process sorts its block of n/p elements internally. • Each process is now paired with its corresponding process in the reverse order of the array. That is, process Pi, where i < p/2, is paired with process Pp-i-1. • A compare-split operation is performed. • The processes are split into two groups of size p/2 each and the process repeated in each group.
  • 116. Parallel Shellsort An example of the first phase of parallel shellsort on an eight-process array.
  • 117. Parallel Shellsort • Each process performs d = log p compare-split operations. • With O(p) bisection width, each communication can be performed in time Θ(n/p) for a total time of Θ((n log p)/p). • In the second phase, l odd and even phases are performed, each requiring time Θ(n/p). • The parallel run time of the algorithm is TP = Θ((n/p)log(n/p)) + Θ((n log p)/p) + Θ(ln/p).
  • 118. Quicksort • Quicksort is one of the most common sorting algorithms for sequential computers because of its simplicity, low overhead, and optimal average complexity. • Quicksort selects one of the entries in the sequence to be the pivot and divides the sequence into two - one with all elements less than the pivot and other greater. • The process is recursively applied to each of the sublists.
  • 120. Quicksort Example of the quicksort algorithm sorting a sequence of size n = 8.
  • 121. Quicksort • The performance of quicksort depends critically on the quality of the pivot. • In the best case, the pivot divides the list in such a way that the larger of the two lists does not have more than αn elements (for some constant α). • In this case, the complexity of quicksort is O(nlog n).
  • 122. Parallelizing Quicksort • Let's start with recursive decomposition - the list is partitioned serially and each of the subproblems is handled by a different processor. • The time for this algorithm is lower-bounded by Ω(n)! • Can we parallelize the partitioning step - in particular, if we can use n processors to partition a list of length n around a pivot in O(1) time, we have a winner. • This is difficult to do on real machines, though.
  • 123. Parallelizing Quicksort: PRAM Formulation• We assume a CRCW (concurrent read, concurrent write) PRAM with concurrent writes resulting in an arbitrary write succeeding. • The formulation works by creating pools of processors. Every processor is assigned to the same pool initially and has one element. • Each processor attempts to write its element to a common location (for the pool). • Each processor tries to read back the location. If the value read back is greater than the processor's value, it assigns itself to the `left' pool, else, it assigns itself to the `right' pool. • Each pool performs this operation recursively. • Note that the algorithm generates a tree of pivots. The depth of the tree is the expected parallel runtime. The average value is O(log n).
  • 124. Parallelizing Quicksort: PRAM Formulation A binary tree generated by the execution of the quicksort algorithm. Each level of the tree represents a different array-partitioning iteration. If pivot selection is optimal, then the height of the tree is Θ(log n), which is also the number of iterations.
  • 125. Parallelizing Quicksort: PRAM Formulation The execution of the PRAM algorithm on the array shown in (a).
  • 126. Parallelizing Quicksort: Shared Address Space Formulation • Consider a list of size n equally divided across p processors. • A pivot is selected by one of the processors and made known to all processors. • Each processor partitions its list into two, say Li and Ui, based on the selected pivot. • All of the Li lists are merged and all of the Ui lists are merged separately. • The set of processors is partitioned into two (in proportion of the size of lists L and U). The process is recursively applied to each of the lists.
  • 127. Shared Address Space Formulation
  • 128. Parallelizing Quicksort: Shared Address Space Formulation • The only thing we have not described is the global reorganization (merging) of local lists to form L and U. • The problem is one of determining the right location for each element in the merged list. • Each processor computes the number of elements locally less than and greater than pivot. • It computes two sum-scans to determine the starting location for its elements in the merged L and U lists. • Once it knows the starting locations, it can write its elements safely.
  • 129. Parallelizing Quicksort: Shared Address Space Formulation Efficient global rearrangement of the array.
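The added sketch below simulates this rearrangement step sequentially: each "process" partitions its local block around the pivot, and exclusive prefix sums (sum-scans) over the per-process counts give the offsets at which each process writes into the merged L and U arrays. The helper name `global_rearrange` is an assumption for this example.

```python
from itertools import accumulate

# Sketch of the scan-based global rearrangement used in the shared-address-space
# quicksort formulation, simulated sequentially over a list of per-process blocks.
def global_rearrange(blocks, pivot):
    L_parts = [[x for x in b if x <= pivot] for b in blocks]
    U_parts = [[x for x in b if x > pivot] for b in blocks]
    l_counts = [len(part) for part in L_parts]
    u_counts = [len(part) for part in U_parts]
    # exclusive prefix sums (sum-scans) of the counts give each process's write offsets
    l_offsets = [0] + list(accumulate(l_counts))[:-1]
    u_offsets = [0] + list(accumulate(u_counts))[:-1]
    L = [None] * sum(l_counts)
    U = [None] * sum(u_counts)
    for part, off in zip(L_parts, l_offsets):
        L[off:off + len(part)] = part           # each process writes independently
    for part, off in zip(U_parts, u_offsets):
        U[off:off + len(part)] = part
    return L, U

if __name__ == "__main__":
    blocks = [[7, 13, 18, 2], [17, 1, 14, 20], [6, 10, 15, 9], [3, 4, 22, 11]]
    L, U = global_rearrange(blocks, pivot=10)
    assert sorted(L + U) == sorted(sum(blocks, [])) and max(L) <= 10 < min(U)
```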
  • 130. Parallelizing Quicksort: Shared Address Space Formulation • The parallel time depends on the split and merge time, and the quality of the pivot. • The latter is an issue independent of parallelism, so we focus on the first aspect, assuming ideal pivot selection. • The algorithm executes in four steps: (i) determine and broadcast the pivot; (ii) locally rearrange the array assigned to each process; (iii) determine the locations in the globally rearranged array that the local elements will go to; and (iv) perform the global rearrangement. • The first step takes time Θ(log p), the second, Θ(n/p) , the third, Θ(log p) , and the fourth, Θ(n/p). • The overall complexity of splitting an n-element array is Θ(n/p) + Θ(log p).
  • 131. Parallelizing Quicksort: Shared Address Space Formulation • The process recurses until there are p lists, at which point the lists are sorted locally. • Therefore, the total parallel time is TP = Θ((n/p)log(n/p)) + Θ((n/p)log p) + Θ(log² p). • The corresponding isoefficiency is Θ(p log² p) due to the broadcast and scan operations.
  • 132. Parallelizing Quicksort: Message Passing Formulation • A simple message passing formulation is based on the recursive halving of the machine. • Assume that each processor in the lower half of a p processor ensemble is paired with a corresponding processor in the upper half. • A designated processor selects and broadcasts the pivot. • Each processor splits its local list into two lists, one less (Li), and other greater (Ui) than the pivot. • A processor in the low half of the machine sends its list Ui to the paired processor in the other half. The paired processor sends its list Li. • It is easy to see that after this step, all elements less than the pivot are in the low half of the machine and all elements greater than the pivot are in the high half.
  • 133. Parallelizing Quicksort: Message Passing Formulation • The above process is recursed until each processor has its own local list, which is sorted locally. • The time for a single reorganization is Θ(log p) for broadcasting the pivot element, Θ(n/p) for splitting the locally assigned portion of the array, Θ(n/p) for exchange and local reorganization. • We note that this time is identical to that of the corresponding shared address space formulation. • It is important to remember that the reorganization of elements is a bandwidth sensitive operation.
  • 134. Graph Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003
  • 135. Topic Overview • Definitions and Representation • Minimum Spanning Tree: Prim's Algorithm • Single-Source Shortest Paths: Dijkstra's Algorithm • All-Pairs Shortest Paths • Transitive Closure • Connected Components • Algorithms for Sparse Graphs
  • 136. Definitions and Representation • An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite set of edges. • An edge e ∈ E is an unordered pair (u,v), where u,v ∈ V. • In a directed graph, the edge e is an ordered pair (u,v). An edge (u,v) is incident from vertex u and is incident to vertex v. • A path from a vertex v to a vertex u is a sequence <v0,v1,v2,…,vk> of vertices where v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1,…, k-1. • The length of a path is defined as the number of edges in the path.
  • 137. Definitions and Representation a) An undirected graph and (b) a directed graph.
  • 138. Definitions and Representation • An undirected graph is connected if every pair of vertices is connected by a path. • A forest is an acyclic graph, and a tree is a connected acyclic graph. • A graph that has weights associated with each edge is called a weighted graph.
  • 139. Definitions and Representation • Graphs can be represented by their adjacency matrix or an edge (or vertex) list. • Adjacency matrices have a value ai,j = 1 if nodes i and j share an edge; 0 otherwise. In case of a weighted graph, ai,j = wi,j, the weight of the edge. • The adjacency list representation of a graph G = (V,E) consists of an array Adj[1..|V|] of lists. Each list Adj[v] is a list of all vertices adjacent to v. • For a graph with n nodes, adjacency matrices take Θ(n²) space and adjacency lists take Θ(|E|) space.
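A tiny added example contrasting the two representations on a made-up four-vertex graph (the edge set here is purely illustrative):

```python
# Adjacency-matrix and adjacency-list representations of the same small
# undirected graph with edge set {(0,1), (0,2), (1,2), (2,3)}.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

adj_matrix = [[0] * n for _ in range(n)]     # Theta(n^2) space
adj_list = {v: [] for v in range(n)}         # Theta(|E|) space
for u, v in edges:
    adj_matrix[u][v] = adj_matrix[v][u] = 1
    adj_list[u].append(v)
    adj_list[v].append(u)

print(adj_matrix)
print(adj_list)
```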
  • 140. Definitions and Representation An undirected graph and its adjacency matrix representation. An undirected graph and its adjacency list representation.
  • 141. Minimum Spanning Tree • A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G. • In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph. • A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight.
  • 142. Minimum Spanning Tree An undirected graph and its minimum spanning tree.
  • 143. Minimum Spanning Tree: Prim's Algorithm • Prim's algorithm for finding an MST is a greedy algorithm. • Start by selecting an arbitrary vertex, include it into the current MST. • Grow the current MST by inserting into it the vertex closest to one of the vertices already in current MST.
  • 144. Minimum Spanning Tree: Prim's Algorithm Prim's minimum spanning tree algorithm.
  • 145. Minimum Spanning Tree: Prim's Algorithm Prim's sequential minimum spanning tree algorithm.
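For reference, here is an added serial sketch of Prim's algorithm (not the book's pseudocode); it assumes an adjacency-matrix input with math.inf marking absent edges, and maintains the distance array d that the parallel formulation partitions across processes.

```python
import math

# Serial Prim's algorithm on an adjacency matrix W (math.inf where no edge exists).
# Returns the MST weight and the parent of each vertex.
def prim_mst(W, root=0):
    n = len(W)
    in_mst = [False] * n
    d = W[root][:]                 # d[v] = lightest edge connecting v to the current MST
    parent = [root] * n
    in_mst[root] = True
    total = 0
    for _ in range(n - 1):
        # pick the vertex outside the MST with the minimum d value
        u = min((v for v in range(n) if not in_mst[v]), key=lambda v: d[v])
        in_mst[u] = True
        total += d[u]
        # update d for the remaining vertices using the edges incident on u
        for v in range(n):
            if not in_mst[v] and W[u][v] < d[v]:
                d[v], parent[v] = W[u][v], u
    return total, parent

if __name__ == "__main__":
    inf = math.inf
    W = [[0, 1, 3, inf],
         [1, 0, 1, 4],
         [3, 1, 0, 2],
         [inf, 4, 2, 0]]
    weight, parent = prim_mst(W)
    print(weight, parent)   # MST weight 4 with edges (0,1), (1,2), (2,3)
```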
  • 146. Prim's Algorithm: Parallel Formulation • The algorithm works in n outer iterations - it is hard to execute these iterations concurrently. • The inner loop is relatively easy to parallelize. Let p be the number of processes, and let n be the number of vertices. • The adjacency matrix is partitioned in a 1-D block fashion, with distance vector d partitioned accordingly. • In each step, a processor selects the locally closest node, followed by a global reduction to select globally closest node. • This node is inserted into MST, and the choice broadcast to all processors. • Each processor updates its part of the d vector locally.
  • 147. Prim's Algorithm: Parallel Formulation The partitioning of the distance array d and the adjacency matrix A among p processes.
  • 148. Prim's Algorithm: Parallel Formulation • The cost to select the minimum entry is O(n/p + log p). • The cost of a broadcast is O(log p). • The cost of the local update of the d vector is O(n/p). • The parallel time per iteration is O(n/p + log p). • The total parallel time is given by O(n²/p + n log p). • The corresponding isoefficiency is O(p² log² p).