Parallel Processing

             Dr. Pierre Vignéras
    http://www.vigneras.name/pierre




This work is licensed under a Creative Commons
        Attribution-Share Alike 2.0 France.
                        See
http://creativecommons.org/licenses/by-sa/2.0/fr/
                     for details

                Dr. Pierre Vignéras
Text & Reference Books

        ●   Fundamentals of Parallel Processing
            Harry F. Jordan & Gita Alaghband
            Pearson Education
            ISBN: 81-7808-992-0
        ●   Parallel & Distributed Computing Handbook
            Albert Y. H. Zomaya
            McGraw-Hill Series on Computer Engineering
            ISBN: 0-07-073020-2
        ●   Parallel Programming (2nd edition)
            Barry Wilkinson and Michael Allen
            Pearson/Prentice Hall
            ISBN: 0-13-140563-2
                         Dr. Pierre Vignéras
Class, Quiz & Exam Rules
        ●   No entry after the first 5 minutes 
        ●   No exit before the end of the class
        ●   Announced Quiz
            –   At the beginning of a class
            –   Fixed timing (you may suffer if you arrive late) 
            –   Spread Out (do it quickly to save your time)
            –   Papers that are not strictly in front of you will be 
                considered as done
            –   Cheaters will get '-1' mark
            –   Your head should be in front of your paper!

                                  Dr. Pierre Vignéras
Grading Policy
                 ●   (Tentative, may be slightly modified)
                     –   Quiz: 5 %
                     –   Assignments: 5%
                     –   Project: 15%  
                     –   Mid: 25 %
                     –   Final: 50 %




                                          Dr. Pierre Vignéras
Outline
             (Important points)
●   Basics
    –   [SI|MI][SD|MD], 
    –   Interconnection Networks
    –   Language Specific Constructions
●   Parallel Complexity
    –   Size and Depth of an algorithm
    –   Speedup & Efficiency
●   SIMD Programming
    –   Matrix Addition & Multiplication
    –   Gauss Elimination

                           Dr. Pierre Vignéras
Basics


          ●   Flynn's Classification
          ●   Interconnection Networks
          ●   Language Specific Constructions


                     Dr. Pierre Vignéras     6
History
            ●   Since
                –   Computers are based on the Von Neumann 
                    Architecture 
                –   Languages are based on either Turing Machines 
                    (imperative languages), Lambda Calculus 
                    (functional languages) or Boolean Algebra (Logic 
                    Programming)
            ●   Sequential processing is the way both
                computers and languages are designed
            ●   But doing several things at once is an obvious 
                way to increase speed

                                    Dr. Pierre Vignéras
Definitions
            ●   Parallel Processing objective is execution speed 
                improvement 
            ●   The manner is by doing several things at once
            ●   The instruments are 
                –   Architecture,
                –   Languages,
                –   Algorithm
            ●   Parallel Processing is an old field
                –   but always limited by the wide sequential world for 
                    various reasons (financial, psychological, ...) 

                                     Dr. Pierre Vignéras
Flynn Architecture
                           Characterization
            ●   Instruction Stream (what to do)
                –   S: Single, M: Multiple
            ●   Data Stream (on what)
                –   I: Instruction, D: Data
            ●   Four possibilities
                –   SISD: usual computer
                –   SIMD: “Vector Computers” (Cray)
                –   MIMD: Multi-Processor Computers
                –   MISD: Not very useful (sometimes presented as
                    pipelined SIMD)

                                     Dr. Pierre Vignéras
SISD Architecture
            ●   Traditional Sequential Computer
            ●   Parallelism at the Architecture Level:
                –   Interruption
                –   IO processor
                –   Pipeline


                [Figure: a 4-stage pipeline (IF, OF, EX, STR) processing a
                 flow of instructions I1, I2, I3, I4, ...]

                                       Dr. Pierre Vignéras
Problem with pipelines
            ●   Conditional branches need complete flush
                –   The next instruction depends on the final result of 
                    a condition
            ●   Dependencies 
                –   Wait (inserting NOP in pipeline stages) until a 
                    previous result becomes available
            ●   Solutions:
                –   Branch prediction, out-of-order execution, ...
            ●   Consequences:
                –   Circuitry Complexities (cost)
                –   Performance
                                     Dr. Pierre Vignéras
Exercises: Real Life Examples
            ●   Intel: 
                    ●   Pentium 4: 
                    ●   Pentium 3 (=M=Core Duo):  
            ●   AMD
                    ●   Athlon: 
                    ●   Opteron: 
            ●   IBM
                    ●   Power 6:
                    ●   Cell:
            ●   Sun
                    ●   UltraSparc:
                    ●   T1:
                                       Dr. Pierre Vignéras
Instruction Level Parallelism
            ●   Multiple computing units that can be used
                simultaneously (scalar)
                –   Floating Point Unit
                –   Arithmetic and Logic Unit
            ●   Reordering of instructions
                –   By the compiler
                –   By the hardware
            ●   Lookahead (prefetching) & Scoreboarding 
                (resolve conflicts)
                –   Original Semantic is Guaranteed

                                      Dr. Pierre Vignéras
Reordering: Example
            ●   EXP=A+B+C+(DxExF)+G+H
                [Figure: two evaluation trees for EXP. The naive left-to-right
                 tree has height 5: 5 steps are required if the hardware can
                 do 1 add and 1 multiplication simultaneously. The reordered,
                 balanced tree has height 4: 4 steps are required if the
                 hardware can do 2 adds and 1 multiplication simultaneously.]
                                                 Dr. Pierre Vignéras
SIMD Architecture
            ●   Often (especially in scientific applications), 
                one instruction is applied on different data
                        for (int i = 0; i < n; i++) {
                            r[i]=a[i]+b[i];
                        }
            ●   Using vector operations (add, mul, div, ...)
            ●   Two versions
                –   True SIMD: n Arithmetic Units
                     ●   n high-level operations at a time
                –   Pipelined SIMD: Arithmetic Pipelined (depth n)
                     ●   n non-identical low-level operations at a time


                                            Dr. Pierre Vignéras
Real Life Examples
            ●   Cray: pipelined SIMD vector supercomputers
            ●   Modern Micro-Processor Instruction Set
                contains SIMD instructions
                –   Intel: MMX, SSE{1,2,3}
                –   AMD: 3dnow
                –   IBM: Altivec
                –   Sun: VIS
            ●   Very efficient
                –   2002: winner of the TOP500 list was the Japanese 
                    Earth Simulator, a vector supercomputer

                                    Dr. Pierre Vignéras
MIMD Architecture
            ●   Two forms
                –   Shared memory: any processor can access any 
                    memory region
                –   Distributed memory: a distinct memory is 
                    associated with each processor
            ●   Usage:
                –   Multiple sequential programs can be executed 
                    simultaneously
                –   Different parts of a program are executed 
                    simultaneously on each processor
                     ●   Need cooperation!

                                        Dr. Pierre Vignéras
Warning: Definitions
            ●   Multiprocessors: computers capable of 
                running multiple instructions streams 
                simultaneously to cooperatively execute a 
                single program
            ●   Multiprogramming: sharing of computing 
                resources by independent jobs
            ●   Multiprocessing: 
                –   running a possibly sequential program on a
                    multiprocessor (not our concern here)
                –   running a program consisting of multiple 
                    cooperating processes

                                    Dr. Pierre Vignéras
Shared vs Distributed


              [Figure: Shared Memory — several CPUs connected through a switch
               to one common Memory. Distributed Memory — each CPU has its own
               local memory M; the CPU+M nodes communicate through a switch.]


                                Dr. Pierre Vignéras
MIMD
                Cooperation/Communication
            ●   Shared memory
                –   read/write in the shared memory by hardware
                –   synchronisation with locking primitives 
                    (semaphore, monitor, etc.)
                –   Variable Latency
            ●   Distributed memory
                –   explicit message passing in software 
                     ●   synchronous, asynchronous, grouped, remote method
                         invocation
                –   Large messages may mask long and variable latency


                                        Dr. Pierre Vignéras
Interconnection Networks
                     Dr. Pierre Vignéras
“Hard” Network Metrics
            ●   Bandwidth (B): maximum number of bytes the
                network can transport per unit of time
            ●   Latency (L): transmission time of a single 
                message 
                –    4 components
                     ●   software overhead (so): msg packing/unpacking
                     ●   routing delay (rd): time to establish the route
                     ●   contention time (ct): network is busy
                     ●   channel delay (cd): msg size/bandwidth
                –   so+ct depends on program behaviour
                –   rd+cd depends on hardware
                                         Dr. Pierre Vignéras
“Soft” Network Metrics
            ●   Diameter (r): longest distance (over shortest paths)
                between any two nodes
            ●   Average Distance (da):

                        da = ( 1/(N-1) ) · Σ (d = 1 .. r) d·Nd

                where d is the distance between two nodes defined as the number of links in
                the shortest path between them, N the total number of nodes and Nd the
                number of nodes at a distance d from a given node.
            ●   Connectivity (P): the number of nodes that can 
                be reached from a given node in one hop.
            ●   Concurrency (C): the number of independent
                connections that a network can make.
                 –   bus: C = 1; linear: C = N-1 (if processors can send and
                     receive simultaneously, otherwise C = ⌊N/2⌋)
                                           Dr. Pierre Vignéras
Influence of the network
            ●   Bus
                –   low cost, 
                –   low concurrency degree: 1 message at a time
            ●   Fully connected
                –   high cost: N(N-1)/2 links
                –   high concurrency degree: N messages at a time
            ●   Star 
                –   Low cost (high availability)
                –   Exercise: Concurrency Degree?


                                     Dr. Pierre Vignéras
Parallel Responsibilities

                                   Human writes a          Parallel Algorithm
                         Human implements using a             Language
                          Compiler translates into a       Parallel Program
                Runtime/Libraries are required in a         Parallel Process
             Human & Operating System map to the               Hardware
                      The hardware is leading the          Parallel Execution

            Many actors are involved in the execution of a program (even
            sequential). Parallelism can be introduced automatically by
            compilers, runtime/libraries, the OS and the hardware. We usually
            consider explicit parallelism: when parallelism is introduced at
            the language level.

                                     Dr. Pierre Vignéras
SIMD Pseudo-Code
            ●   SIMD
                <indexed variable> = <indexed expression>, (<index range>);
                Example:
                C[i,j] = A[i,j] + B[i,j], (0 ≤ i ≤ N-1)

                    [Figure: C = A + B. For each column j (j = 0, 1, ..., N-1)
                     the SIMD statement computes C[i] = A[i] + B[i] for all i
                     simultaneously.]
                                          Dr. Pierre Vignéras
SIMD Matrix Multiplication
            ●   General Formula
                                 c_ij = Σ (k = 0 .. N-1) a_ik · b_kj

                for (i = 0; i < N; i++) {
                  // SIMD Initialisation
                  C[i,j] = 0, (0 <= j < N);
                  for (k = 0; k < N; k++) {
                     // SIMD computation
                     C[i,j] = C[i,j] + A[i,k]*B[k,j], (0 <= j < N);
                  }
                }

                                  Dr. Pierre Vignéras
SIMD C Code (GCC)
            ●   gcc vector type & operations
                   ●   Vector 8 bytes long, composed of 4 short elements
                       typedef short v4hi __attribute__ ((vector_size (8)));
                   ●   Use a union to access individual elements
                       union vec { v4hi v; short s[4]; };
                   ●   Usual operations on vector variables
                       v4hi v, w, r; r = v+w*w/v;

             ●   gcc builtin functions
                   ●   #include <mmintrin.h> // MMX only
                   ●   Compile with -mmmx
                   ●   Use the provided functions
                       _mm_setr_pi16 (short w0, short w1, short w2, short w3);
                       _mm_add_pi16 (__m64 m1, __m64 m2);
                       ...
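
                   A minimal, compilable sketch of the vector-extension
                   approach above (GCC/Clang only). The typedef and the union
                   are the ones from this slide; the main() driver and the
                   sample values are illustrative additions.

                     #include <stdio.h>

                     /* 8-byte vector holding 4 short elements */
                     typedef short v4hi __attribute__ ((vector_size (8)));

                     /* union to access the individual elements */
                     union vec { v4hi v; short s[4]; };

                     int main(void)
                     {
                         union vec a = { .s = {1, 2, 3, 4} };
                         union vec b = { .s = {10, 20, 30, 40} };
                         union vec r;

                         r.v = a.v + b.v;   /* one SIMD addition of 4 shorts */

                         for (int i = 0; i < 4; i++)
                             printf("%d ", r.s[i]);  /* prints: 11 22 33 44 */
                         printf("\n");
                         return 0;
                     }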


                                       Dr. Pierre Vignéras
MIMD Shared Memory
                         Pseudo Code
            ●   New instructions:
                –   fork <label>   Start a new process executing at 
                    <label>;
                –   join <integer> Join <integer> processes into one;
                –   shared <variable list> make the storage class of the 
                    variables shared;
                –   private <variable list> make the storage class of the 
                    variables private.
            ●   Sufficient? What about synchronization?
                –   We will focus on that later on...


                                        Dr. Pierre Vignéras
MIMD Shared Memory Matrix
                      Multiplication
            ●   private i,j,k;
                shared A[N,N], B[N,N], C[N,N], N;
                for (j = 0; j < N - 1; j++) fork DOCOL;
                // Only the original process reaches this point
                j = N-1;
                DOCOL: // Executed by N processes, one for each column
                for (i = 0; i < N; i++) {
                  C[i,j] = 0;
                  for (k = 0; k < N; k++) {
                     C[i,j] = C[i,j] + A[i,k]*B[k,j];
                  }
                }
                join N;   // Wait for the other processes


                                 Dr. Pierre Vignéras
MIMD Shared Memory C Code
            ●   Process based: 
                –   fork()/wait()
            ●   Thread Based: 
                –   Pthread library
                –   Solaris Thread library
            ●   Specific Framework: OpenMP
            ●   Exercise: read the following articles:
                –   “Imperative Concurrent Object-Oriented Languages”,
                    M. Philippsen, 1995
                –   “Threads Cannot Be Implemented As a Library”,
                    Hans-J. Boehm, 2005
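
            As a concrete illustration of the thread-based approach, here is a
            possible Pthread rendering of the fork/join matrix multiplication
            of the earlier pseudo-code slide (one thread per column; N, the
            test matrices and main() are illustrative):

              #include <pthread.h>
              #include <stdio.h>

              #define N 4

              /* Shared matrices (shared by all threads of the process) */
              static double A[N][N], B[N][N], C[N][N];

              /* Each thread computes one column of C (the DOCOL part) */
              static void *docol(void *arg)
              {
                  long j = (long) arg;              /* private column index */
                  for (int i = 0; i < N; i++) {
                      double sum = 0.0;             /* private accumulator  */
                      for (int k = 0; k < N; k++)
                          sum += A[i][k] * B[k][j];
                      C[i][j] = sum;
                  }
                  return NULL;
              }

              int main(void)
              {
                  pthread_t tid[N];

                  /* Fill A with 1s and B with the identity so that C == A */
                  for (int i = 0; i < N; i++)
                      for (int j = 0; j < N; j++) {
                          A[i][j] = 1.0;
                          B[i][j] = (i == j) ? 1.0 : 0.0;
                      }

                  /* "fork": one thread per column */
                  for (long j = 0; j < N; j++)
                      pthread_create(&tid[j], NULL, docol, (void *) j);

                  /* "join N": wait for all column threads */
                  for (int j = 0; j < N; j++)
                      pthread_join(tid[j], NULL);

                  for (int i = 0; i < N; i++) {
                      for (int j = 0; j < N; j++)
                          printf("%4.1f ", C[i][j]);
                      printf("\n");
                  }
                  return 0;
              }

            Compile with: gcc -pthread matmul.c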

                                         Dr. Pierre Vignéras
MIMD Distributed Memory
                     Pseudo Code
            ●   Basic instructions send()/receive()
            ●   Many possibilities:
                –   non-blocking send needs message buffering
                –   non-blocking receive needs failure return
                –   blocking send and receive both need termination
                    detection
            ●   Most used combinations
                     ●   blocking send and receive
                     ●   non-blocking send and blocking receive
            ●   Grouped communications
                –   scatter/gather, broadcast, multicast, ...
                                        Dr. Pierre Vignéras
MIMD Distributed Memory
                         C Code
            ●   Traditional Sockets
                –   Bad Performance
                –   Grouped communication limited
            ●   Parallel Virtual Machine (PVM)
                –   Old but good and portable
            ●   Message Passing Interface (MPI)
                –   The standard used in parallel programming  today
                –   Very efficient implementations
                     ●   Vendors of supercomputers provide their own highly 
                         optimized MPI implementation
                     ●   Open-source implementations available
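
            A minimal MPI point-to-point sketch (not from the slides; the rank
            numbers, tag and integer payload are arbitrary) showing the
            blocking send/receive combination:

              #include <mpi.h>
              #include <stdio.h>

              /* rank 0 sends an integer to rank 1 */
              int main(int argc, char **argv)
              {
                  int rank, value;

                  MPI_Init(&argc, &argv);
                  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

                  if (rank == 0) {
                      value = 42;
                      MPI_Send(&value, 1, MPI_INT, 1, 0,
                               MPI_COMM_WORLD);            /* blocking send */
                  } else if (rank == 1) {
                      MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE);         /* blocking receive */
                      printf("rank 1 received %d\n", value);
                  }

                  MPI_Finalize();
                  return 0;
              }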
                                        Dr. Pierre Vignéras
Performance


          ●   Size and Depth
          ●   Speedup and Efficiency




                   Dr. Pierre Vignéras   34
Parallel Program
                                 Characteristics
                 ●   s = V[0];
                     for (i = 1; i < N; i++) {
                       s = s + V[i];
                     }
                     [Figure: data dependence graph of the sequential sum — a
                      chain of + nodes, each adding the next V[i] to the
                      running sum s]
                     ●   Size = N-1 (number of operations)
                     ●   Depth = N-1 (number of operations in the longest path
                         from any input to any output)
                                           Dr. Pierre Vignéras                             35
Parallel Program
                                        Characteristics
                 [Figure: balanced binary addition tree over V[0] ... V[N-1],
                  producing s at the root]
                 ●   Size = N-1
                 ●   Depth = ⌈log2 N⌉

                                                       Dr. Pierre Vignéras                  36
Parallel Program
                                    Characteristics
                 ●   Size and Depth do not take into account
                     many things
                     –   memory operation (allocation, indexing, ...)
                     –   communication between processes
                           ●   read/write for shared memory MIMD
                           ●   send/receive for distributed memory MIMD
                      –   Load balancing when the number of processors is
                          less than the number of parallel tasks
                      –   etc...
                 ●   Inherent characteristics of a parallel algorithm,
                     independent of any specific architecture
                                            Dr. Pierre Vignéras          37
Prefix Problem
                 ●    for (i = 1; i < N; i++) {
                        V[i] = V[i-1] + V[i];
                      }
                     [Figure: dependence chain of the sequential prefix
                      computation over V[0] ... V[N-1]]
                     ●   Size = N-1
                     ●   Depth = N-1
                     How to make this program parallel?
                                          Dr. Pierre Vignéras                        38
Prefix Solution 1
Upper/Lower Prefix Algorithm
●    Divide and Conquer
      –    Divide a problem of size N into 2 equivalent 
           problems of size N/2
     [Figure: split V[0..N-1] into an upper ⌈N/2⌉-element prefix and a lower
      ⌊N/2⌋-element prefix, apply the same algorithm to each half (when N=2,
      use the direct sequential implementation), then add the last result of
      the upper half to every element of the lower half.]

     Size has increased!  N-2+N/2 > N-1  (N>2)
     Depth = ⌈log2 N⌉
     Size (N=2^k) = k·2^(k-1) = (N/2)·log2 N
                              Dr. Pierre Vignéras                           39
                     Analysis of P^ul
●   Proof by induction
●   Only for N a power of 2
    –   Assume from construction that the size increases
        smoothly with N
●   Example: N=1,048,576=220
    –   Sequential Prefix Algorithm:
         ●   1,048,575 operations in 1,048,575 time unit steps
    –   Upper/Lower Prefix Algorithm 
         ●   10,485,760 operations in 20 time unit steps




                             Dr. Pierre Vignéras                 40
Other Prefix Parallel
                  Algorithms
●   P^ul requires N/2 processors available!
    –   In our example, 524,288 processors!
●   Other algorithms (exercise: study them)
    –    Odd/Even Prefix Algorithm
         ●   Size = 2N - log2(N) - 2  (N=2^k)
         ●   Depth = 2·log2(N) - 2
    –    Ladner and Fischer's Parallel Prefix
         ●   Size = 4N - 4.96·N^0.69 + 1
         ●   Depth = log2(N)
●   Exercise: implement them in C (a sequential sketch of the
    upper/lower algorithm follows)
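
    A possible sequential C rendering of the upper/lower prefix algorithm; on
    a parallel machine the two recursive calls and the combine loop would run
    concurrently, here they are ordinary calls so the result can be checked on
    one processor (function name and driver are illustrative):

      #include <stdio.h>

      /* Upper/Lower prefix algorithm (P^ul), sequential simulation */
      static void prefix_ul(int *v, int n)
      {
          if (n < 2)
              return;
          if (n == 2) {                   /* base case: direct sequential prefix */
              v[1] += v[0];
              return;
          }
          int half = (n + 1) / 2;         /* upper part gets ceil(n/2) elements  */
          prefix_ul(v, half);             /* prefix of the first half            */
          prefix_ul(v + half, n - half);  /* prefix of the second half           */
          /* combine: add the last prefix of the first half to every element
           * of the second half (the N/2 parallel additions of the figure)   */
          for (int i = half; i < n; i++)
              v[i] += v[half - 1];
      }

      int main(void)
      {
          int v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
          prefix_ul(v, 8);
          for (int i = 0; i < 8; i++)
              printf("%d ", v[i]);        /* prints: 1 3 6 10 15 21 28 36 */
          printf("\n");
          return 0;
      }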
                              Dr. Pierre Vignéras   41
Benchmarks
         (Inspired by Pr. Ishfaq Ahmad Slides)

●   A benchmark is a performance testing program 
    –   supposed to capture processing and data 
        movement characteristics of a class of applications.
●   Used to measure and to predict the performance
    of computer systems, and to reveal their
    architectural weaknesses and strong points.
●   Benchmark suite: set of benchmark programs.
●   Benchmark family: set of benchmark suites.
●   Classification: 
    –   scientific computing, commercial, applications, 
        network services, multimedia applications, ...

                         Dr. Pierre Vignéras               42
Benchmark Examples

      Type               Name        Measuring
  Micro-Benchmark        LINPACK     Numerical computing (linear algebra)
                         LMBENCH     System calls and data movement
                                     operations in Unix
                         STREAM      Memory bandwidth
  Macro-Benchmark        NAS         Parallel computing (CFD)
                         PARKBENCH   Parallel computing
                         SPEC        A mixed benchmark family
                         Splash      Parallel computing
                         STAP        Signal processing
                         TPC         Commercial applications


                      Dr. Pierre Vignéras                    43
SPEC Benchmark Family
●   Standard Performance Evaluation Corporation
    –   Started with benchmarks that measure CPU 
        performance
    –   Has extended to client/server computing, 
        commercial applications, I/O subsystems, etc.
●   Visit http://www.spec.org/ for more
    information




                        Dr. Pierre Vignéras             44
Performance Metrics
●   Execution time 
    –   Generally in seconds
    –   Real time, User Time, System time?
         ●   Real time
●   Instruction Count
    –   Generally in MIPS or BIPS
    –   Dynamic: number of executed instructions, not the 
        number of assembly line code!
●   Number of Floating Point Operations Executed
    –   Mflop/s, Gflop/s
    –   For scientific applications where flop dominates
                           Dr. Pierre Vignéras             45
Memory Performance
●   Three parameters:
    –   Capacity,
    –   Latency,
    –   Bandwidth




                    Dr. Pierre Vignéras   46
Parallelism and Interaction
             Overhead
●   Time to execute a parallel program is
             T = Tcomp+Tpar+Tinter
    –   Tcomp: time of the effective computation
    –   Tpar: parallel overhead
            –   Process management (creation, termination, context switching)
            –   Grouping operations (creation, destruction of group)
            –   Process Inquiry operation (Id, Group Id, Group size)
    –   Tinter: interaction overhead
            –   Synchronization (locks, barrier, events, ...)
            –   Aggregation (reduction and scan)
            –   Communication (p2p, collective, read/write shared variables)

                               Dr. Pierre Vignéras                             47
Measuring Latency
               Ping-Pong
for (i=0; i < Runs; i++) {
  if (my_node_id == 0) {      /* sender */
     tmp = Second();
     start_time = Second();
     send an m-byte message to node 1;
     receive an m-byte message from node 1;
     end_time = Second();
     timer_overhead = start_time - tmp;
     total_time = end_time - start_time - timer_overhead;
     communication_time[i] = total_time / 2;
  } else if (my_node_id == 1) { /* receiver */
     receive an m-byte message from node 0;
     send an m-byte message to node 0;
  }
}
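
The same ping-pong scheme written with MPI, as a hedged sketch (RUNS, the
message size M and the output format are illustrative choices):

  #include <mpi.h>
  #include <stdio.h>

  #define RUNS 100
  #define M    1024                     /* message size in bytes */

  int main(int argc, char **argv)
  {
      char buf[M] = {0};
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (int i = 0; i < RUNS; i++) {
          if (rank == 0) {              /* sender: time the round trip */
              double start = MPI_Wtime();
              MPI_Send(buf, M, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, M, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              double t = (MPI_Wtime() - start) / 2.0;
              printf("run %d: one-way time %g s\n", i, t);
          } else if (rank == 1) {       /* receiver: echo the message */
              MPI_Recv(buf, M, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(buf, M, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }

      MPI_Finalize();
      return 0;
  }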



                        Dr. Pierre Vignéras                 48
Measuring Latency
      Hot-Potato (fire-brigade)
              Method
●   n nodes are involved (instead of just two)
●   Node 0 sends an m-byte message to node 1
●   On receipt, node 1 immediately sends the
    same message to node 2 and so on.
●   Finally node n-1 sends the message back to
    node 0
●   The total time is divided by n to get the point-
    to-point average communication time.




                     Dr. Pierre Vignéras               49
Collective Communication
           Performance
for (i = 0; i < Runs; i++) {
  barrier synchronization;
  tmp = Second();
  start_time = Second();
  for (j = 0; j < Iterations; j++)
    the_collective_routine_being_measured;
  end_time = Second();
  timer_overhead = start_time - tmp;
  total_time = end_time - start_time - timer_overhead;
  local_time = total_time / Iterations;
  communication_time[i] = max {all n local_time};
}



                   How to get all n local_time values?

                      Dr. Pierre Vignéras                50
Point-to-Point
        Communication Overhead
●   Linear function of the message length m (in 
    bytes)
    –    t(m) = t0 + m/r∞
    –   t0: start-up time
    –   r∞: asymptotic bandwidth
           ●   Ideal Situation!
           ●   Many things are left out
               ●   Topology metrics
               ●   Network Contention
           [Figure: t(m) plotted against m — a straight line with intercept t0]
                             Dr. Pierre Vignéras            51
Collective Communication
●   Broadcast: 1 process sends an m-byte message to all
    n processes
●   Multicast: 1 process sends an m-byte message to p <
    n processes
●   Gather: 1 process receives an m-byte message from each of
    the n processes (mn bytes received)
●   Scatter: 1 process sends a distinct m-byte message to
    each of the n processes (mn bytes sent)
●   Total exchange: every process sends a distinct m-byte
    message to each of the n processes (mn² bytes
    sent)
●   Circular Shift: process i sends an m-byte message to
    process i+1 (process n-1 to process 0)
                       Dr. Pierre Vignéras               52
Collective Communication
                 Overhead
●   Function of both m and n
    –   start-up latency depends only on n
        T(m,n) = t0(n) + m/r∞(n)
        ●   Many problems here! For a broadcast for example,
            how is the communication done at the link layer?
              ● Sequential? Process 1 sends one m-byte message to process 2,
                then to process 3, and so on... (n steps required)
              ● Truly parallel (1 step required)
              ● Tree based (log2(n) steps required)
        ●   Again, many things are left out
            ● Topology metrics

            ● Network Contention




                              Dr. Pierre Vignéras                      53
Speedup
                    (Problem-oriented)

●   For a given algorithm, let Tp be the time to perform
    the computation using p processors
    –   T∞: depth of the algorithm
●   How much time we gain by using a parallel 
    algorithm?
    –   Speedup: Sp=T1/Tp
    –   Issue: what is T1?
         ●   Time taken by the execution of the best sequential 
             algorithm on the same input data
         ●   We note T1=Ts

                             Dr. Pierre Vignéras
Speedup consequence

●   Best Parallel Time Achievable is: Ts/p
●   Best Speedup is bounded by Sp≤Ts/(Ts/p)=p
    –   Called Linear Speedup
●   Conditions
    –   Workload can be divided into p equal parts
    –   No overhead at all
●   Ideal situation, usually impossible to achieve!
●   Objective of a parallel algorithm
    –   Being as close to a linear speedup as possible


                         Dr. Pierre Vignéras
Biased Speedup
                (Algorithm-oriented)
●   We usually don't know what is the best 
    sequential algorithm except for some simple 
    problems (sorting, searching, ...)
●   In this case, T1
    –   is the time taken by the execution of the same 
        parallel algorithm on 1 processor.
●   Clearly biased!
    –   The parallel algorithm and the best sequential 
        algorithm are often very, very different
●   This speedup should be taken with that 
    limitation in mind!
                         Dr. Pierre Vignéras
Biased Speedup Consequence
●   Since we compare our parallel algorithm on p 
    processors with a suboptimal sequential 
    (parallel algorithm on only 1 processor) we can 
    have: Sp>p
    –   Superlinear speedup
●   Can also be the consequence of using
    –   a unique feature of the system architecture that
        favours the parallel formulation
    –   indeterminate nature of the algorithm
    –   Extra memory due to the use of p computers (less
        swap)
                         Dr. Pierre Vignéras
Efficiency
●   How much of the time are the processors actually used
    for the computation?
    –   Efficiency: Ep=Sp/p (expressed in %)
    –   Also represents how much benefits we gain by 
        using p processors
●   Bounded by 100%
    –   Superlinear speedup of course can give more than 
        100% but it is biased




                         Dr. Pierre Vignéras
Amdahl's Law: fixed problem
               size
              (1967)
●   For a problem with a workload W, we assume
    that we can divide it into two parts:
    –   W = αW + (1-α)W
        α percent of W must be executed sequentially, and the
        remaining (1-α) can be executed by n nodes simultaneously with
        100% efficiency
    –   Ts(W) = Ts(αW + (1-α)W) = Ts(αW) + Ts((1-α)W) = αTs(W) + (1-α)Ts(W)

    –   Tp(W) = Tp(αW + (1-α)W) = Ts(αW) + Tp((1-α)W) = αTs(W) + (1-α)Ts(W)/p

                         Ts                     p                  1
        S_p = ------------------------- = -------------  ---->    ---   (p -> ∞)
                αTs + (1-α)Ts/p             1 + (p-1)α              α
                            Dr. Pierre Vignéras                     59
Amdahl's Law: fixed problem
           size
  [Figure: the total sequential time Ts is split into a serial section αTs
   and parallelizable sections (1-α)Ts; with p processors the parallel part
   shrinks to (1-α)Ts/p, so Tp = αTs + (1-α)Ts/p]
                     Dr. Pierre Vignéras                                    60
Amdahl's Law Consequences



    [Figure: left — speedup S(p) versus number of processors for
     α = 5%, 10%, 20%, 40%; right — speedup S(α) versus α for p = 10 and
     p = 100. The curves flatten quickly: the larger α, the lower the
     asymptote.]
            Dr. Pierre Vignéras              61
Amdahl's Law Consequences
●   For a fixed workload, and without any
    overhead, the maximal speedup has an upper
    bound of 1/α
●   Taking into account the overhead To
    –   Performance is limited not only by the sequential 
        bottleneck but also by the average overhead
                             Ts                         p
           S_p = ------------------------------ = ---------------------
                   αTs + (1-α)Ts/p + To             1 + (p-1)α + p·To/Ts

                               1
           S_p  ---->   -------------     (p -> ∞)
                          α + To/Ts
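
           A tiny numeric illustration of the bound (a sketch; the 5% serial
           fraction and the processor counts are just examples):

             #include <stdio.h>

             /* Amdahl's law: speedup for a fixed workload, serial fraction alpha */
             static double amdahl(double alpha, int p)
             {
                 return p / (1.0 + (p - 1) * alpha);
             }

             int main(void)
             {
                 double alpha = 0.05;              /* 5% strictly sequential part */
                 int procs[] = {1, 10, 100, 1000};

                 for (int i = 0; i < 4; i++)
                     printf("p = %4d  ->  S = %5.2f\n",
                            procs[i], amdahl(alpha, procs[i]));
                 /* S approaches 1/alpha = 20 no matter how many processors are used */
                 return 0;
             }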
                         Dr. Pierre Vignéras                 62
Amdahl's Law Consequences
●   The sequential bottleneck cannot be solved 
    just by increasing the number of processors in 
    a system.
    –   Very pessimistic view of parallel processing for
        two decades.
●   Major assumption: the problem size 
    (workload) is fixed.

                    Conclusion: don't use parallel
               programming on a small-sized problem!


                         Dr. Pierre Vignéras              63
Gustafson's Law: fixed time
              (1988)
●   As the machine size increases, the problem size
    may increase with execution time unchanged
●   Sequential case: W=W+(1­)W
     –   Ts(W)=Ts(W)+(1­)Ts(W)

     –   Tp(W)=Ts(W)+(1­)Ts(W)/p

     –   To have a constant time, the parallel workload 
         should be increased
●   Parallel case (p­processor):  W'=W+p(1­)W.
     –   Ts(W')= Ts(W)+p(1­)Ts(W)
●   Assumption: sequential part is constant: it does not increase 
    with the size of the problem

                             Dr. Pierre Vignéras                     64
Fixed time speedup

        T s W '  T s W '   T s W  p 1−T s W 
S  p=           =          =
        T p W '  T s W      T s W 1−T s W 

S  p=1− p



  Fixed time speedup is a linear function of p,
           if the workload is scaled up
       to maintain a fixed execution time

                     Dr. Pierre Vignéras                    65
Gustafson's Law
           Consequences
   [Figure: left — fixed-time speedup S(p) versus p for α = 5%, 10%, 20%, 40%:
    straight lines whose slope decreases as α grows; right — S(α) for
    p = 10, 25, 50, 100.]
                   Dr. Pierre Vignéras                   66
Fixed time speedup with
        overhead
        T s W '       T s W '      1− p
S  p=            =                 =
        T p W '  T s W T o  p      T o  p
                                       1
                                          T s W 

● Best fixed-time speedup is achieved with
  a low overhead, as expected
● To is a function of the number of processors
● Quite difficult in practice to make it a
  decreasing function

                  Dr. Pierre Vignéras                67
SIMD Programming


          ●   Matrix Addition & Multiplication
          ●   Gauss Elimination




                        Dr. Pierre Vignéras      68
Matrix Vocabulary
●   A matrix is dense if most of its elements are
    non-zero
●   A matrix is sparse if a significant number of its
    elements are zero
    –   Space efficient representation available
    –   Efficient algorithms available
●   Usual matrix size in parallel programming
    –   Lower than 1000x1000 for a dense matrix
    –   Bigger than 1000x1000 for a sparse matrix
    –   Growing with the performance of computer 
        architectures
                         Dr. Pierre Vignéras         69
Matrix Addition

          C = AB , c i , j =a i , j bi , j    0≤i≤n , 0≤ j≤m

// Sequential
for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
     c[i,j] = a[i,j]+b[i,j];
  }
}

// SIMD
for (i = 0; i < n; i++) {
  c[i,j] = a[i,j]+b[i,j], (0 <= j < m);
}

Need to map the vectored addition of m elements a[i]+b[i] to the underlying
architecture made of p processing elements.

Complexity:
 – Sequential case: n.m additions
 – SIMD case: n.m/p
Speedup: S(p) = p
                          Dr. Pierre Vignéras                            70
Matrix Multiplication

          C = A·B ,   A[n,l] ; B[l,m] ; C[n,m]

          c_i,j = Σ (k = 0 .. l-1) a_i,k · b_k,j    (0 ≤ i < n, 0 ≤ j < m)

for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
     c[i,j] = 0;
     for (k = 0; k < l; k++) {
        c[i,j] = c[i,j] + a[i,k]*b[k,j];
     }
  }
}

Complexity:
 – n.m initialisations
 – n.m.l additions and multiplications
Overhead: memory indexing!
 – Use a temporary variable sum for c[i,j]! (see the sketch below)
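
A plain C version using the suggested temporary accumulator (N, the sample
matrices and main() are illustrative):

  #include <stdio.h>

  #define N 3

  /* Sequential matrix multiplication with a temporary accumulator so that
   * c[i][j] is not re-indexed on every inner iteration */
  static void matmul(const double a[N][N], const double b[N][N], double c[N][N])
  {
      for (int i = 0; i < N; i++) {
          for (int j = 0; j < N; j++) {
              double sum = 0.0;          /* temporary variable for c[i][j] */
              for (int k = 0; k < N; k++)
                  sum += a[i][k] * b[k][j];
              c[i][j] = sum;
          }
      }
  }

  int main(void)
  {
      double a[N][N]  = {{1,2,3},{4,5,6},{7,8,9}};
      double id[N][N] = {{1,0,0},{0,1,0},{0,0,1}};
      double c[N][N];

      matmul(a, id, c);                  /* c should equal a */
      for (int i = 0; i < N; i++) {
          for (int j = 0; j < N; j++)
              printf("%4.1f ", c[i][j]);
          printf("\n");
      }
      return 0;
  }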


                              Dr. Pierre Vignéras                    71
SIMD Matrix Multiplication

  for (i = 0; i < n; i++) {
    c[i,j] = 0, (0 <= j < m); // SIMD Op
    for (k = 0; k < l; k++) {
       c[i,j] = c[i,j] + a[i,k]*b[k,j], (0 <= j < m);
    }
  }

  Complexity (n = m = l for simplicity):
   – n²/p initialisations (p <= n)
   – n³/p additions and multiplications (p <= n)
  Overhead: memory indexing!
   – Use a temporary vector for c[i]!


 Algorithm in O(n) with p = n²:
 – Assign one processor to each element of C
 – Initialisation in O(1)
 – Computation in O(n)

 Algorithm in O(lg(n)) with p = n³:
 – Parallelize the inner loop!
 – Lower bound

                           Dr. Pierre Vignéras                         72
Gauss Elimination
●   Matrix (n x n) represents a system of n linear 
    equations of n unknowns
  a_{n-1,0}·x_0 + a_{n-1,1}·x_1 + a_{n-1,2}·x_2 + ... + a_{n-1,n-1}·x_{n-1} = b_{n-1}
  a_{n-2,0}·x_0 + a_{n-2,1}·x_1 + a_{n-2,2}·x_2 + ... + a_{n-2,n-1}·x_{n-1} = b_{n-2}
                                    ...
  a_{1,0}·x_0 + a_{1,1}·x_1 + a_{1,2}·x_2 + ... + a_{1,n-1}·x_{n-1} = b_1
  a_{0,0}·x_0 + a_{0,1}·x_1 + a_{0,2}·x_2 + ... + a_{0,n-1}·x_{n-1} = b_0

  Represented by the augmented matrix

           ⎡ a_{n-1,0}  a_{n-1,1}  a_{n-1,2}  ...  a_{n-1,n-1}  b_{n-1} ⎤
           ⎢ a_{n-2,0}  a_{n-2,1}  a_{n-2,2}  ...  a_{n-2,n-1}  b_{n-2} ⎥
       M = ⎢    ...        ...        ...     ...      ...       ...   ⎥
           ⎢ a_{1,0}    a_{1,1}    a_{1,2}    ...  a_{1,n-1}    b_1     ⎥
           ⎣ a_{0,0}    a_{0,1}    a_{0,2}    ...  a_{0,n-1}    b_0     ⎦
                                      Dr. Pierre Vignéras                             73
Gauss Elimination Principle
●   Transform the linear system into a triangular 
    system of equations.




    At the i-th row, each row j below is replaced by:
                      {row j} + {row i}·(-a_{j,i}/a_{i,i})




                             Dr. Pierre Vignéras         74
Gauss Elimination
          Sequential & SIMD
             Algorithms
for (i=0; i<n-1; i++) {    // for each row except the last
  for (j=i+1; j<n; j++) {  // step through subsequent rows
     m=a[j,i]/a[i,i];      // compute multiplier
     for (k=i; k<n; k++) {
        a[j,k]=a[j,k]-a[i,k]*m;
     }
  }                                  Complexity: O(n³)
}


// SIMD version                      Complexity: O(n³/p)
for (i=0; i<n-1; i++) {
  for (j=i+1; j<n; j++) {
     m=a[j,i]/a[i,i];
     a[j,k]=a[j,k]-a[i,k]*m, (i<=k<n);
  }
}
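
A small sequential C sketch of the forward elimination above (no pivoting, so
the diagonal elements are assumed non-zero; the 3x3 system is illustrative):

  #include <stdio.h>

  #define N 3

  /* Forward elimination on the augmented matrix a[N][N+1] (last column = b) */
  static void gauss_eliminate(double a[N][N + 1])
  {
      for (int i = 0; i < N - 1; i++) {          /* for each pivot row      */
          for (int j = i + 1; j < N; j++) {      /* rows below the pivot    */
              double m = a[j][i] / a[i][i];      /* multiplier              */
              for (int k = i; k <= N; k++)       /* update the whole row    */
                  a[j][k] -= a[i][k] * m;
          }
      }
  }

  int main(void)
  {
      /* 2x + y + z = 5 ; x + 3y + 2z = 10 ; x + y + 4z = 11 */
      double a[N][N + 1] = {
          {2, 1, 1,  5},
          {1, 3, 2, 10},
          {1, 1, 4, 11},
      };

      gauss_eliminate(a);

      /* The matrix is now upper triangular; back-substitution would finish
       * the solve. Print it to check. */
      for (int i = 0; i < N; i++) {
          for (int k = 0; k <= N; k++)
              printf("%6.2f ", a[i][k]);
          printf("\n");
      }
      return 0;
  }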
                      Dr. Pierre Vignéras                 75
Other SIMD Applications
●   Image Processing
●   Climate and Weather Prediction
●   Molecular Dynamics
●   Semi-Conductor Design
●   Fluid Flow Modelling
●   VLSI-Design
●   Database Information Retrieval
●   ...




                   Dr. Pierre Vignéras   76
MIMD SM Programming

           ●   Process, Threads & Fibers
           ●   Pre- & self-scheduling
           ●   Common Issues
           ●   OpenMP

                      Dr. Pierre Vignéras   77
Process, Threads, Fibers
●   Process: heavy weight
    –   Instruction Pointer, Stacks, Registers
    –   memory, file handles, sockets, device handles, and 
        windows
●   (kernel) Thread: medium weight
    –   Instruction Pointer, Stacks, Registers
    –   Known and scheduled preemptively by the kernel
●   (user thread) Fiber: light weight
    –   Instruction Pointer, Stacks, Registers
    –   Unknown by the kernel and scheduled
        cooperatively in user space
                         Dr. Pierre Vignéras             78
Multithreaded Process

 [Figure: a single-threaded process has one set of registers and one stack
  for its code, data and files; a multi-threaded process has one set of
  registers and one stack per thread, all sharing the same code, data and
  files]
                       Dr. Pierre Vignéras                            79
Kernel Thread
●   Provided and scheduled by the kernel
    –   Solaris, Linux, Mac OS X, AIX, Windows XP
    –   Incompatibilities: POSIX PThread Standard
●   Scheduling Question: Contention Scope
    –   The time slice q can be  given to:
         ●   the whole process (PTHREAD_PROCESS_SCOPE)
         ●   Each individual thread (PTHREAD_SYSTEM_SCOPE)
●   Fairness?
    –   With a PTHREAD_SYSTEM_SCOPE, the more 
        threads a process has, the more CPU it has!


                           Dr. Pierre Vignéras               80
Fibers (User Threads)
●   Implemented entirely in user space (also called
    green threads)
●   Advantages:
    –    Performance: no need to cross the kernel (context 
        switch) to schedule another thread
    –   Scheduler can be customized for the application
●   Disadvantages:
    –   Cannot directly use multi­processor
    –   Any system call blocks the entire process
         ●   Solutions: Asynchronous I/O, Scheduler Activation 
●   Systems: NetBSD, Windows XP (FiberThread)
                             Dr. Pierre Vignéras                  81
Mapping
●   Threads are expressed by the application in 
    user space.
●   1-1: any thread expressed in user space is
    mapped to a kernel thread
    –   Linux, Windows XP, Solaris > 8
●   n-1: n threads expressed in user space are
    mapped to one kernel thread
    –   Any user-space-only thread library
●   n-m: n threads expressed in user space are
    mapped to m kernel threads (n >> m)
    –   AIX, Solaris < 9, IRIX, HP-UX, TRU64 UNIX
                        Dr. Pierre Vignéras         82
Mapping
 [Figure: user-space threads mapped onto kernel threads in the 1-1, n-1 and
  n-m models]
               Dr. Pierre Vignéras                83
MIMD SM Matrix Addition
         Self-Scheduling
●   private i,j,k;
    shared A[N,N], B[N,N], C[N,N], N;
    for (i = 0; i < N - 1; i++) fork DOLINE;
    // Only the original process reaches this point
    i = N-1; // Really Required?
    DOLINE: // Executed by N processes, one for each line
    for (j = 0; j < N; j++) {
         C[i,j] = A[i,j] + B[i,j];
    }
    join N;   // Wait for the other processes

    Complexity: O(n²/p); if SIMD is also used: O(n²/(p.m)),
    with m the number of SIMD PEs.
    Issue: if n > p, self-scheduling: the OS decides who will
    do what (dynamic). In this case, overhead.

                         Dr. Pierre Vignéras                 84
MIMD SM Matrix Addition
          Pre-Scheduling
●   Basic idea: divide the whole matrix into p = s×s
    sub-matrices

    [Figure: C = A + B computed block-wise; each of the p threads adds one
     s×s sub-matrix of A to the corresponding sub-matrix of B]

              Ideal situation! No communication overhead!
             Pre-scheduling: decide statically who will do what.
                            Dr. Pierre Vignéras                   85
Parallel Primality Testing
                (from Herlihy and Shavit Slides)

●   Challenge
    –   Print primes from 1 to 10¹⁰
●   Given
    –   Ten-processor multiprocessor
    –   One thread per processor
●   Goal
    –   Get ten-fold speedup (or close)




                         Dr. Pierre Vignéras       86
Load Balancing
●   Split the work evenly
●   Each thread tests a range of 10⁹ numbers

    [Figure: the range 1 .. 10¹⁰ split into ten consecutive blocks of 10⁹
     numbers, handled by threads P0 .. P9]

             Code for thread i:
             void thread(int i) {
               for (j = i*10⁹ + 1; j < (i+1)*10⁹; j++) {
                  if (isPrime(j)) print(j);
               }
             }

                             Dr. Pierre Vignéras
Issues
●   Larger numeric ranges have fewer primes
●   Larger numbers are harder to test
●   Thread workloads
    –   Uneven
    –   Hard to predict
●   Need dynamic load balancing

    [The static splitting approach is rejected]




                          Dr. Pierre Vignéras
Shared Counter




    [Figure: a shared counter (..., 17, 18, 19, ...) from which each thread
     takes the next number to test]
    Dr. Pierre Vignéras
Procedure for Thread i


  // Global (shared) variable
  static long value = 1;

  void thread(int i) {
    long j = 0;
    while (j < 10¹⁰) {
      j = inc();
      if (isPrime(j)) print(j);
    }
  }



          Dr. Pierre Vignéras
Counter Implementation


   // Global (shared) variable
   static long value = 1;

   long inc() {
        return value++;
   }

   [OK for a uniprocessor, not for a multiprocessor]
           Dr. Pierre Vignéras
Counter Implementation


   // Global (shared) variable
   static long value = 1;

   long inc() {
        return value++;     // expands to:  temp = value;
                            //              value = value + 1;
                            //              return temp;
   }
           Dr. Pierre Vignéras
Uh-Oh

 [Figure: interleaving of two threads calling inc(). Thread A reads 1 and
  writes 2, then reads 2 and writes 3; meanwhile thread B, which also read 1,
  writes 2 last. The shared value goes 2, 3, then back to 2: an increment is
  lost.]
             Dr. Pierre Vignéras
Challenge



// Global (shared) variable
static long value = 1;

long inc() {
    temp = value;
    value = temp + 1;
    return temp;
}
        Make these steps atomic (indivisible)




       Dr. Pierre Vignéras
An Aside: C/Posix Threads API

     // Global (shared) variable
     static long value = 1;
     static pthread_mutex_t mutex;
     // somewhere before
     pthread_mutex_init(&mutex, NULL);

     long inc() {
       pthread_mutex_lock(&mutex);
        temp = value;          /* critical  */
        value = temp + 1;      /* section   */
       pthread_mutex_unlock(&mutex);
       return temp;
     }
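
      Putting the pieces together, a complete, runnable version of this
      counter (the thread count and iteration count are illustrative):

        #include <pthread.h>
        #include <stdio.h>

        #define NTHREADS 4
        #define STEPS    100000

        /* Shared counter protected by a mutex; without the lock the final
         * value would almost certainly be wrong on a multiprocessor */
        static long value = 0;
        static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

        static long inc(void)
        {
            pthread_mutex_lock(&mutex);
            long temp = value;
            value = temp + 1;
            pthread_mutex_unlock(&mutex);
            return temp;
        }

        static void *worker(void *arg)
        {
            (void) arg;
            for (int i = 0; i < STEPS; i++)
                inc();
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[NTHREADS];

            for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
            for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

            printf("value = %ld (expected %d)\n", value, NTHREADS * STEPS);
            return 0;
        }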

               Dr. Pierre Vignéras
Correctness
●   Mutual Exclusion
    –   Never two or more in critical section
    –   This is a safety property
●   No Lockout (lockout-freedom)
    –   If someone wants in, someone gets in
    –   This is a liveness property




                          Dr. Pierre Vignéras
Multithreading Issues
●   Coordination
    –   Locks, Condition Variables, Volatile
●   Cancellation (interruption)
    –   Asynchronous (immediate)
          ●   Equivalent to 'kill -KILL' but much more dangerous!!
              –   What happens to lock, open files, buffers, pointers, ...?
         ●   Avoid it as much as you can!!
    –   Deferred
         ●   Target thread should check periodically if it should be 
             cancelled
         ●   No guarantee that it will actually stop running!

                                  Dr. Pierre Vignéras                         97
General Multithreading
              Guidelines
●   Use good software engineering
    –   Always check error status
●   Always use a while(condition) loop when  
    waiting on a condition variable
●   Use a thread pool for efficiency
    –   Reduce the number of created thread
    –   Use thread-specific data so one thread can hold its
        own (private) data
●   The Linux kernel schedules tasks only, 
    –   Process == a task that shares nothing
    –   Thread == a task that shares everything
                         Dr. Pierre Vignéras               98
General Multithreading
              Guidelines
●   Give the highest priorities to I/O-bound
    threads
    –   Usually blocked, should be responsive
●   CPU-bound threads should usually have the
    lowest priority
●   Use Asynchronous I/O when possible, it is
    usually more efficient than the combination of
    threads + synchronous I/O
●   Use event-driven programming to keep
    indeterminism as low as possible
●   Use threads when you really need it!
                        Dr. Pierre Vignéras       99
OpenMP
●   Designed as a superset of common languages for
    MIMD-SM programming (C, C++, Fortran)
       ●   UNIX, Windows version
●   fork()/join() execution model
       ●   OpenMP programs always start with one thread
●   relaxed-consistency memory model
       ●   each thread is allowed to have its own temporary view of 
           the memory
●   Compiler directives in C/C++ are called 
    pragma (pragmatic information)
       ●   Preprocessor directives:
            #pragma omp <directive> [arguments]

                           Dr. Pierre Vignéras                  100
OpenMP




Dr. Pierre Vignéras   101
OpenMP
                  Thread Creation
●   #pragma omp parallel [arguments]
    <C instructions>   // Parallel region
●   A team of thread is created to execute the 
    parallel region
    –   the original thread becomes the master thread 
        (thread ID = 0 for the duration of the region)
    –   numbers of threads determined by
         ●   Arguments (if, num_threads)
         ●   Implementation (nested parallelism support, dynamic 
             adjustment)
         ●   Environment variables (OMP_NUM_THREADS)
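
    A minimal example of the construct (illustrative; compile with -fopenmp or
    the vendor's equivalent):

      #include <omp.h>
      #include <stdio.h>

      /* A team of threads is created for the parallel region;
       * each thread prints its own ID */
      int main(void)
      {
          #pragma omp parallel
          {
              printf("hello from thread %d of %d\n",
                     omp_get_thread_num(), omp_get_num_threads());
          }
          return 0;
      }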

                            Dr. Pierre Vignéras                 102
OpenMP
                    Work Sharing
●   Distributes the execution of a  parallel region 
    among team members
    –   Does not launch threads
    –   No barrier on entry
    –   Implicit barrier on exit (unless nowait specified)
●   Work­Sharing Constructs
    –   Sections 
    –   Loop
    –   Single 
    –   Workshare (Fortran)
                        Dr. Pierre Vignéras              103
OpenMP
Work Sharing - Section

double e, pi, factorial, product;
int i;
e = 1;   // Compute e using Taylor expansion
factorial = 1;
for (i = 1; i<num_steps; i++) {
    factorial *= i;
    e += 1.0/factorial;
} // e done
pi = 0; // Compute pi/4 using Taylor expansion
for (i = 0; i < num_steps*10; i++) {
    pi += 1.0/(i*4.0 + 1.0);
    pi -= 1.0/(i*4.0 + 3.0);
}                               Independent Loops
pi = pi * 4.0; // pi done           Ideal Case!
return e * pi;


               Dr. Pierre Vignéras                  104
OpenMP
       Work Sharing - Section
double e, pi, factorial, product; int i;
#pragma omp parallel sections shared(e, pi)
{
   #pragma omp section
   {  /* e section */
      e = 1; factorial = 1;
      for (i = 1; i<num_steps; i++) {
         factorial *= i; e += 1.0/factorial;
      }
   }
   #pragma omp section
   {  /* pi section */
      pi = 0;
      for (i = 0; i < num_steps*10; i++) {
         pi += 1.0/(i*4.0 + 1.0);
         pi -= 1.0/(i*4.0 + 3.0);
      }
      pi = pi * 4.0;
   }
} /* omp sections */
return e * pi;

Each thread works on private copies, except for the declared shared
variables (e, pi); the two independent sections are declared explicitly;
there is an implicit barrier on exit of the sections construct.
                       Dr. Pierre Vignéras                      105
Talyor Example Analysis
   –   OpenMP Implementation: Intel Compiler
   –   System: Gentoo/Linux-2.6.18
   –   Hardware: Intel Core Duo T2400 (1.83 GHz), L2 Cache 2 MB (SmartCache)

   Gcc:                                  Intel CC:
      Standard: 13250 ms                    Standard: 12530 ms
      Optimized (-Os -msse3                 Optimized (-O2 -xW
        -march=pentium-m                      -march=pentium4): 7040 ms
        -mfpmath=sse): 7690 ms

   OpenMP Standard: 15340 ms             Speedup: 0.6
   OpenMP Optimized: 10460 ms            Efficiency: 3%
                                         Reason: Core Duo architecture?

                       Dr. Pierre Vignéras                       106
OpenMP
        Work Sharing - Section
●   Define a set of structured blocks that are to be 
    divided among, and executed by, the threads 
    in a team
●   Each structured block is executed once by one 
    of the threads in the team
●   The method of scheduling the structured 
    blocks among threads in the team is 
    implementation defined
●   There is an implicit barrier at the end of a 
    sections construct, unless a nowait clause is 
    specified
                     Dr. Pierre Vignéras           107
OpenMP
 Work Sharing - Loop
// LU Matrix Decomposition
for (k = 0; k<SIZE-1; k++) {
   for (n = k; n<SIZE; n++) {
      col[n] = A[n][k];
   }
   for (n = k+1; n<SIZE; n++) {
     A[k][n] /= col[k];
   }
   for (n = k+1; n<SIZE; n++) {
     row[n] = A[k][n];
   }
   for (i = k+1; i<SIZE; i++) {
      for (j = k+1; j<SIZE; j++) {
         A[i][j] = A[i][j] - col[i] * row[j];
      }
   }
}


               Dr. Pierre Vignéras              108
OpenMP
            Work Sharing - Loop
// LU Matrix Decomposition
for (k = 0; k<SIZE-1; k++) {
   ... // same as before
   #pragma omp parallel for shared(A, row, col)
   for (i = k+1; i<SIZE; i++) {
      for (j = k+1; j<SIZE; j++) {
         A[i][j] = A[i][j] - col[i] * row[j];
      }
   }
}


  Private copies except for shared variables (memory model)
  Implicit barrier at the end of a loop
  Loop should be in canonical form (see spec)


                           Dr. Pierre Vignéras                       109
OpenMP
               Work Sharing - Loop
●   Specifies that the iterations of the associated 
    loop will be executed in parallel.
●   Iterations of the loop are distributed across 
    threads that already exist in the team
●   The for directive places restrictions on the 
    structure of the corresponding for­loop
    –   Canonical form (see specifications):
         ●   for (i = 0; i < N; i++) {...} is ok
          ●   for (i = f(); g(i);) {... i++} is not ok
●   There is an implicit barrier at the end of a loop 
    construct unless a nowait clause is specified.
                                 Dr. Pierre Vignéras    110
OpenMP
     Work Sharing - Reduction
●
    reduction(op: list)
    e.g: reduction(*: result)
●   A private copy is made for each variable 
    declared in 'list' for each thread of the 
    parallel region 
●   A final value is produced using the operator 
    'op' to combine all private copies
        #pragma omp parallel for reduction(*: res)
           for (i = 0; i < SIZE; i++) {
              res = res * a[i];
           }


                        Dr. Pierre Vignéras          111
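A self-contained sketch of the reduction clause (the array contents and size
are arbitrary choices for illustration):

#include <stdio.h>
#define SIZE 1000

int main(void) {
    double a[SIZE], res = 1.0;            /* 1.0: neutral element of '*' */
    for (int i = 0; i < SIZE; i++)
        a[i] = 1.0 + 1.0 / SIZE;          /* arbitrary data */

    /* Each thread multiplies into its own private copy of res;
       the copies are combined with '*' at the end of the loop. */
    #pragma omp parallel for reduction(*: res)
    for (int i = 0; i < SIZE; i++)
        res = res * a[i];

    printf("res = %f\n", res);            /* about e for this data */
    return 0;
}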
OpenMP
     Synchronisation -- Master
●   Master directive:
    #pragma omp master
    {...}
●   Define a section that is only to be executed by 
    the master thread
●   No implicit barrier: other threads in the team 
    do not wait




                     Dr. Pierre Vignéras          112
OpenMP
     Synchronisation -- Critical
●   Critical directive:
    #pragma omp critical [name]
    {...}
●   A thread waits at the beginning of a critical 
    region until no other thread is executing a 
    critical region with the same name.
●   Enforces exclusive access with respect to all 
    critical constructs with the same name in all 
    threads, not just in the current team.
●   Constructs without a name are considered to 
    have the same unspecified name. 
                     Dr. Pierre Vignéras         113
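A small sketch (not from the slides) of a named critical section protecting
updates to a shared histogram; the name hist_update and the data are
illustrative:

#include <stdio.h>

int main(void) {
    int histogram[10] = {0};

    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        int bucket = i % 10;
        /* Serialises updates to the shared array; only threads in
           critical regions with the same name are excluded. */
        #pragma omp critical(hist_update)
        histogram[bucket]++;
    }

    for (int i = 0; i < 10; i++)
        printf("%d ", histogram[i]);      /* each bucket should be 100 */
    printf("\n");
    return 0;
}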
OpenMP
        Synchronisation -- Barrier
●   Barrier directive:
    #pragma omp barrier
    –   All threads of the team must execute the barrier 
        before any are allowed to continue execution
    –   Restrictions
         ●   Each barrier region must be encountered by all threads 
             in a team or by none at all.
         ●   A barrier directive is not a C statement by itself!
              –   if (i < n)
                       #pragma omp barrier
                   // Syntax error!

                              Dr. Pierre Vignéras                 114
OpenMP
     Synchronisation -- Atomic
●   Atomic directive:
    #pragma omp atomic
    expr-stmt
       ●
           Where expr-stmt is: 
           x binop= expr, x++, ++x, x--, --x (x scalar, binop any non 
           overloaded binary operator: +,/,-,*,<<,>>,&,^,|) 
●   Only  load and store of the object designated 
    by x are atomic; evaluation of expr is not 
    atomic. 
●   Does not enforce exclusive access with respect to 
    other accesses to the same storage location x.
●   Some restrictions (see specifications)
                           Dr. Pierre Vignéras                    115
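A hedged sketch of atomic; for this simple accumulation a reduction clause
would normally be preferred, atomic is shown only to illustrate the directive:

#include <stdio.h>

int main(void) {
    long long sum = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        /* Only the read-modify-write of sum is atomic;
           the evaluation of the right-hand side is not. */
        #pragma omp atomic
        sum += i;
    }

    printf("sum = %lld\n", sum);          /* expected 499999500000 */
    return 0;
}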
OpenMP
         Synchronisation -- Flush
●   Flush directive:
    #pragma omp flush [(list)]
●   Makes a thread’s temporary view of memory 
    consistent with memory
●   Enforces an order on the memory operations 
    of the variables explicitly specified or implied.
●   Implied flush: 
    –   barrier, entry and exit of parallel region, critical
    –   Exit from work sharing region (unless nowait) 
    –   ...
●   Warning with pointers!
                          Dr. Pierre Vignéras                  116
MIMD DM Programming

Outline
           ●   Message Passing Solutions
           ●   Process Creation
           ●   Send/Receive
           ●   MPI

                      Dr. Pierre Vignéras   117
Message Passing Solutions
●   How to provide Message Passing to 
    programmers?
    –   Designing a special parallel programming language
         ●   Sun Fortress, E language, ...
    –   Extending the syntax of an existing sequential 
        language
         ●   ProActive, JavaParty
    –   Using a library
         ●   PVM, MPI, ... 




                              Dr. Pierre Vignéras         118
Requirements
●   Process creation
    –   Static: the number of processes is fixed and defined 
        at start­up (e.g. from the command line)
    –   Dynamic: processes can be created and destroyed at 
        will during the execution of the program.
●   Message Passing
    –   Send and receive efficient primitives
    –   Blocking or non-blocking
    –   Grouped Communication



                         Dr. Pierre Vignéras              119
Process Creation
Multiple Program, Multiple Data Model (MPMD)

[Diagram: one source file per processor; each source file is compiled
to suit its target processor, producing a distinct executable for
processor 0 through processor p-1.]

                   Dr. Pierre Vignéras                 120
Process Creation
Single Program, Multiple Data Model (SPMD)

[Diagram: a single source file is compiled to suit each processor,
producing one executable per processor (processor 0 through
processor p-1).]

                   Dr. Pierre Vignéras                 121
SPMD Code Sample

int pid = getPID();
if (pid == MASTER_PID) {
   execute_master_code();
} else {
   execute_slave_code();
}

        ●   Master/Slave Architecture
        ●   getPID() is provided by the parallel system (library, language, ...)
        ●   MASTER_PID should be well defined
        ●   Compilation for heterogeneous processors still required

                        Dr. Pierre Vignéras                             122
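As a concrete counterpart, here is a minimal MPI sketch of the same
master/slave SPMD pattern; MPI_Comm_rank plays the role of getPID(), and
taking rank 0 as the master is a convention of this sketch, not of MPI:

#include <stdio.h>
#include <mpi.h>

#define MASTER 0                            /* our choice of master rank */

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* "process id"        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    if (rank == MASTER)
        printf("master: coordinating %d processes\n", size);
    else
        printf("slave %d: computing my share\n", rank);

    MPI_Finalize();
    return 0;
}

The same executable is started on every processor (e.g. mpirun -np 4 ./a.out),
which is exactly the SPMD model sketched above.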
Dynamic Process Creation

Primitive required:
    spawn(location, executable)

[Diagram: Process 1 executes, calls spawn(), which starts Process 2;
both processes then continue to run concurrently.]

             Dr. Pierre Vignéras               123
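MPI-2 offers dynamic process creation in this spirit through MPI_Comm_spawn;
a hedged sketch follows (the executable name "worker" and the count of 4 are
hypothetical choices of this example):

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Comm children;
    int errcodes[4];

    MPI_Init(&argc, &argv);
    /* Start 4 instances of the (hypothetical) "worker" executable.
       'children' is an inter-communicator linking the parent to the
       spawned processes; the usual send/receive calls work on it. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, errcodes);
    MPI_Finalize();
    return 0;
}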
Send/Receive: Basic
●   Basic primitives:
    –   send(dst, msg);
    –   receive(src, msg);
●   Questions:
    –   What is the type of: src, dst?
         ●   IP:port ­­> Low performance but portable
         ●   Process ID ­­> Implementation dependent (efficient 
             non­IP implementation possible)
    –   What is the type of: msg?
         ●   Packing/Unpacking of complex messages
         ●   Encoding/Decoding for heterogeneous architectures
                            Dr. Pierre Vignéras                    124
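For comparison, a minimal MPI answer to both questions: the destination is a
rank inside a communicator (not an IP:port), and the message is described by a
buffer, a count and a datatype, which lets the library handle packing and
encoding on heterogeneous machines. A sketch to be run with two processes
(mpirun -np 2):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    double msg[3] = {1.0, 2.0, 3.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* dst = rank 1, tag = 0, typed buffer of 3 doubles */
        MPI_Send(msg, 3, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(msg, 3, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %.1f %.1f %.1f\n", msg[0], msg[1], msg[2]);
    }

    MPI_Finalize();
    return 0;
}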
Persistence and Synchronicity in Communication (3)
(from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen)

[Figure 2-22.1]
      a)   Persistent asynchronous communication
      b)   Persistent synchronous communication
Persistence and Synchronicity in Communication (4)
 (from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen)

[Figure 2-22.2]
a)   Transient asynchronous communication
b)   Receipt-based transient synchronous communication
Persistence and Synchronicity in Communication (5)
 (from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen)




a)   Delivery-based transient synchronous communication at message
     delivery
b)   Response-based transient synchronous communication
Send/Receive: Properties

  Message                  System                      Precedence
  Synchronization          Requirements                Constraints

  Send: non-blocking       Message buffering;          None, unless the message
  Receive: non-blocking    failure return from         is received successfully
                           receive

  Send: non-blocking       Message buffering;          Actions preceding the send
  Receive: blocking        termination detection       occur before those
                                                       following the send

  Send: blocking           Termination detection;      Actions preceding the send
  Receive: non-blocking    failure return from         occur before those
                           receive                     following the send

  Send: blocking           Termination detection       Actions preceding the send
  Receive: blocking                                    occur before those
                                                       following the send
                         Dr. Pierre Vignéras                          128
Receive Filtering
●   So far, we have the primitive:
    receive(src, msg);
●   What if one wants to receive a message:
    –   From any source?
    –   From a group of processes only?
●   Wildcards are used in those cases
    –   A wildcard is a special value such as:
         ●   IP: a broadcast (e.g. 192.168.255.255) or multicast address 
         ●   The API provides well-defined wildcards (e.g. 
             MPI_ANY_SOURCE)


                             Dr. Pierre Vignéras                       129
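A small sketch of wildcard receiving with MPI: the master accepts messages
from any slave in arrival order and recovers the actual sender from the status
object (the squared-rank payload is purely illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int i = 1; i < size; i++) {
            /* Wildcards: accept any source and any tag. */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from rank %d\n", value, status.MPI_SOURCE);
        }
    } else {
        value = rank * rank;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}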
The Message-Passing Interface (MPI)
 (from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen)


  Primitive                                Meaning

MPI_bsend       Append outgoing message to a local send buffer

MPI_send        Send a message and wait until copied to local or remote buffer

MPI_ssend       Send a message and wait until receipt starts

MPI_sendrecv    Send a message and wait for reply

MPI_isend       Pass reference to outgoing message, and continue

MPI_issend      Pass reference to outgoing message, and wait until receipt starts

MPI_recv        Receive a message; block if there are none

MPI_irecv       Check if there is an incoming message, but do not block


    Some of the most intuitive message-passing primitives of MPI.
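To make the blocking/non-blocking distinction of the table concrete, here is a
hedged sketch of a two-process exchange with the non-blocking variants; both
calls return immediately and the communication completes in MPI_Waitall (run
with exactly two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, sendval, recvval, other;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sendval = rank;
    other = 1 - rank;                     /* assumes exactly 2 processes */

    /* Post the send and the receive; useful work could be done here
       before waiting for both operations to complete. */
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}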
MPI Tutorial
   (by William Gropp)




See William Gropp Slides




     Dr. Pierre Vignéras   131
Parallel Strategies

Outline
          ●   Embarrassingly Parallel Computation
          ●   Partitioning

                          Dr. Pierre Vignéras       132
Embarrassingly Parallel
            Computations
●   Problems that can be divided into different 
    independent subtasks
    –   No interaction between processes
    –   No data is really shared (but copies may exist)
    –   Results of each process have to be combined in 
        some way (reduction).
●   Ideal situation!
●   Master/Slave architecture
    –   Master defines tasks and sends them to slaves 
        (statically or dynamically)


                         Dr. Pierre Vignéras              133
Embarrassingly Parallel Computations
Typical Example: Image Processing

●   Shifting, Scaling, Clipping, Rotating, ...
●   2 Partitioning Possibilities

[Diagram: a 640x480 image partitioned among processes in two ways,
e.g. by strips of rows or by strips of columns, one strip per process.]

                      Dr. Pierre Vignéras             134
Image Shift Pseudo-Code
●   Master      h=Height/slaves;
                for(i=0,row=0;i<slaves;i++,row+=h)
                   send(row, Pi);
                for(i=0; i<Height*Width;i++){
                   recv(oldrow, oldcol,
                        newrow, newcol, PANY);
                    map[newrow,newcol]=map[oldrow,oldcol];
                }


             recv(row, Pmaster)
●   Slave
             for(oldrow=row;oldrow<(row+h);oldrow++)
                for(oldcol=0;oldcol<Width;oldcol++) {
                   newrow=oldrow+delta_x;
                   newcol=oldcol+delta_y;
                   send(oldrow,oldcol,newrow,newcol,Pmaster);
                }
                      Dr. Pierre Vignéras                 135
Image Shift Pseudo-Code
              Complexity
●   Hypothesis:
    –   2 computational steps/pixel
    –   n*n pixels
    –   p processes
●   Sequential: Ts = 2n² = O(n²)
●   Parallel: T// = O(p + n²) + O(n²/p) = O(n²) (p fixed) 
    –   Communication: Tcomm = Tstartup + m.Tdata
        = p(Tstartup + 1.Tdata) + n²(Tstartup + 4.Tdata) = O(p + n²)
    –   Computation: Tcomp = 2(n²/p) = O(n²/p)

                          Dr. Pierre Vignéras              136
Image Shift Pseudo-Code
               Speedup
●   Speedup:
        Ts / Tp = 2n² / [ 2n²/p + p(Tstartup + Tdata) + n²(Tstartup + 4.Tdata) ]
●   Computation/Communication:
        (2n²/p) / [ p(Tstartup + Tdata) + n²(Tstartup + 4.Tdata) ]
        = O( (n²/p) / (p + n²) )
    –   Constant when p is fixed and n grows!
    –   Not good: the larger, the better
         ●   Computation should hide the communication
         ●   Broadcasting is better
●   This problem is better suited to Shared Memory!

                                Dr. Pierre Vignéras                                    137
Computing
the
Mandelbrot
Set
●   Problem: we consider the limit of the following 
    sequence in the complex plane:        z_0 = 0
                                          z_(k+1) = z_k² + c
●   The constant c = a.i + b is given by the point (a,b) 
    in the complex plane
●   We compute z_k for 0 < k < n
    –   if for a given k, |z_k| > 2, we know the sequence 
        diverges to infinity – we plot c in color(k)
    –   if for k == n, |z_k| < 2, we assume the sequence 
        converges and we plot c in black.

                         Dr. Pierre Vignéras                       138
Mandelbrot Set: sequential
            code
int calculate(double cr, double ci) {
        double bx = cr, by = ci;
        double xsq, ysq;
        int cnt = 0;
                             Computes the number of iterations
        while (true) {       for the given complex point (cr, ci)
            xsq = bx * bx;
            ysq = by * by;
            if (xsq + ysq >= 4.0) break;
            by = (2 * bx * by) + ci;
            bx = xsq - ysq + cr;
            cnt++;
            if (cnt >= max_iterations) break;
        }

        return cnt;
    }


                       Dr. Pierre Vignéras                     139
Mandelbrot Set: sequential
          code


   y = startDoubleY;
   for (int j = 0; j < height; j++) {
      double x = startDoubleX;
      int offset = j * width;
      for (int i = 0; i < width; i++) {
         int iter = calculate(x, y);
         pixels[i + offset] = colors[iter];
         x += dx;
      }
      y -= dy;
   }




                Dr. Pierre Vignéras           140
Mandelbrot Set:
                    parallelization
●   Static task assignment:
    –   Cut the image into p areas, and assign one to each 
        of the p processors
         ●   Problem: some areas are easier to compute (fewer 
             iterations)
    –   Assign n²/p pixels per processor in a round-robin or 
        random manner
         ●   On average, the number of iterations per processor 
             should be approximately the same
         ●   Left as an exercise



                             Dr. Pierre Vignéras                    141
Mandelbrot set: load
                  balancing
●   Number of iterations per  pixel varies
    –   Performance of processors may also vary
●   Dynamic Load Balancing
         ●   We want each CPU to be 100% busy ideally
●   Approach: work­pool, a collection of tasks
    –   Sometimes, processes can add new tasks into the 
        work­pool
    –   Problems:
         ●    How to define a task to minimize the communication 
             overhead?
         ●   When a task should be given to processes to increase the 
             computation/communication ratio?
                             Dr. Pierre Vignéras                   142
Mandelbrot set: dynamic load
         balancing
●   Task == pixel
    –   Lot of communications, Good load balancing
●   Task == row
    –   Fewer communications, less load balanced 
●   Send a task on request
    –   Simple, low communication/computation ratio
●   Send a task in advance
    –   Complex to implement, good 
        communication/computation ratio
●   Problem: Termination Detection?

                       Dr. Pierre Vignéras            143
Mandelbrot Set: parallel code
       Master Code
 count = row = 0;
 for(k = 0; k < num_procs; k++) {           Pslave is the process
    send(row, Pk,data_tag);                 which sends the last
    count++, row++;                          received message.
 }
 do {
    recv(r, pixels, PANY, result_tag);
    count--;
    if (row < HEIGHT) {
       send(row, Pslave, data_tag);
       row++, count++;
    }else send(row, Pslave, terminator_tag);
    display(r, pixels);
 }while(count > 0);



                      Dr. Pierre Vignéras                      144
Mandelbrot Set: parallel code
        Slave Code

     recv(row, PMASTER, ANYTAG, source_tag);
     while(source_tag == data_tag) {
        double y = startDoubleYFrom(row);
        double x = startDoubleX;
        for (int i = 0; i < WIDTH; i++) {
           int iter = calculate(x, y);
           pixels[i] = colors[iter];
           x += dx;
        }
        send(row, pixels, PMASTER, result_tag);
          recv(row, PMASTER, ANYTAG, source_tag);
     };




                   Dr. Pierre Vignéras              145
Mandelbrot Set: parallel code
           Analysis
    T_s ≤ Max_iter . n²
    T_p = T_comm + T_comp
    T_comm = n(t_startup + 2.t_data) + (p−1)(t_startup + 1.t_data) 
             + n(t_startup + n.t_data)
           = n².t_data + (2n + p − 1)(t_startup + t_data)
    T_comp ≤ Max_iter . n² / p

    T_s / T_p = Max_iter . n² / [ Max_iter . n²/p + n².t_data 
                + (2n + p − 1)(t_startup + t_data) ]
        (this bound is only valid when all pixels are black, i.e. when
         T_s actually reaches Max_iter . n²)

    T_comp / T_comm = (Max_iter . n²/p) / [ n².t_data 
                      + (2n + p − 1)(t_startup + t_data) ]

  ●   Speedup approaches p when Max_iter is high
  ●   Computation/Communication is in O(Max_iter)

                                      Dr. Pierre Vignéras                                       146
Partitioning
●   Basis of parallel programming
●   Steps:
    –   cut the initial problem into smaller parts
    –   solve the smaller parts in parallel
    –   combine the results into one 
●   Two ways
    –   Data partitioning
    –   Functional partitioning
●   Special case: divide and conquer
    –   subproblems are of the same form as the larger 
        problem
                         Dr. Pierre Vignéras              147
Partitioning Example:
        numerical integration
●   We want to compute:
                                     b
                                     ∫ f(x) dx
                                     a

                   Dr. Pierre Vignéras             148
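A hedged sketch of the data-partitioning idea with OpenMP: the interval [a,b]
is cut into n sub-intervals, each thread sums the function over its share, and
a reduction combines the partial sums (the integrand sin(x), the bounds and the
trapezoidal rule are arbitrary choices of this sketch):

#include <stdio.h>
#include <math.h>

static double f(double x) { return sin(x); }   /* example integrand */

int main(void) {
    const double a = 0.0, b = acos(-1.0);      /* [0, pi], for example    */
    const int n = 1000000;                     /* number of sub-intervals */
    const double h = (b - a) / n;
    double sum = 0.5 * (f(a) + f(b));          /* trapezoidal rule ends   */

    /* Data partitioning: iterations (sub-intervals) are distributed
       among the threads; partial sums are combined by the reduction. */
    #pragma omp parallel for reduction(+: sum)
    for (int i = 1; i < n; i++)
        sum += f(a + i * h);

    printf("integral ~ %f\n", sum * h);        /* ~ 2 for sin on [0, pi] */
    return 0;
}

Compile e.g. with gcc -fopenmp -lm; without OpenMP the pragma is ignored and
the same code runs sequentially.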
Quiz
Outline




          Dr. Pierre Vignéras   149
Readings
●   Describe and compare the following 
    mechanisms for initiating concurrency:
    –   Fork/Join
    –   Asynchronous Method Invocation with futures
    –   Cobegin
    –   Forall
    –   Aggregate
    –   Life Routine (aka Active Objects)



                        Dr. Pierre Vignéras           150

More Related Content

What's hot

Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
Uday Sharma
 
Parallel computing chapter 3
Parallel computing chapter 3Parallel computing chapter 3
Parallel computing chapter 3
Md. Mahedi Mahfuj
 

What's hot (20)

Operating system components
Operating system componentsOperating system components
Operating system components
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Unit II - 3 - Operating System - Process Synchronization
Unit II - 3 - Operating System - Process SynchronizationUnit II - 3 - Operating System - Process Synchronization
Unit II - 3 - Operating System - Process Synchronization
 
Multiprocessor Systems
Multiprocessor SystemsMultiprocessor Systems
Multiprocessor Systems
 
Parallel computing chapter 3
Parallel computing chapter 3Parallel computing chapter 3
Parallel computing chapter 3
 
Omega example
Omega exampleOmega example
Omega example
 
Parallel Processors (SIMD)
Parallel Processors (SIMD) Parallel Processors (SIMD)
Parallel Processors (SIMD)
 
Deadlocks in operating system
Deadlocks in operating systemDeadlocks in operating system
Deadlocks in operating system
 
unit-4-dynamic programming
unit-4-dynamic programmingunit-4-dynamic programming
unit-4-dynamic programming
 
Data Structure and Algorithm - Divide and Conquer
Data Structure and Algorithm - Divide and ConquerData Structure and Algorithm - Divide and Conquer
Data Structure and Algorithm - Divide and Conquer
 
Memory : operating system ( Btech cse )
Memory : operating system ( Btech cse )Memory : operating system ( Btech cse )
Memory : operating system ( Btech cse )
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
CS6401 OPERATING SYSTEMS Unit 2
CS6401 OPERATING SYSTEMS Unit 2CS6401 OPERATING SYSTEMS Unit 2
CS6401 OPERATING SYSTEMS Unit 2
 
Types of algorithms
Types of algorithmsTypes of algorithms
Types of algorithms
 
RTOS - Real Time Operating Systems
RTOS - Real Time Operating SystemsRTOS - Real Time Operating Systems
RTOS - Real Time Operating Systems
 
Pipelining and vector processing
Pipelining and vector processingPipelining and vector processing
Pipelining and vector processing
 
Process synchronization
Process synchronizationProcess synchronization
Process synchronization
 
Real time Scheduling in Operating System for Msc CS
Real time Scheduling in Operating System for Msc CSReal time Scheduling in Operating System for Msc CS
Real time Scheduling in Operating System for Msc CS
 
Unit 1 chapter 1 Design and Analysis of Algorithms
Unit 1   chapter 1 Design and Analysis of AlgorithmsUnit 1   chapter 1 Design and Analysis of Algorithms
Unit 1 chapter 1 Design and Analysis of Algorithms
 
Operating system 31 multiple processor scheduling
Operating system 31 multiple processor schedulingOperating system 31 multiple processor scheduling
Operating system 31 multiple processor scheduling
 

Viewers also liked

Lecture 6
Lecture  6Lecture  6
Lecture 6
Mr SMAK
 
Instruction cycle
Instruction cycleInstruction cycle
Instruction cycle
Kumar
 
Parallel architecture-programming
Parallel architecture-programmingParallel architecture-programming
Parallel architecture-programming
Shaveta Banda
 
Addressing mode
Addressing modeAddressing mode
Addressing mode
ilakkiya
 
Pipeline Mechanism
Pipeline MechanismPipeline Mechanism
Pipeline Mechanism
Ashik Iqbal
 
Parallel computing
Parallel computingParallel computing
Parallel computing
virend111
 

Viewers also liked (20)

Parallel processing Concepts
Parallel processing ConceptsParallel processing Concepts
Parallel processing Concepts
 
Introduction to parallel processing
Introduction to parallel processingIntroduction to parallel processing
Introduction to parallel processing
 
Parallel Processing
Parallel ProcessingParallel Processing
Parallel Processing
 
pipelining
pipeliningpipelining
pipelining
 
Instruction cycle
Instruction cycleInstruction cycle
Instruction cycle
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
Array Processor
Array ProcessorArray Processor
Array Processor
 
Applications of paralleL processing
Applications of paralleL processingApplications of paralleL processing
Applications of paralleL processing
 
Lecture 6
Lecture  6Lecture  6
Lecture 6
 
pipelining
pipeliningpipelining
pipelining
 
Parallel Computing
Parallel ComputingParallel Computing
Parallel Computing
 
Instruction cycle
Instruction cycleInstruction cycle
Instruction cycle
 
Instruction cycle
Instruction cycleInstruction cycle
Instruction cycle
 
Embarrassingly Parallel Computation for Occlusion Culling
Embarrassingly Parallel Computation for Occlusion CullingEmbarrassingly Parallel Computation for Occlusion Culling
Embarrassingly Parallel Computation for Occlusion Culling
 
Parallel architecture-programming
Parallel architecture-programmingParallel architecture-programming
Parallel architecture-programming
 
Introduction to parallel and distributed computation with spark
Introduction to parallel and distributed computation with sparkIntroduction to parallel and distributed computation with spark
Introduction to parallel and distributed computation with spark
 
Addressing mode
Addressing modeAddressing mode
Addressing mode
 
Parallel Computing
Parallel Computing Parallel Computing
Parallel Computing
 
Pipeline Mechanism
Pipeline MechanismPipeline Mechanism
Pipeline Mechanism
 
Parallel computing
Parallel computingParallel computing
Parallel computing
 

Similar to Parallel Processing

Data Structures and Algorithms
Data Structures and AlgorithmsData Structures and Algorithms
Data Structures and Algorithms
Pierre Vigneras
 
Creating Profiling Tools to Analyze and Optimize FiPy Presentation
Creating Profiling Tools to Analyze and Optimize FiPy PresentationCreating Profiling Tools to Analyze and Optimize FiPy Presentation
Creating Profiling Tools to Analyze and Optimize FiPy Presentation
dmurali2
 
Case Study of the Unexplained
Case Study of the UnexplainedCase Study of the Unexplained
Case Study of the Unexplained
shannomc
 
Cs4hs2008 track a-programming
Cs4hs2008 track a-programmingCs4hs2008 track a-programming
Cs4hs2008 track a-programming
Rashi Agarwal
 
Intrduction to the course v.3
Intrduction to the course v.3 Intrduction to the course v.3
Intrduction to the course v.3
Start Group
 

Similar to Parallel Processing (20)

Data Structures and Algorithms
Data Structures and AlgorithmsData Structures and Algorithms
Data Structures and Algorithms
 
Creating Profiling Tools to Analyze and Optimize FiPy Presentation
Creating Profiling Tools to Analyze and Optimize FiPy PresentationCreating Profiling Tools to Analyze and Optimize FiPy Presentation
Creating Profiling Tools to Analyze and Optimize FiPy Presentation
 
Case Study of the Unexplained
Case Study of the UnexplainedCase Study of the Unexplained
Case Study of the Unexplained
 
Cs4hs2008 track a-programming
Cs4hs2008 track a-programmingCs4hs2008 track a-programming
Cs4hs2008 track a-programming
 
Rolling Out An Enterprise Source Code Review Program
Rolling Out An Enterprise Source Code Review ProgramRolling Out An Enterprise Source Code Review Program
Rolling Out An Enterprise Source Code Review Program
 
Continuous Infrastructure First
Continuous Infrastructure FirstContinuous Infrastructure First
Continuous Infrastructure First
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
Sensible scaling
Sensible scalingSensible scaling
Sensible scaling
 
Cloud accounting software uk
Cloud accounting software ukCloud accounting software uk
Cloud accounting software uk
 
Middle Out Design
Middle Out DesignMiddle Out Design
Middle Out Design
 
Mid review presentation
Mid review presentationMid review presentation
Mid review presentation
 
Automating MySQL operations with Puppet
Automating MySQL operations with PuppetAutomating MySQL operations with Puppet
Automating MySQL operations with Puppet
 
The Joy of SciPy
The Joy of SciPyThe Joy of SciPy
The Joy of SciPy
 
Cois240 lesson01
Cois240 lesson01Cois240 lesson01
Cois240 lesson01
 
Mid review presentation
Mid review presentationMid review presentation
Mid review presentation
 
Austin Python Learners Meetup - Everything you need to know about programming...
Austin Python Learners Meetup - Everything you need to know about programming...Austin Python Learners Meetup - Everything you need to know about programming...
Austin Python Learners Meetup - Everything you need to know about programming...
 
Introduction to multicore .ppt
Introduction to multicore .pptIntroduction to multicore .ppt
Introduction to multicore .ppt
 
Intrduction to the course v.3
Intrduction to the course v.3 Intrduction to the course v.3
Intrduction to the course v.3
 
Self-healing of operational workflow incidents on distributed computing infra...
Self-healing of operational workflow incidents on distributed computing infra...Self-healing of operational workflow incidents on distributed computing infra...
Self-healing of operational workflow incidents on distributed computing infra...
 
Notes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop MapreduceNotes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop Mapreduce
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

Parallel Processing

  • 1. Parallel Processing Dr. Pierre Vignéras http://www.vigneras.name/pierre This work is licensed under a Creative Commons Attribution-Share Alike 2.0 France. See http://creativecommons.org/licenses/by-sa/2.0/fr/ for details Dr. Pierre Vignéras
  • 2. Text & Reference Books ● Fundamentals of Parallel Processing Harry F. Jordan & Gita Alaghband Pearson Education ISBN: 81-7808-992-0 Parallel & Distributed Computing Handbook ● Albert Y. H. Zomaya McGraw-Hill Series on Computer Engineering ISBN: 0-07-073020-2 Parallel Programming (2nd edition) ● Books Barry Wilkinson and Michael Allen Pearson/Prentice Hall ISBN: 0-13-140563-2 Dr. Pierre Vignéras
  • 3. Class, Quiz & Exam Rules ● No entry after the first 5 minutes  ● No exit before the end of the class ● Announced Quiz – At the beginning of a class – Fixed timing (you may suffer if you arrive late)  – Spread Out (do it quickly to save your time) – Papers that are not strictly in front of you will be  considered as done – Cheaters will get '­1' mark Rules – Your head should be in front of your paper! Dr. Pierre Vignéras
  • 4. Grading Policy ● (Temptative, can be slightly modified) – Quiz: 5 % – Assignments: 5% – Project: 15%   – Mid: 25 % Grading Policy – Final: 50 % Dr. Pierre Vignéras
  • 5. Outline (Important points) ● Basics – [SI|MI][SD|MD],  – Interconnection Networks – Language Specific Constructions ● Parallel Complexity – Size and Depth of an algorithm – Speedup & Efficiency ● SIMD Programming – Matrix Addition & Multiplication – Gauss Elimination Dr. Pierre Vignéras
  • 6. Basics ● Flynn's Classification ● Interconnection Networks Outline ● Language Specific Constructions Dr. Pierre Vignéras 6
  • 7. History ● Since – Computers are based on the Von Neumann  Architecture  – Languages are based on either Turing Machines  (imperative languages), Lambda Calculus  (functional languages) or Boolean Algebra (Logic  Programming) ● Sequential processing is the way both  I. Basics computers and languages are designed ● But doing several things at once is an obvious  way to increase speed Dr. Pierre Vignéras
  • 8. Definitions ● Parallel Processing objective is execution speed  improvement  ● The manner is by doing several things at once ● The instruments are  – Architecture, – Languages, – Algorithm I. Basics ● Parallel Processing is an old field – but always limited by the wide sequential world for  various reasons (financial, psychological, ...)  Dr. Pierre Vignéras
  • 9. Flynn Architecture Characterization ● Instruction Stream (what to do) – S: Single, M: Multiple ● Data Stream (on what) – I: Instruction, D: Data ● Four possibilities – SISD: usual computer – SIMD: “Vector Computers” (Cray) I. Basics – MIMD: Multi­Processor Computers – MISD: Not very useful (sometimes presented as  pipelined SIMD) Dr. Pierre Vignéras
  • 10. SISD Architecture ● Traditional Sequential Computer ● Parallelism at the Architecture Level: – Interruption – IO processor – Pipeline Flow of Instructions I8 I7 I6 I5 I. Basics IF OF EX STR I4 I3 I2 I1 4-stages pipeline Dr. Pierre Vignéras
  • 11. Problem with pipelines ● Conditional branches need complete flush – The next instruction depends on the final result of  a condition ● Dependencies  – Wait (inserting NOP in pipeline stages) until a  previous result becomes available ● Solutions: – Branch prediction, out­of­order execution, ... I. Basics ● Consequences: – Circuitry Complexities (cost) – Performance Dr. Pierre Vignéras
  • 12. Exercises: Real Life Examples ● Intel:  ● Pentium 4:  ● Pentium 3 (=M=Core Duo):   ● AMD ● Athlon:  ● Opteron:  ● IBM ● Power 6: I. Basics ● Cell: ● Sun ● UltraSparc: ● T1: Dr. Pierre Vignéras
  • 13. Instruction Level Parallelism ● Multiple Computing Unit that can be used  simultaneously (scalar) – Floating Point Unit – Arithmetic and Logic Unit ● Reordering of instructions – By the compiler – By the hardware I. Basics ● Lookahead (prefetching) & Scoreboarding  (resolve conflicts) – Original Semantic is Guaranteed Dr. Pierre Vignéras
  • 14. Reordering: Example ● EXP=A+B+C+(DxExF)+G+H + + + H + H Height=5 Height=4 + G + x + x + + x F + C x F A B C G D E I. Basics A B D E 4 Steps required if 5 Steps required if hardware can do 2 adds and hardware can do 1 add and 1 multiplication simultaneously 1 multiplication simultaneously Dr. Pierre Vignéras
  • 15. SIMD Architecture ● Often (especially in scientific applications),  one instruction is applied on different data    for (int i = 0; i < n; i++) { r[i]=a[i]+b[i]; } ● Using vector operations (add, mul, div, ...) ● Two versions – True SIMD: n Arithmetic Unit I. Basics ● n high­level operations at a time – Pipelines SIMD: Arithmetic Pipelined (depth n) ● n non­identical low­level operations at a time Dr. Pierre Vignéras
  • 16. Real Life Examples ● Cray: pipelined SIMD vector supercomputers ● Modern Micro­Processor Instruction Set  contains SIMD instructions – Intel: MMX, SSE{1,2,3} – AMD: 3dnow – IBM: Altivec – Sun: VSI I. Basics ● Very efficient – 2002: winner of the TOP500 list was the Japanese  Earth Simulator, a vector supercomputer Dr. Pierre Vignéras
  • 17. MIMD Architecture ● Two forms – Shared memory: any processor can access any  memory region – Distributed memory: a distinct memory is  associated with each processor ● Usage: – Multiple sequential programs can be executed  simultaneously I. Basics – Different parts of a program are executed  simultaneously on each processor ● Need cooperation! Dr. Pierre Vignéras
  • 18. Warning: Definitions ● Multiprocessors: computers capable of  running multiple instructions streams  simultaneously to cooperatively execute a  single program ● Multiprogramming: sharing of computing  resources by independent jobs ● Multiprocessing:  – running a possibly sequential program on a  I. Basics multiprocessor (not our concern here) – running a program consisting of multiple  cooperating processes Dr. Pierre Vignéras
  • 19. Shared vs Distributed M CPU CPU CPU CPU Switch M CPU Switch CPU M Memory I. Basics Shared Memory Distributed Memory Dr. Pierre Vignéras
  • 20. MIMD Cooperation/Communication ● Shared memory – read/write in the shared memory by hardware – synchronisation with locking primitives  (semaphore, monitor, etc.) – Variable Latency ● Distributed memory – explicit message passing in software  I. Basics ● synchronous, asynchronous, grouped, remote method  invocation – Large message may mask long and variable latency Dr. Pierre Vignéras
  • 21. Interconnection Networks I. Basics Dr. Pierre Vignéras
  • 22. “Hard” Network Metrics ● Bandwidth (B): maximum number of bytes the  network can transport ● Latency (L): transmission time of a single  message  –  4 components ● software overhead (so): msg packing/unpacking ● routing delay (rd): time to establish the route ● contention time (ct): network is busy I. Basics ● channel delay (cd): msg size/bandwidth – so+ct depends on program behaviour – rd+cd depends on hardware Dr. Pierre Vignéras
  • 23. “Soft” Network Metrics ● Diameter (r): longest path between two nodes r Average Distance (da): 1 ∑ d.N d ● N −1 d =1 where d is the distance between two nodes defined as the number of links in  the shortest path between them, N the total number of nodes and Nd , the  number of nodes with a distance d from a given node. ● Connectivity (P): the number of nodes that can  be reached from a given node in one hop. Concurrency (C): the number of independent  I. Basics ● connections that a network can make. – bus: C = 1; linear: C = N­1(if processor can send and  receive simultaneously, otherwise, C = ⌊N/2⌋) Dr. Pierre Vignéras
  • 24. Influence of the network ● Bus – low cost,  – low concurrency degree: 1 message at a time ● Fully connected – high cost: N(N­1)/2 links – high concurrency degree: N messages at a time ● Star  I. Basics – Low cost (high availability) – Exercice: Concurrency Degree? Dr. Pierre Vignéras
  • 25. Parallel Responsibilities Human writes a Parallel Algorithm Human implements using a Language Compiler translates into a Parallel Program Runtime/Libraries are required in a Parallel Process Human & Operating System map to the Hardware The hardware is leading the Parallel Execution Many actors are involved in the execution We usually consider I. Basics of a program (even sequential). explicit parallelism: Parallelism can be introduced when parallelism is automatically by compilers, introduced at the runtime/libraries, OS and in the hardware language level Dr. Pierre Vignéras
  • 26. SIMD Pseudo-Code ● SIMD <indexed variable> = <indexed expression>, (<index range>); Example: C[i,j] == A[i,j] + B[i,j], (0 ≤ i ≤ N-1) j j j i i i C = A + B j=0 j=1 j = N-1 I. Basics = + = + ... = + C[i] A[i] B[i] C[i] A[i] B[i] C[i] A[i] B[i] Dr. Pierre Vignéras
  • 27. SIMD Matrix Multiplication ● General Formula N −1 cij = ∑ a ik b kj k =0 for (i = 0; i < N; i++) { // SIMD Initialisation C[i,j] = 0, (0 <= j < N); for (k = 0; k < N; k++) { I. Basics // SIMD computation C[i,j]=C[i,j]+A[i,k]*B[k,j], (0 <= j < N); } } Dr. Pierre Vignéras
  • 28. SIMD C Code (GCC) ● gcc vector type & operations ● Vector of 8 bytes long, composed of 4 short elements  typedef short v4hi __attribute__ ((vector_size (8))); ● Use union to access individual elements union vec { v4hi v, short s[4]; }; ● Usual operations on vector variables v4hi v, w, r; r = v+w*w/v; ● gcc builtins functions ● #include <mmintrin.h> // MMX only I. Basics ● Compile with ­mmmx ● Use the provided functions _mm_setr_pi16 (short w0, short w1, short w2, short w3); _mm_add_pi16 (__m64 m1, __m64 m2); ... Dr. Pierre Vignéras
  • 29. MIMD Shared Memory Pseudo Code ● New instructions: – fork <label> Start a new process executing at  <label>; – join <integer> Join <integer> processes into one; – shared <variable list> make the storage class of the  variables shared; – private <variable list> make the storage class of the  variables private. I. Basics ● Sufficient? What about synchronization? – We will focus on that later on... Dr. Pierre Vignéras
  • 30. MIMD Shared Memory Matrix Multiplication ● private i,j,k; shared A[N,N], B[N,N], C[N,N], N; for (j = 0; j < N - 2; j++) fork DOCOL; // Only the Original process reach this point j = N-1; DOCOL: // Executed by N process for each column for (i = 0; i < N; i++) { C[i,j] = 0; for (k = 0; k < N; k++) { // SIMD computation C[i,j]=C[i,j]+A[i,k]*B[k,j]; I. Basics } } join N; // Wait for the other process Dr. Pierre Vignéras
  • 31. MIMD Shared Memory C Code ● Process based:  – fork()/wait() ● Thread Based:  – Pthread library – Solaris Thread library ● Specific Framework: OpenMP ● Exercice: read the following articles: I. Basics – “Imperative Concurrent Object­Oriented Languages”,  M.Philippsen, 1995 – “Threads Cannot Be Implemented As a Library”, Hans­J.Boehm, 2005 Dr. Pierre Vignéras
  • 32. MIMD Distributed Memory Pseudo Code ● Basic instructions send()/receive() ● Many possibilities: – non­blocking send needs message buffering – non­blocking receive needs failure return  – blocking send and receive both needs termination  detection ● Most used combinations  blocking send and receive I. Basics ● ● non­blocking send and blocking receive ● Grouped communications – scatter/gather, broadcast, multicast, ... Dr. Pierre Vignéras
  • 33. MIMD Distributed Memory C Code ● Traditional Sockets – Bad Performance – Grouped communication limited ● Parallel Virtual Machine (PVM) – Old but good and portable ● Message Passing Interface (MPI) – The standard used in parallel programming  today I. Basics – Very efficient implementations ● Vendors of supercomputers provide their own highly  optimized MPI implementation ● Open­source implementation available Dr. Pierre Vignéras
  • 34. Performance ● Size and Depth Speedup and Efficiency Outline ● Dr. Pierre Vignéras 34
  • 35. Parallel Program Characteristics ● s = V[0]; for (i = 1; i < N; i++) { s = s + V[i]; } V[0] V[1] V[2] V[3] ... V[N-1] I. Performance + + Data Dependence Graph ● Size = N-1 (nb op) + ● Depth = N-1 (nb op in the longest path from + any input to any output) s Dr. Pierre Vignéras 35
  • 36. Parallel Program Characteristics V[0] V[1] V[2] V[3] V[N-2] V[N-1] + + + I. Performance + + + ● Size = N-1 ● Depth =⌈log2N⌉ s Dr. Pierre Vignéras 36
  • 37. Parallel Program Characteristics ● Size and Depth does not take into account  many things – memory operation (allocation, indexing, ...) – communication between processes ● read/write for shared memory MIMD I. Performance ● send/receive for distributed memory MIMD – Load balancing when the number of processors is  less than the number of parallel task – etc... ● Inherent Characteristics of a Parallel Machine  independent of any specific architecture Dr. Pierre Vignéras 37
  • 38. Prefix Problem ● for (i = 1; i < N; i++) { V[i] = V[i-1] + V[i]; } V[0] V[1] V[2] V[3] ... V[N-1] + ● Size = N-1 I. Performance ● Depth =N-1 + + How to make this + program parallel? V[0] V[1] V[2] V[3] ... V[N-1] Dr. Pierre Vignéras 38
  • 39. Prefix Solution 1 Upper/Lower Prefix Algorithm ● Divide and Conquer – Divide a problem of size N into 2 equivalent  problems of size N/2 Apply the same algorithm V[0] V[N-1] for sub-prefix. ... ... When N=2, use the direct, sequential implementation ⌈N/2⌉ Prefix ⌊N/2⌋ Prefix Size has increased! N-2+N/2 > N-1 (N>2) Depth = ⌈log2N⌉ ... + + ... + Size (N=2k) = k.2k-1 =(N/2).log2N V[0] V[N-1] Dr. Pierre Vignéras 39
  • 40. ul Analysis of P ● Proof by induction ● Only for N a power of 2 – Assume from construction that size increase  smoothly with N ● Example: N=1,048,576=220 – Sequential Prefix Algorithm: ● 1,048,575 operations in 1,048,575 time unit steps – Upper/Lower Prefix Algorithm  ● 10,485,760 operations in 20 time unit steps Dr. Pierre Vignéras 40
  • 41. Other Prefix Parallel Algorithms ● Pul requires N/2 processors available! – In our example, 524,288 processors! ● Other algorithms (exercice: study them) –  Odd/Even Prefix Algorithm ● Size = 2N­log2(N)­2  (N=2k)  ● Depth = 2log2(N)­2 –  Ladner and Fischer's Parallel Prefix ● Size = 4N­4.96N0.69 + 1 ● Depth = log2(N) ● Exercice: implement them in C Dr. Pierre Vignéras 41
  • 42. Benchmarks (Inspired by Pr. Ishfaq Ahmad Slides) ● A benchmark is a performance testing program  – supposed to capture processing and data  movement characteristics of a class of applications. ● Used to measure and to predict performance   of computer systems, and to reveal their  architectural weakness and strong points. ● Benchmark suite: set of benchmark programs. ● Benchmark family: set of benchmark suites. ● Classification:  – scientific computing, commercial, applications,  network services, multimedia applications, ... Dr. Pierre Vignéras 42
  • 43. Benchmark Examples Type Name Measuring Numerical computing (linear al- LINPACK gebra) Micro-Benchmark System calls and data movement LMBENCH operations in Unix STREAM Memory bandwidth NAS Parallel computing (CFD) PARKBENCH Parallel computing SPEC A mixed benchmark family Macro-Benchmark Splash Parallel computing STAP Signal processing TPC Commercial applications Dr. Pierre Vignéras 43
  • 44. SPEC Benchmark Family ● Standard Performance Evaluation Corporation – Started with benchmarks that measure CPU  performance – Has extended to client/server computing,  commercial applications, I/O subsystems, etc. ● Visit http://www.spec.org/ for more  informations Dr. Pierre Vignéras 44
  • 45. Performance Metrics ● Execution time  – Generally in seconds – Real time, User Time, System time? ● Real time ● Instruction Count – Generally in MIPS or BIPS – Dynamic: number of executed instructions, not the  number of assembly line code! ● Number of Floating Point Operations Executed – Mflop/s, Gflop/s – For scientific applications where flop dominates Dr. Pierre Vignéras 45
  • 46. Memory Performance ● Three parameters: – Capacity, – Latency, – Bandwidth Dr. Pierre Vignéras 46
  • 47. Parallelism and Interaction Overhead ● Time to execute a parallel program is T = Tcomp+Tpar+Tinter – Tcomp: time of the effective computation – Tpar: parallel overhead – Process management(creation, termination, context switching) – Grouping operations (creation, destruction of group) – Process Inquiry operation (Id, Group Id, Group size) – Tinter: interaction overhead – Synchronization (locks, barrier, events, ...) – Aggregation (reduction and scan) – Communication (p2p, collective, read/write shared variables) Dr. Pierre Vignéras 47
  • 48. Measuring Latency Ping-Pong for (i=0; i < Runs; i++) if (my_node_id == 0) { /* sender */ tmp = Second(); start_time = Second(); send an m-byte message to node 1; receive an m-byte message from node 1; end_time = Second(); timer_overhead = start_time - tmp ; total_time = end_time - start_time - timer_overhead; communication_time[i] = total_time / 2 ; } else if (my_node_id ==1) {/* receiver */ receive an m-byte message from node 0; send an m-byte message to node 0; } } Dr. Pierre Vignéras 48
  • 49. Measuring Latency Hot-Potato (fire-brigade) Method ● n nodes are involved (instead of just two) ● Node 0 sends an m­bytes message to node 1 ● On receipt, node 1 immediately sends the  same message to node 2 and so on. ● Finally node n­1 sends the message back to  node 0 ● The total time is divided by n to get the point­ to­point  average communication time. Dr. Pierre Vignéras 49
  • 50. Collective Communication Performance for (i = 0; i < Runs; i++) { barrier synchronization; tmp = Second(); start_time = Second(); for (j = 0; j < Iterations; j++) the_collective_routine_being_measured; end_time = Second(); timer_overhead = start_time - tmp; total_time = end_time - start_time - timer_overhead; local_time = total_time / Iterations; communication_time[i] = max {all n local_time}; } How to get all n local time? Dr. Pierre Vignéras 50
  • 51. Point-to-Point Communication Overhead ● Linear function of the message length m (in  bytes) –  t(m) = t0 + m/r∞ – t0: start­up time – r∞: asymptotic bandwidth t ● Ideal Situation! ● Many things are left out ● Topology metrics t0 ● Network Contention m Dr. Pierre Vignéras 51
  • 52. Collective Communication ● Broadcast: 1 process sends an m­bytes message to all  n process ● Multicast: 1 process sends an m­bytes message to p <  n process ● Gather: 1 process receives an m­bytes from each of  the n process (mn bytes received) ● Scatter: 1 process sends a distinct m­bytes message to  each of the n process (mn bytes sent) ● Total exchange: every process sends a distinct m­ bytes message to each of the n process (mn2 bytes  sent) ● Circular Shift: process i sends an m­bytes message to  process i+1 (process n­1 to process 0) Dr. Pierre Vignéras 52
  • 53. Collective Communication Overhead ● Function of both m and n – start­up latency depends only on n T(m,n) =  t0(n) + m/r∞(n) ● Many problems here! For a broadcast for example, how is done the communication at the link layer? ● Sequential? ● Process 1 sends 1 m-bytes message to process 2, then to process 3, and so on... (n steps required) ● Truly Parallel (1 step required) ● Tree based (log (n) steps required) 2 ● Again, many things are left out ● Topology metrics ● Network Contention Dr. Pierre Vignéras 53
  • 54. Speedup (Problem-oriented) ● For a given algorithm, let Tp be time to perform  the computation using p processors – T∞: depth of the algorithm ● How much time we gain by using a parallel  algorithm? – Speedup: Sp=T1/Tp – Issue: what is T1? ● Time taken by the execution of the best sequential  algorithm on the same input data ● We note T1=Ts Dr. Pierre Vignéras
  • 55. Speedup consequence ● Best Parallel Time Achievable is: Ts/p ● Best Speedup is bounded by Sp≤Ts/(Ts/p)=p – Called Linear Speedup ● Conditions – Workload can be divided into p equal parts – No overhead at all ● Ideal situation, usually impossible to achieve! ● Objective of a parallel algorithm – Being as close to a linear speedup as possible Dr. Pierre Vignéras
  • 56. Biased Speedup (Algorithm-oriented) ● We usually don't know what is the best  sequential algorithm except for some simple  problems (sorting, searching, ...) ● In this case, T1 – is the time taken by the execution of the same  parallel algorithm on 1 processor. ● Clearly biased! – The parallel algorithm and the best sequential  algorithm are often very, very different ● This speedup should be taken with that  limitation in mind! Dr. Pierre Vignéras
  • 57. Biased Speedup Consequence ● Since we compare our parallel algorithm on p  processors with a suboptimal sequential  (parallel algorithm on only 1 processor) we can  have: Sp>p – Superlinear speedup ● Can also be the consequence of using – a unique feature of the system architecture that  favours parallel formation – indeterminate nature of the algorithm – Extra­memory due to the use of p computers (less  swap) Dr. Pierre Vignéras
  • 58. Efficiency ● How long processors are being used on the  computation? – Efficiency: Ep=Sp/p (expressed in %) – Also represents how much benefits we gain by  using p processors ● Bounded by 100% – Superlinear speedup of course can give more than  100% but it is biased Dr. Pierre Vignéras
• 59. Amdahl's Law: fixed problem size (1967) ● For a problem with a workload W, we assume that we can divide it into two parts: – W = αW + (1−α)W: α percent of W must be executed sequentially, and the remaining (1−α) can be executed by p nodes simultaneously with 100% efficiency – Ts(W) = Ts(αW + (1−α)W) = Ts(αW) + Ts((1−α)W) = αTs(W) + (1−α)Ts(W) – Tp(W) = Tp(αW + (1−α)W) = Ts(αW) + Tp((1−α)W) = αTs(W) + (1−α)Ts(W)/p – Sp = Ts/Tp = Ts / (αTs + (1−α)Ts/p) = p / (1 + (p−1)α) → 1/α as p → ∞ Dr. Pierre Vignéras 59
• 60. Amdahl's Law: fixed problem size [Figure: the sequential time Ts split into a serial section αTs and parallelizable sections (1-α)Ts; with p processors the parallelizable part shrinks to (1-α)Ts/p, giving the parallel time Tp] Dr. Pierre Vignéras 60
• 61. Amdahl's Law Consequences [Figure: S(p) plotted for α = 5%, 10%, 20%, 40%, and S(α) plotted for p = 10 and p = 100] Dr. Pierre Vignéras 61
• 62. Amdahl's Law Consequences ● For a fixed workload, and without any overhead, the maximal speedup has an upper bound of 1/α ● Taking into account the overhead To – Performance is limited not only by the sequential bottleneck but also by the average overhead – Sp = Ts / (αTs + (1−α)Ts/p + To) = p / (1 + (p−1)α + p·To/Ts) – Sp → 1 / (α + To/Ts) as p → ∞ Dr. Pierre Vignéras 62
• 63. Amdahl's Law Consequences ● The sequential bottleneck cannot be removed just by increasing the number of processors in a system. – This led to a very pessimistic view of parallel processing for two decades. ● Major assumption: the problem size (workload) is fixed. ● Conclusion: don't use parallel programming on a small-sized problem! Dr. Pierre Vignéras 63
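A quick worked example (the numbers are chosen here for illustration, they are not from the slides): with α = 5% and p = 100 processors, Sp = 100 / (1 + 99 × 0.05) ≈ 16.8, and even with p → ∞ the speedup can never exceed 1/α = 20. Adding an overhead To = 0.01·Ts lowers the asymptotic bound further, to 1/(0.05 + 0.01) ≈ 16.7.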
• 64. Gustafson's Law: fixed time (1988) ● As the machine size increases, the problem size may increase with execution time unchanged ● Sequential case: W = αW + (1−α)W – Ts(W) = αTs(W) + (1−α)Ts(W) – Tp(W) = αTs(W) + (1−α)Ts(W)/p – To keep the time constant, the parallel workload should be increased ● Parallel case (p processors): W' = αW + p(1−α)W – Ts(W') = αTs(W) + p(1−α)Ts(W) ● Assumption: the sequential part is constant: it does not increase with the size of the problem Dr. Pierre Vignéras 64
• 65. Fixed time speedup S'(p) = Ts(W') / Tp(W') = [αTs(W) + p(1−α)Ts(W)] / [αTs(W) + (1−α)Ts(W)] = α + (1−α)p ● Fixed time speedup is a linear function of p, if the workload is scaled up to maintain a fixed execution time Dr. Pierre Vignéras 65
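A worked comparison (illustrative numbers, not from the slides): with α = 5% and p = 100, the scaled speedup is S(p) = 0.05 + 0.95 × 100 = 95.05, against the fixed-size Amdahl bound of roughly 16.8 for the same α and p. Scaling the workload with the machine is what makes large processor counts pay off.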
• 66. Gustafson's Law Consequences [Figure: S(p) plotted for α = 5%, 10%, 20%, 40%, and S(α) plotted for p = 10, 25, 50, 100] Dr. Pierre Vignéras 66
• 67. Fixed time speedup with overhead S'(p) = Ts(W') / Tp(W') = [α + (1−α)p] / [1 + To(p)/Ts(W)] ● Best fixed-time speedup is achieved with a low overhead, as expected ● To: function of the number of processors ● Quite difficult in practice to make it a decreasing function Dr. Pierre Vignéras 67
  • 68. SIMD Programming ● Matrix Addition & Multiplication Gauss Elimination Outline ● Dr. Pierre Vignéras 68
  • 69. Matrix Vocabulary ● A matrix is dense  if most of its elements are  non­zero ● A matrix is sparse if a significant number of its  elements are zero – Space efficient representation available – Efficient algorithms available ● Usual matrix size in parallel programming – Lower than 1000x1000 for a dense matrix – Bigger than 1000x1000 for a sparse matrix – Growing with the performance of computer  architectures Dr. Pierre Vignéras 69
• 70. Matrix Addition
C = A + B, with c[i,j] = a[i,j] + b[i,j], 0 ≤ i < n, 0 ≤ j < m
// Sequential
for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
    c[i,j] = a[i,j] + b[i,j];
  }
}
// SIMD: the vector addition of the m elements a[i]+b[i] must be mapped
// onto the underlying architecture made of p processing elements
for (i = 0; i < n; i++) {
  c[i,j] = a[i,j] + b[i,j];   (0 <= j < m)
}
Complexity: – Sequential case: n.m additions – SIMD case: n.m/p – Speedup: S(p) = p
Dr. Pierre Vignéras 70
• 71. Matrix Multiplication
C = AB, with A[n,l], B[l,m], C[n,m] and c[i,j] = Σ (k = 0 .. l−1) a[i,k]·b[k,j], 0 ≤ i < n, 0 ≤ j < m
for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
    c[i,j] = 0;
    for (k = 0; k < l; k++) {
      c[i,j] = c[i,j] + a[i,k]*b[k,j];
    }
  }
}
Complexity: – n.m initialisations – n.m.l additions and multiplications
Overhead: memory indexing! – Use a temporary variable sum for c[i,j]!
Dr. Pierre Vignéras 71
• 72. SIMD Matrix Multiplication
for (i = 0; i < n; i++) {
  c[i,j] = 0;                          (0 <= j < m)  // SIMD op
  for (k = 0; k < l; k++) {
    c[i,j] = c[i,j] + a[i,k]*b[k,j];   (0 <= j < m)
  }
}
Complexity (n = m = l for simplicity): – n²/p initialisations (p ≤ n) – n³/p additions and multiplications (p ≤ n)
Overhead: memory indexing! – Use a temporary vector for c[i]!
Algorithm in O(n) with p = n²: – Assign one processor to each element of C – Initialisation in O(1) – Computation in O(n)
Algorithm in O(lg(n)) with p = n³: – Parallelize the inner loop! – Lowest bound
Dr. Pierre Vignéras 72
• 73. Gauss Elimination ● A matrix (n x n) represents a system of n linear equations of n unknowns:
a(n−1,0)·x0 + a(n−1,1)·x1 + a(n−1,2)·x2 + ... + a(n−1,n−1)·x(n−1) = b(n−1)
a(n−2,0)·x0 + a(n−2,1)·x1 + a(n−2,2)·x2 + ... + a(n−2,n−1)·x(n−1) = b(n−2)
...
a(1,0)·x0 + a(1,1)·x1 + a(1,2)·x2 + ... + a(1,n−1)·x(n−1) = b1
a(0,0)·x0 + a(0,1)·x1 + a(0,2)·x2 + ... + a(0,n−1)·x(n−1) = b0
Represented by the augmented matrix M whose rows are [a(i,0) a(i,1) a(i,2) ... a(i,n−1) | b(i)], for i = n−1 down to 0.
Dr. Pierre Vignéras 73
  • 74. Gauss Elimination Principle ● Transform the linear system into a triangular  system of equations. At the ith row, each row j below is replaced by: {row j} + {row i}.(-aj, i/ai, i) Dr. Pierre Vignéras 74
• 75. Gauss Elimination: Sequential & SIMD Algorithms
// Sequential. Complexity: O(n³)
for (i = 0; i < n-1; i++) {          // for each row except the last
  for (j = i+1; j < n; j++) {        // step through subsequent rows
    m = a[j,i] / a[i,i];             // compute multiplier
    for (k = i; k < n; k++) {
      a[j,k] = a[j,k] - a[i,k]*m;
    }
  }
}
// SIMD version. Complexity: O(n³/p)
for (i = 0; i < n-1; i++) {
  for (j = i+1; j < n; j++) {
    m = a[j,i] / a[i,i];
    a[j,k] = a[j,k] - a[i,k]*m;      (i <= k < n)
  }
}
Dr. Pierre Vignéras 75
  • 76. Other SIMD Applications ● Image Processing ● Climate and Weather Prediction ● Molecular Dynamics ● Semi­Conductor Design ● Fluid Flow Modelling ● VLSI­Design ● Database Information Retrieval ● ... Dr. Pierre Vignéras 76
  • 77. MIMD SM Programming ● Process, Threads & Fibers ● Pre- & self-scheduling Common Issues Outline ● ● OpenMP Dr. Pierre Vignéras 77
• 78. Process, Threads, Fibers ● Process: heavy weight – Instruction Pointer, Stacks, Registers – memory, file handles, sockets, device handles, and windows ● (kernel) Thread: medium weight – Instruction Pointer, Stacks, Registers – Known and scheduled preemptively by the kernel ● (user thread) Fiber: light weight – Instruction Pointer, Stacks, Registers – Unknown to the kernel and scheduled cooperatively in user space Dr. Pierre Vignéras 78
• 79. Multithreaded Process [Figure: a single-threaded process (one register set and one stack beside the code, data and files) compared with a multi-threaded process (one register set and one stack per thread, all threads sharing the same code, data and files)] Dr. Pierre Vignéras 79
  • 80. Kernel Thread ● Provided and scheduled by the kernel – Solaris, Linux, Mac OS X, AIX, Windows XP – Incompatibilities: POSIX PThread Standard ● Scheduling Question: Contention Scope – The time slice q can be  given to: ● the whole process (PTHREAD_PROCESS_SCOPE) ● Each individual thread (PTHREAD_SYSTEM_SCOPE) ● Fairness? – With a PTHREAD_SYSTEM_SCOPE, the more  threads a process has, the more CPU it has! Dr. Pierre Vignéras 80
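A minimal sketch of how the contention scope can be requested through the POSIX threads API (the wrapper function and its name are made up for this example; pthread_attr_setscope() and the two scope constants are the standard calls):

#include <pthread.h>

/* Sketch: create a thread with system (kernel-level) contention scope.
   PTHREAD_SCOPE_SYSTEM: the thread competes with every thread on the system.
   PTHREAD_SCOPE_PROCESS: it competes only with threads of its own process. */
void start_worker(pthread_t *tid, void *(*body)(void *), void *arg) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM); /* may return an error if unsupported */
    pthread_create(tid, &attr, body, arg);
    pthread_attr_destroy(&attr);
}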
• 81. Fibers (User Threads) ● Implemented entirely in user space (also called green threads) ● Advantages: – Performance: no need to cross the kernel (context switch) to schedule another thread – The scheduler can be customized for the application ● Disadvantages: – Cannot directly use multi-processors – Any system call blocks the entire process ● Solutions: Asynchronous I/O, Scheduler Activation ● Systems: NetBSD, Windows XP (FiberThread) Dr. Pierre Vignéras 81
  • 82. Mapping ● Threads are expressed by the application in  user space. ● 1­1: any thread expressed in user space is  mapped to a kernel thread – Linux, Windows XP, Solaris > 8  ● n­1: n threads expressed in user space are  mapped to one kernel thread – Any user­space only thread library ● n­m: n threads expressed in user space are  mapped to m kernel threads (n>>m) – AIX, Solaris < 9, IRIX, HP­UX, TRU64 UNIX Dr. Pierre Vignéras 82
• 83. Mapping [Figure: the 1-1, n-1 and n-m models, showing how user-space threads are attached to kernel-space threads in each case] Dr. Pierre Vignéras 83
• 84. MIMD SM Matrix Addition: Self-Scheduling
private i, j, k; shared A[N,N], B[N,N], C[N,N], N;
for (i = 0; i < N - 2; i++)
  fork DOLINE;
// Only the original process reaches this point
i = N - 1;                       // Really required?
DOLINE:                          // Executed by N processes, one per line
for (j = 0; j < N; j++) {        // SIMD computation
  C[i,j] = A[i,j] + B[i,j];
}
join N;                          // Wait for the other processes
Complexity: O(n²/p); if SIMD is used: O(n²/(p.m)), m = number of SIMD PEs
Issue: if n > p, self-scheduling: the OS decides who will do what (dynamic). In this case, overhead.
Dr. Pierre Vignéras 84
• 85. MIMD SM Matrix Addition: Pre-Scheduling ● Basic idea: divide the whole matrix into p = s*s submatrices [Figure: C = A + B partitioned into s x s blocks; each of the p threads computes one block sum] ● Ideal situation! No communication overhead! ● Pre-scheduling: decide statically who will do what (a pthreads sketch follows below). Dr. Pierre Vignéras 85
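A minimal pre-scheduled sketch with POSIX threads, using a row-block partition rather than the s x s blocks pictured, purely for brevity; the matrix size N and thread count P below are illustrative values:

#include <pthread.h>

#define N 1024            /* illustrative matrix size */
#define P 4               /* illustrative number of threads */

static double A[N][N], B[N][N], C[N][N];

/* Each thread statically owns a contiguous block of rows: no locking is needed. */
static void *add_block(void *arg) {
    long id = (long) arg;
    int first = id * N / P, last = (id + 1) * N / P;
    for (int i = first; i < last; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
    return NULL;
}

int main(void) {
    pthread_t tid[P];
    for (long id = 0; id < P; id++)
        pthread_create(&tid[id], NULL, add_block, (void *) id);
    for (int id = 0; id < P; id++)
        pthread_join(tid[id], NULL);   /* plays the role of the join in the slides */
    return 0;
}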
  • 86. Parallel Primality Testing (from Herlihy and Shavit Slides) ● Challenge – Print primes from 1 to 1010 ● Given – Ten­processor multiprocessor – One thread per processor ● Goal – Get ten­fold speedup (or close) Dr. Pierre Vignéras 86
• 87. Load Balancing ● Split the work evenly ● Each thread tests a range of 10⁹ numbers [Figure: the interval 1 .. 10¹⁰ cut into ten blocks of 10⁹, assigned to P0 .. P9] Code for thread i: void thread(int i) { for (j = i*10⁹+1; j < (i+1)*10⁹; j++) { if (isPrime(j)) print(j); } } Dr. Pierre Vignéras
• 88. Issues ● Larger ranges have fewer primes ● Larger numbers are harder to test ● Thread workloads – Uneven – Hard to predict ● Need dynamic load balancing [the static split is rejected] Dr. Pierre Vignéras
• 89. Shared Counter [Figure: a shared counter holding successive values (17, 18, 19, ...); each thread takes the next number from it] Dr. Pierre Vignéras
• 90. Procedure for Thread i // Global (shared) variable static long value = 1; void thread(int i) { long j = 0; while (j < 10¹⁰) { j = inc(); if (isPrime(j)) print(j); } } Dr. Pierre Vignéras
• 91. Counter Implementation // Global (shared) variable static long value = 1; long inc() { return value++; } [OK for a uniprocessor, not for a multiprocessor] Dr. Pierre Vignéras
• 92. Counter Implementation // Global (shared) variable static long value = 1; long inc() { return value++; } ... but the increment actually expands to three steps: temp = value; value = value + 1; return temp; Dr. Pierre Vignéras
• 93. Uh-Oh [Figure: a timeline of two threads interleaving on the shared counter: one reads 1, writes 2, reads 2, writes 3, while the other also reads 1 and then writes 2, so the final value is 2 instead of 3 and an update is lost] Dr. Pierre Vignéras
• 94. Challenge // Global (shared) variable static long value = 1; long inc() { temp = value; value = temp + 1; return temp; } ... make these steps atomic (indivisible) Dr. Pierre Vignéras
  • 95. An Aside: C/Posix Threads API // Global (shared) variable static long value = 1; static pthread_mutex_t mutex; // somewhere before pthread_mutex_init(&mutex, NULL); long inc() { pthread_mutex_lock(&mutex); temp = value; Critical section value = temp + 1; pthread_mutex_unlock(&mutex); return temp; } Dr. Pierre Vignéras
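A self-contained sketch of how the mutex-protected inc() of this slide can be driven; the thread count and iteration count are arbitrary illustration values, while the locking pattern itself is the standard pthreads one:

#include <pthread.h>
#include <stdio.h>

static long value = 1;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static long inc(void) {
    pthread_mutex_lock(&mutex);
    long temp = value;              /* critical section: read-modify-write */
    value = temp + 1;
    pthread_mutex_unlock(&mutex);
    return temp;
}

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++)
        inc();
    return NULL;
}

int main(void) {
    pthread_t tid[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    printf("final value = %ld\n", value);   /* 400001: no lost update */
    return 0;
}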
  • 96. Correctness ● Mutual Exclusion – Never two or more in critical section – This is a safety property ● No Lockout (lockout­freedom) – If someone wants in, someone gets in – This is a liveness property Dr. Pierre Vignéras
  • 97. Multithreading Issues ● Coordination – Locks, Condition Variables, Volatile ● Cancellation (interruption) – Asynchronous (immediate) ● Equivalent to 'kill ­KILL' but much more dangerous!! – What happens to lock, open files, buffers, pointers, ...? ● Avoid it as much as you can!! – Deferred ● Target thread should check periodically if it should be  cancelled ● No guarantee that it will actually stop running! Dr. Pierre Vignéras 97
• 98. General Multithreading Guidelines ● Use good software engineering – Always check error statuses ● Always use a while(condition) loop when waiting on a condition variable ● Use a thread pool for efficiency – Reduces the number of created threads – Use thread-specific data so one thread can hold its own (private) data ● The Linux kernel schedules tasks only – Process == a task that shares nothing – Thread == a task that shares everything Dr. Pierre Vignéras 98
  • 99. General Multithreading Guidelines ● Give the biggest priorities to I/O bounded  threads – Usually blocked, should be responsive ● CPU­Bounded threads should usually have the  lowest priority ● Use Asynchronous I/O when possible, it is  usually more efficient than the combination of  threads+ synchronous I/O ● Use event­driven programming to keep  indeterminism as low as possible ● Use threads when you really need it! Dr. Pierre Vignéras 99
• 100. OpenMP ● Designed as a superset of common languages for MIMD-SM programming (C, C++, Fortran) ● UNIX and Windows versions ● fork()/join() execution model ● OpenMP programs always start with one thread ● Relaxed-consistency memory model ● Each thread is allowed to have its own temporary view of the memory ● Compiler directives in C/C++ are called pragmas (pragmatic information) ● Preprocessor directives: #pragma omp <directive> [arguments] Dr. Pierre Vignéras 100
  • 102. OpenMP Thread Creation ● #pragma omp parallel [arguments] <C instructions>   // Parallel region ● A team of thread is created to execute the  parallel region – the original thread becomes the master thread  (thread ID = 0 for the duration of the region) – numbers of threads determined by ● Arguments (if, num_threads) ● Implementation (nested parallelism support, dynamic  adjustment) ● Environment variables (OMP_NUM_THREADS) Dr. Pierre Vignéras 102
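A minimal parallel-region sketch (the thread count of 4 is illustrative; omp_get_thread_num() and omp_get_num_threads() are standard OpenMP runtime calls):

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* A team of threads executes the region; the original thread becomes the master (id 0). */
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();
        printf("thread %d of %d\n", id, n);
    }   /* implicit join at the end of the parallel region */
    return 0;
}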
  • 103. OpenMP Work Sharing ● Distributes the execution of a  parallel region  among team members – Does not launch threads – No barrier on entry – Implicit barrier on exit (unless nowait specified) ● Work­Sharing Constructs – Sections  – Loop – Single  – Workshare (Fortran) Dr. Pierre Vignéras 103
  • 104. OpenMP Work Sharing - Section double e, pi, factorial, product; int i; e = 1; // Compute e using Taylor expansion factorial = 1; for (i = 1; i<num_steps; i++) { factorial *= i; e += 1.0/factorial; } // e done pi = 0; // Compute pi/4 using Taylor expansion for (i = 0; i < num_steps*10; i++) { pi += 1.0/(i*4.0 + 1.0); pi -= 1.0/(i*4.0 + 3.0); } Independent Loops pi = pi * 4.0; // pi done Ideal Case! return e * pi; Dr. Pierre Vignéras 104
• 105. OpenMP Work Sharing - Section
double e, pi, factorial, product; int i;
// Variables are private copies, except for the shared ones
#pragma omp parallel sections shared(e, pi)
{
  #pragma omp section
  {   // independent sections are declared
    e = 1; factorial = 1;
    for (i = 1; i < num_steps; i++) { factorial *= i; e += 1.0/factorial; }
  } /* e section */
  #pragma omp section
  {
    pi = 0;
    for (i = 0; i < num_steps*10; i++) { pi += 1.0/(i*4.0 + 1.0); pi -= 1.0/(i*4.0 + 3.0); }
    pi = pi * 4.0;
  } /* pi section */
} /* omp sections: implicit barrier on exit */
return e * pi;
Dr. Pierre Vignéras 105
• 106. Taylor Example Analysis – OpenMP Implementation: Intel Compiler – System: Gentoo/Linux-2.6.18 – Hardware: Intel Core Duo T2400 (1.83 GHz), 2 MB L2 cache (SmartCache) ● Gcc: Standard: 13250 ms; Optimized (-Os -msse3 -march=pentium-m -mfpmath=sse): 7690 ms ● Intel CC: Standard: 12530 ms; Optimized (-O2 -xW -march=pentium4): 7040 ms; OpenMP Standard: 15340 ms; OpenMP Optimized: 10460 ms ● Speedup: 0.6, Efficiency: 3% (Reason: Core Duo architecture?) Dr. Pierre Vignéras 106
  • 107. OpenMP Work Sharing - Section ● Define a set of structured blocks that are to be  divided among, and executed by, the threads  in a team ● Each structured block is executed once by one  of the threads in the team ● The method of scheduling the structured  blocks among threads in the team is  implementation defined ● There is an implicit barrier at the end of a  sections construct, unless a nowait clause is  specified Dr. Pierre Vignéras 107
  • 108. OpenMP Work Sharing - Loop // LU Matrix Decomposition for (k = 0; k<SIZE-1; k++) { for (n = k; n<SIZE; n++) { col[n] = A[n][k]; } for (n = k+1; n<SIZE; n++) { A[k][n] /= col[k]; } for (n = k+1; n<SIZE; n++) { row[n] = A[k][n]; } for (i = k+1; i<SIZE; i++) { for (j = k+1; j<SIZE; j++) { A[i][j] = A[i][j] - row[i] * col[j]; } } } Dr. Pierre Vignéras 108
• 109. OpenMP Work Sharing - Loop
// LU Matrix Decomposition
for (k = 0; k < SIZE-1; k++) {
  ... // same as before
  #pragma omp parallel for shared(A, row, col)
  for (i = k+1; i < SIZE; i++) {
    for (j = k+1; j < SIZE; j++) {
      A[i][j] = A[i][j] - row[i] * col[j];
    }
  }
}
Notes: the loop should be in canonical form (see spec); variables are private copies except for the shared ones; implicit barrier at the end of a loop (memory model)
Dr. Pierre Vignéras 109
  • 110. OpenMP Work Sharing - Loop ● Specifies that the iterations of the associated  loop will be executed in parallel. ● Iterations of the loop are distributed across  threads that already exist in the team ● The for directive places restrictions on the  structure of the corresponding for­loop – Canonical form (see specifications): ● for (i = 0; i < N; i++) {...} is ok ● For (i = f(); g(i);) {... i++} is not ok ● There is an implicit barrier at the end of a loop  construct unless a nowait clause is specified. Dr. Pierre Vignéras 110
• 111. OpenMP Work Sharing - Reduction ● reduction(op: list), e.g. reduction(*: result) ● A private copy is made for each variable declared in 'list' for each thread of the parallel region ● A final value is produced using the operator 'op' to combine all private copies #pragma omp parallel for reduction(*: res) for (i = 0; i < SIZE; i++) { res = res * a[i]; } Dr. Pierre Vignéras 111
  • 112. OpenMP Synchronisation -- Master ● Master directive: #pragma omp master {...} ● Define a section that is only to be executed by  the master thread ● No implicit barrier: other threads in the team  do not wait Dr. Pierre Vignéras 112
  • 113. OpenMP Synchronisation -- Critical ● Critical directive: #pragma omp critical [name] {...} ● A thread waits at the beginning of a critical  region until no other thread is executing a  critical region with the same name. ● Enforces exclusive access with respect to all  critical constructs with the same name in all  threads, not just in the current team. ● Constructs without a name are considered to  have the same unspecified name.  Dr. Pierre Vignéras 113
• 114. OpenMP Synchronisation -- Barrier ● Barrier directive: #pragma omp barrier – All threads of the team must execute the barrier before any are allowed to continue execution – Restrictions ● Each barrier region must be encountered by all threads in a team or by none at all. ● barrier is not part of C! – if (i < n) #pragma omp barrier // Syntax error! Dr. Pierre Vignéras 114
• 115. OpenMP Synchronisation -- Atomic ● Atomic directive: #pragma omp atomic expr-stmt ● Where expr-stmt is: x binop= expr, x++, ++x, x--, --x (x scalar, binop any non-overloaded binary operator: +, /, -, *, <<, >>, &, ^, |) ● Only the load and store of the object designated by x are atomic; evaluation of expr is not atomic. ● Does not enforce exclusive access of the same storage location x. ● Some restrictions (see specifications) Dr. Pierre Vignéras 115
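A small sketch contrasting atomic and critical on a shared-counter style update (the variable names are illustrative): atomic protects a single scalar update cheaply, while critical protects an arbitrary block.

#include <omp.h>

long counter = 0;
double sum = 0.0, table[1000];

void update(int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        counter++;                    /* only the load/store of counter is atomic */

        #pragma omp critical(accum)
        {                             /* the whole block is mutually exclusive */
            sum += table[i % 1000];
        }
    }
}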
• 116. OpenMP Synchronisation -- Flush ● Flush directive: #pragma omp flush [(list)] ● Makes a thread's temporary view of memory consistent with memory ● Enforces an order on the memory operations of the variables explicitly specified or implied. ● Implied flush: – barrier, entry to and exit from a parallel region, critical – Exit from a work-sharing region (unless nowait) – ... ● Warning with pointers! Dr. Pierre Vignéras 116
  • 117. MIMD DM Programming ● Message Passing Solutions ● Process Creation Send/Receive Outline ● ● MPI Dr. Pierre Vignéras 117
  • 118. Message Passing Solutions ● How to provide Message Passing to  programmers? – Designing a special parallel programming language ● Sun Fortress, E language, ... – Extending the syntax of an existing sequential  language ● ProActive, JavaParty – Using a library ● PVM, MPI, ...  Dr. Pierre Vignéras 118
  • 119. Requirements ● Process creation – Static: the number of processes is fixed and defined  at start­up (e.g. from the command line) – Dynamic: process can be created and destroyed at  will during the execution of the program. ● Message Passing – Send and receive efficient primitives – Blocking or Unblocking – Grouped Communication Dr. Pierre Vignéras 119
  • 120. Process Creation Multiple Program, Multiple Data Model (MPMD) Source Source File File Compile to suit processor Executable Executable File File Processor 0 Processor p-1 Dr. Pierre Vignéras 120
  • 121. Process Creation Single Program, Multiple Data Model (SPMD) Source File Compile to suit processor Executable Executable File File Processor 0 Processor p-1 Dr. Pierre Vignéras 121
• 122. SPMD Code Sample
int pid = getPID();
if (pid == MASTER_PID) {
  execute_master_code();
} else {
  execute_slave_code();
}
● Compilation for heterogeneous processors still required ● Master/Slave architecture ● getPID() is provided by the parallel system (library, language, ...) ● MASTER_PID should be well defined ● An MPI sketch of the same skeleton follows below
Dr. Pierre Vignéras 122
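An MPI rendering of the same SPMD skeleton, as a sketch (the rank plays the role of getPID(); taking rank 0 as the master is a convention of this example, not something imposed by MPI):

#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process id, 0..size-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    if (rank == 0) {
        /* execute_master_code(); */
    } else {
        /* execute_slave_code(); */
    }

    MPI_Finalize();
    return 0;
}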
• 123. Dynamic Process Creation ● Primitive required: spawn(location, executable) [Figure: process 1 is running, then calls spawn() to start process 2, which runs alongside it] Dr. Pierre Vignéras 123
  • 124. Send/Receive: Basic ● Basic primitives: – send(dst, msg); – receive(src, msg); ● Questions: – What is the type of: src, dst? ● IP:port ­­> Low performance but portable ● Process ID ­­> Implementation dependent (efficient  non­IP implementation possible) – What is the type of: msg? ● Packing/Unpacking of complex messages ● Encoding/Decoding for heterogeneous architectures Dr. Pierre Vignéras 124
  • 125. Persistence and Synchronicity in Communication (3) (from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen) 2-22.1 a) Persistent asynchronous communication b) Persistent synchronous communication
  • 126. Persistence and Synchronicity in Communication (4) (from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen) 2-22.2 a) Transient asynchronous communication b) Receipt-based transient synchronous communication
  • 127. Persistence and Synchronicity in Communication (5) (from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen) a) Delivery-based transient synchronous communication at message delivery b) Response-based transient synchronous communication
• 128. Send/Receive: Properties (columns: message synchronization | system requirements | precedence constraints)
– Send: non-blocking, Receive: non-blocking | Message buffering; failure return from receive | None, unless the message is received successfully
– Send: non-blocking, Receive: blocking | Message buffering; termination detection | Actions preceding send occur before those following send
– Send: blocking, Receive: non-blocking | Termination detection; failure return from receive | Actions preceding send occur before those following send
– Send: blocking, Receive: blocking | Termination detection; termination detection | Actions preceding send occur before those following send
Dr. Pierre Vignéras 128
• 129. Receive Filtering ● So far, we have the primitive: receive(src, msg); ● And what if one wants to receive a message: – From any source? – From a group of processes only? ● Wildcards are used in those cases – A wildcard is a special value such as: ● IP: broadcast (e.g. 192.168.255.255) or multicast address ● The API provides well-defined wildcards (e.g. MPI_ANY_SOURCE; see the sketch below) Dr. Pierre Vignéras 129
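A sketch of wildcard reception with MPI (the message size and element type are illustrative): MPI_ANY_SOURCE and MPI_ANY_TAG match any sender or tag, and the MPI_Status reports afterwards who actually sent the message.

#include <mpi.h>

void receive_from_anyone(void) {
    int buf[64];
    MPI_Status status;

    /* Accept a message from whichever process sends first. */
    MPI_Recv(buf, 64, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);

    int sender = status.MPI_SOURCE;   /* who sent it */
    int tag    = status.MPI_TAG;      /* with which tag */
    (void) sender; (void) tag;        /* use them to decide what to do next */
}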
  • 130. The Message-Passing Interface (MPI) (from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen) Primitive Meaning MPI_bsend Append outgoing message to a local send buffer MPI_send Send a message and wait until copied to local or remote buffer MPI_ssend Send a message and wait until receipt starts MPI_sendrecv Send a message and wait for reply MPI_isend Pass reference to outgoing message, and continue MPI_issend Pass reference to outgoing message, and wait until receipt starts MPI_recv Receive a message; block if there are none MPI_irecv Check if there is an incoming message, but do not block Some of the most intuitive message-passing primitives of MPI.
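As a complement to the table, a short sketch of the non-blocking pair (the buffer size, tag and two-process assumption are illustrative): MPI_Isend/MPI_Irecv return immediately, and MPI_Waitall blocks until both operations have completed, which lets computation overlap communication.

#include <mpi.h>

void exchange(int rank, double *out, double *in, int n) {
    MPI_Request reqs[2];
    int peer = 1 - rank;               /* assumes exactly two processes */

    MPI_Irecv(in,  n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(out, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation can run here while the messages are in flight ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* complete both operations */
}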
  • 131. MPI Tutorial (by William Gropp) See William Gropp Slides Dr. Pierre Vignéras 131
• 132. Parallel Strategies Outline ● Embarrassingly Parallel Computation ● Partitioning Dr. Pierre Vignéras 132
  • 133. Embarrassingly Parallel Computations ● Problems that can be divided into different  independent subtasks – No interaction between processes – No data is really shared (but copies may exist) – Result of each process have to be combined in  some ways (reduction). ● Ideal situation! ● Master/Slave architecture – Master defines tasks and send them to slaves  (statically or dynamically) Dr. Pierre Vignéras 133
• 134. Embarrassingly Parallel Computations: Typical Example: Image Processing ● Shifting, Scaling, Clipping, Rotating, ... ● 2 partitioning possibilities [Figure: a 640x480 image cut either into vertical strips or into horizontal strips, one strip per process] Dr. Pierre Vignéras 134
• 135. Image Shift Pseudo-Code
● Master
h = Height/slaves;
for (i = 0, row = 0; i < slaves; i++, row += h)
  send(row, Pi);
for (i = 0; i < Height*Width; i++) {
  recv(oldrow, oldcol, newrow, newcol, PANY);
  map[newrow,newcol] = map[oldrow,oldcol];
}
● Slave
recv(row, Pmaster);
for (oldrow = row; oldrow < (row+h); oldrow++)
  for (oldcol = 0; oldcol < Width; oldcol++) {
    newrow = oldrow + delta_x;
    newcol = oldcol + delta_y;
    send(oldrow, oldcol, newrow, newcol, Pmaster);
  }
Dr. Pierre Vignéras 135
• 136. Image Shift Pseudo-Code Complexity ● Hypothesis: – 2 computational steps/pixel – n*n pixels – p processes ● Sequential: Ts = 2n² = O(n²) ● Parallel: T// = O(p + n²) + O(n²/p) = O(n²) (p fixed) – Communication: Tcomm = Tstartup + m·Tdata = p(Tstartup + 1·Tdata) + n²(Tstartup + 4·Tdata) = O(p + n²) – Computation: Tcomp = 2(n²/p) = O(n²/p) Dr. Pierre Vignéras 136
• 137. Image Shift Pseudo-Code Speedup ● Speedup: Ts/Tp = 2n² / [2n²/p + p(Tstartup + Tdata) + n²(Tstartup + 4Tdata)] ● Computation/Communication: (2n²/p) / [p(Tstartup + Tdata) + n²(Tstartup + 4Tdata)] = O(n² / (p(p + n²))) – Constant when p is fixed and n grows! – Not good: the larger, the better ● Computation should hide the communication ● Broadcasting is better ● This problem is for Shared Memory! Dr. Pierre Vignéras 137
• 138. Computing the Mandelbrot Set ● Problem: we consider the limit of the following sequence in the complex plane: z0 = 0, z(k+1) = zk² + c ● The constant c = a·i + b is given by the point (a,b) in the complex plane ● We compute zk for 0 < k < n – if for a given k, |zk| > 2, we know the sequence diverges to infinity and we plot c in color(k) – if for k == n, |zk| < 2, we assume the sequence converges and we plot c in black. Dr. Pierre Vignéras 138
• 139. Mandelbrot Set: sequential code
// Computes the number of iterations for the given complex point (cr, ci)
int calculate(double cr, double ci) {
  double bx = cr, by = ci;
  double xsq, ysq;
  int cnt = 0;
  while (true) {
    xsq = bx * bx; ysq = by * by;
    if (xsq + ysq >= 4.0) break;
    by = (2 * bx * by) + ci;
    bx = xsq - ysq + cr;
    cnt++;
    if (cnt >= max_iterations) break;
  }
  return cnt;
}
Dr. Pierre Vignéras 139
  • 140. Mandelbrot Set: sequential code y = startDoubleY; for (int j = 0; j < height; j++) { double x = startDoubleX; int offset = j * width; for (int i = 0; i < width; i++) { int iter = calculate(x, y); pixels[i + offset] = colors[iter]; x += dx; } y -= dy; } Dr. Pierre Vignéras 140
• 141. Mandelbrot Set: parallelization ● Static task assignment: – Cut the image into p areas and assign one to each of the p processors ● Problem: some areas are easier to compute (fewer iterations) – Assign n²/p pixels per processor in a round-robin or random manner ● On average, the number of iterations per processor should be approximately the same ● Left as an exercise Dr. Pierre Vignéras 141
  • 142. Mandelbrot set: load balancing ● Number of iterations per  pixel varies – Performance of processors may also vary ● Dynamic Load Balancing ● We want each CPU to be 100% busy ideally ● Approach: work­pool, a collection of tasks – Sometimes, processes can add new tasks into the  work­pool – Problems: ●  How to define a task to minimize the communication  overhead? ● When a task should be given to processes to increase the  computation/communication ratio? Dr. Pierre Vignéras 142
  • 143. Mandelbrot set: dynamic load balancing ● Task == pixel – Lot of communications, Good load balancing ● Task == row – Fewer communications, less load balanced  ● Send a task on request – Simple, low communication/computation ratio ● Send a task in advance – Complex to implement, good  communication/computation ratio ● Problem: Termination Detection? Dr. Pierre Vignéras 143
• 144. Mandelbrot Set: parallel code, Master Code
// Pslave denotes the process which sent the last received message.
count = row = 0;
for (k = 0; k < num_procs; k++) {
  send(row, Pk, data_tag);
  count++; row++;
}
do {
  recv(r, pixels, PANY, result_tag);
  count--;
  if (row < HEIGHT) {
    send(row, Pslave, data_tag);
    row++; count++;
  } else
    send(row, Pslave, terminator_tag);
  display(r, pixels);
} while (count > 0);
Dr. Pierre Vignéras 144
  • 145. Mandelbrot Set: parallel code Slave Code recv(row, PMASTER, ANYTAG, source_tag); while(source_tag == data_tag) { double y = startDoubleYFrom(row); double x = startDoubleX; for (int i = 0; i < WIDTH; i++) { int iter = calculate(x, y); pixels[i] = colors[iter]; x += dx; } send(row, pixels, PMASTER, result_tag); recv(row, PMASTER, ANYTAG, source_tag); }; Dr. Pierre Vignéras 145
• 146. Mandelbrot Set: parallel code Analysis
Ts ≤ Maxiter·n²
Tp = Tcomm + Tcomp
Tcomm = n(tstartup + 2tdata) + (p−1)(tstartup + 1·tdata) + n(tstartup + n·tdata)
Tcomp ≤ Maxiter·n²/p
Ts/Tp = Maxiter·n² / [Maxiter·n²/p + n²·tdata + (2n + p − 1)(tstartup + tdata)] (only valid when all pixels are black: inequality of Ts and Tp)
Tcomp/Tcomm = (Maxiter·n²/p) / [n²·tdata + (2n + p − 1)(tstartup + tdata)]
● Speedup approaches p when Maxiter is high ● Computation/Communication is in O(Maxiter)
Dr. Pierre Vignéras 146
  • 147. Partitioning ● Basis of parallel programming ● Steps: – cut the initial problem into smaller part – solve the smaller part in parallel – combine the results into one  ● Two ways – Data partitioning – Functional partitioning ● Special case: divide and conquer – subproblems are of the same form as the larger  problem Dr. Pierre Vignéras 147
• 148. Partitioning Example: numerical integration ● We want to compute: ∫ab f(x) dx, the integral of f from a to b (a sketch of a parallel implementation follows below) Dr. Pierre Vignéras 148
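A sketch of one way to partition this computation among threads, using the trapezoidal rule and an OpenMP reduction (the rule, the step count n and the function pointer f are illustrative choices, not prescribed by the slides):

#include <omp.h>

/* Trapezoidal rule on [a, b] with n sub-intervals, partitioned across threads. */
double integrate(double (*f)(double), double a, double b, int n) {
    double h = (b - a) / n;
    double sum = 0.5 * (f(a) + f(b));

    #pragma omp parallel for reduction(+: sum)
    for (int i = 1; i < n; i++)
        sum += f(a + i * h);          /* each thread accumulates its own partial sum */

    return sum * h;
}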
• 149. Quiz Outline Dr. Pierre Vignéras 149
  • 150. Readings ● Describe and compare the following  mechanisms for initiating concurrency: – Fork/Join – Asynchronous Method Invocation with futures – Cobegin – Forall – Aggregate – Life Routine (aka Active Objects) Dr. Pierre Vignéras 150