Chapter 1: General Introduction

  1. UBa/NAHPI-2020, Department of Computer Engineering. PARALLEL AND DISTRIBUTED COMPUTING. By Malobe LOTTIN Cyrille M., Network and Telecoms Engineer, PhD Student, ICT–U USA/CAMEROON. Contact phone: 243004411 / 695654002
  2. CONTENT. Part 1 - Introducing Parallel and Distributed Computing • Background review of parallel and distributed computing • Introduction to parallel and distributed computing • Some key terminology • Why parallel computing? • Parallel computing: the facts • Basic computer design: the von Neumann architecture • Classification of parallel computers (SISD, SIMD, MISD, MIMD) • Assignment 1a. Part 2 - Initiation to Parallel Programming Principles • High Performance Computing (HPC) • Speed: a need to solve complexity • Some case studies showing the need for parallel computing • Challenges of explicit parallelism • General structure of parallel programs • Introduction to Amdahl's Law • Gustafson's Law • Scalability • Fixed size versus scaled size • Assignment 1b • Conclusion
  3. BACKGROUND • Interest in PARALLEL COMPUTING dates back to the late 1950s. • Supercomputers surfaced throughout the 1960s and 1970s, introducing shared-memory multiprocessors, with multiple processors working side by side on shared data. • In the 1980s, a new kind of parallel computing was launched: - The Caltech Concurrent Computation project built a supercomputer for scientific applications from 64 Intel 8086/8087 processors. - This gave assurance that great performance could be achieved with Massively Parallel Processors (MPP). • The 1980s also introduced the concept of CLUSTERS (many computers operating as a single unit and performing the same task under the supervision of software), the main system that competed with and displaced MPPs in various applications. • In 1997, the ASCI Red supercomputer broke the barrier of one trillion floating-point operations per second (FLOPS).
  4. BACKGROUND • TODAY: Parallel computing is becoming mainstream, based on multi-core processors. Chip manufacturers are increasing overall processing performance by adding CPU cores (dual core, quad core, etc.). WHY? • Increasing performance through parallel processing appears to be far more energy-efficient than increasing microprocessor clock frequencies. • Moore's Law, which predicts steady growth in the number of transistors per chip, suggests that performance will keep improving. Consequence: we shall be going from a few cores to many. • Software development has also been very active in the evolution of parallel computing, and it must keep pace if parallel and distributed systems are to expand. • BUT parallel programs have been harder to write than sequential ones. Why? - It is difficult to synchronize the multiple tasks that such a program runs at the same time. • For MPPs and clusters, Application Programming Interfaces (APIs) converged on a single standard called MPI (Message Passing Interface). • For shared-memory multiprocessor computing, the convergence is towards two standards: pthreads and OpenMP. KEY CHALLENGE: Ensure an effective transition of the software industry to parallel programming, so that a new generation of systems can emerge and offer a more powerful user experience of digital technologies, solutions and applications.
  5. INTRODUCTION TO PARALLEL COMPUTING. WHAT IS PARALLEL COMPUTING (parallel execution)? Traditionally, software is written for serial computation. That is: - It runs on a single computer having a single Central Processing Unit (CPU); - A problem is broken into a discrete series of instructions; - Those instructions are executed one after another; - Important: only one instruction may execute at any moment in time. Serial processing compares to parallel processing as follows:
  6. INTRODUCTION TO PARALLEL COMPUTING (Cont...) • CASE OF SERIAL COMPUTATION [Figure: a function that solves a payroll problem by running its micro-programs one after another] What about executing these micro-programs simultaneously? • PARALLEL COMPUTATION - The problem is broken into discrete parts that can be solved at the same time (concurrently). - Each discrete part is further broken down into a series of instructions. - Instructions from each part execute simultaneously on different processors, under the supervision of an overall control/coordination mechanism.
  7. INTRODUCTION TO PARALLEL COMPUTING (Cont...) [Figure: discrete parts of the problem] To compute in parallel: • the problem must be broken apart into discrete pieces of work that can be solved simultaneously; • at a given time t, instructions from multiple program parts must be able to execute; • the time taken to solve the problem should be much shorter than with serial computation (a single compute resource). WHO DOES THE COMPUTING? • A single computer with multiple processors/cores • An arbitrary number of such computers connected by a network
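The payroll decomposition described on the previous slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the employee records and the names `net_pay`, `payroll_serial` and `payroll_parallel` are hypothetical, and a thread pool stands in for "multiple processors". In CPython, threads mainly illustrate the decomposition; they do not by themselves guarantee a speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (name, hours worked, hourly rate).
EMPLOYEES = [("Ada", 40, 30.0), ("Bob", 35, 20.0), ("Eve", 20, 25.0)]


def net_pay(employee):
    """One discrete piece of work: the pay of a single employee."""
    name, hours, rate = employee
    return name, hours * rate


def payroll_serial(employees):
    """Serial computation: the pieces are processed one after another."""
    return [net_pay(e) for e in employees]


def payroll_parallel(employees, workers=3):
    """Parallel computation: independent pieces are handed to a pool of workers.

    The executor plays the role of the control/coordination mechanism;
    pool.map returns the results in the same order as the input.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(net_pay, employees))


if __name__ == "__main__":
    print(payroll_serial(EMPLOYEES))
    print(payroll_parallel(EMPLOYEES))
```

Both functions return the same result because the pieces are independent, which is exactly the property a problem needs before it can be split this way.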
  8. CONCEPT AND TERMINOLOGY. There is a jargon used in the area of PARALLEL COMPUTING. Some key terms are: • Parallelism: the ability to execute parts of a computation concurrently. • Supercomputing / High Performance Computing (HPC): refers to the world's fastest and largest computers, with the ability to solve large problems. • Node: a standalone "single" computer; nodes networked together form the supercomputer. • Thread: a unit of execution consisting of a sequence of instructions, managed by either the operating system or a runtime system. • CPU / Socket / Processor / Core: basically a single execution component of a computer. Individual CPUs are subdivided into multiple cores, each of which is an individual execution unit. SOCKET refers to a CPU with multiple cores. The terminology can be confusing, but this is the center of computing operations. • Task: a program or program-like set of instructions that is executed by a processor. Parallelism involves multiple tasks running on multiple processors. • Shared Memory: an architecture in which all processors have direct (usually bus-based) access to common physical memory. • Symmetric Multi-Processor (SMP): a shared-memory hardware architecture in which multiple processors share a single address space and have equal access to all resources. • Granularity (grain size): a measure of the amount of work (computation) performed by a task. When a program is split into large tasks, each generating a large amount of computation, the parallelism is called coarse-grained. When the split generates small tasks with minimal processing requirements, it is called fine-grained. • Massively Parallel: refers to the hardware of parallel systems with many processors ("many" = hundreds of thousands). • Pleasantly Parallel (also called embarrassingly parallel): solving many similar but independent tasks simultaneously; requires very little communication. • Scalability: a proportionate increase in parallel speedup with the addition of more processors.
  9. WHY PARALLEL COMPUTING? • To address the limitations of serial computing: - Making a single processor faster is expensive. - Serial speed depends directly on how fast data can move through the hardware (the transmission bus), so the distance between processing elements must be minimized to improve speed. - Serial computing does not match reality, where many events happen at the same time. A solution is needed that is suitable for modeling and simulating complex real-world phenomena (examples: the assembly of cars or jets, traffic during rush hours). Also: o Physical limitations of hardware components o Economic reasons: more complex = more expensive o Performance limits: doubling the frequency does not double the performance o Large applications demand too much memory and time. SO, we need to: • Save time (wall-clock time) • Solve larger problems in the most efficient way • Provide concurrency (do multiple things at the same time). IT MEANS that with more parallelism: - we solve larger problems in the same time, AND - we solve a fixed-size problem in a shorter time. NOTE: most stand-alone computers have multiple functional units (L1 cache, L2 cache, branch, prefetch, decode, floating-point, graphics processing (GPU), integer, etc.), multiple execution units or cores, and multiple hardware threads. THEREFORE, all stand-alone computers today can be characterized as implementing PARALLEL COMPUTING.
  10. PARALLEL COMPUTING: THE FACTS. The future of computing cannot be conceived without parallel processing. • The Internet keeps developing and expanding, and network management schemes keep improving. Having better means available for a group of computers to cooperate in solving a computational problem will inevitably translate into a higher profile of clusters and grids in the landscape of parallel and distributed computing. Akl S.G., Nagy M. (2009) • The continued increase in computing power will give parallel programs more ability to SCALE.
  11. BASIC DESIGN ARCHITECTURE OF PARALLEL COMPUTING: THE VON NEUMANN ARCHITECTURE. From "hard-wired" computers (programmed through wiring) to the "stored-program computer", where both program instructions and data are kept in electronic memory, all computers have basically the same design, comprising four main components: Memory, Control Unit, Arithmetic Logic Unit and Input/Output. • Memory: read/write, random access memory stores both program instructions (coded data that tell the computer to do something) and data (information to be used by the program). • Control Unit: fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task. • Arithmetic Logic Unit: performs basic arithmetic operations. • Input/Output: the interface to the human operator. NOTE: Parallel computers still follow this basic design; the units are simply multiplied. The fundamental architecture remains the same.
  12. CLASSIFICATION OF PARALLEL COMPUTERS • There are various ways to classify parallel computers. • THE MOST WIDELY USED CLASSIFICATION is Flynn's Taxonomy. - It classifies computers along two independent dimensions: Instruction Stream and Data Stream. - Each dimension can take only one of two states: Single or Multiple. This gives four classes: • SISD: Single Instruction stream, Single Data stream (one CPU + one memory) • SIMD: Single Instruction stream, Multiple Data streams (one CPU + multiple memories) • MISD: Multiple Instruction streams, Single Data stream (multiple CPUs + one memory) • MIMD: Multiple Instruction streams, Multiple Data streams (multiple CPUs + multiple memories)
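Since Flynn's Taxonomy depends only on whether each of the two dimensions is Single or Multiple, it can be written as a tiny lookup. The sketch below is illustrative; the function name `flynn_class` and the use of stream counts as inputs are assumptions made for the example.

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn class for the given numbers of streams.

    Each dimension is 'S' (Single) when there is exactly one stream
    and 'M' (Multiple) when there is more than one.
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instruction = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instruction}I{data}D"


if __name__ == "__main__":
    for streams in [(1, 1), (1, 8), (8, 1), (8, 8)]:
        print(streams, flynn_class(*streams))
```

A machine with one instruction stream and eight data streams is therefore SIMD, while eight instruction streams working on eight data streams make it MIMD.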
  13. CLASSIFICATION OF PARALLEL COMPUTERS (Cont...) SISD: Single Instruction stream (only one instruction stream is acted on by the CPU during any one clock cycle), Single Data stream (only one data stream is used as input during any one clock cycle). - This is the most common type of computer produced and used. Examples: workstations, PCs, etc. - Only one CPU is present, and an instruction operates on one data item at a time. - Execution is deterministic. [Figure: a CPU connected to memory by a bus carrying instructions and data (operands)]
  14. CLASSIFICATION OF PARALLEL COMPUTERS (Cont...) SIMD: Single Instruction stream, Multiple Data streams • These computers perform parallel computing and are known as VECTOR COMPUTERS. • They were the first type of computer with a massive number of processors and computational power above the gigaFLOPS range. • The machine executes ONE instruction stream, but on MULTIPLE (different) data items, which are considered as multiple data streams. • All processing units execute the same instruction at any given clock cycle (instruction stream), while each processing unit can operate on a different data element (data stream). • This type of processing suits problems with a high degree of regularity. Example: graphics/image processing. • Processing is synchronous and deterministic. • There are two varieties of such processing: processor arrays and vector pipelines. [Figure: one CPU connected to Memory Bank 1 and Memory Bank 2, fetching one instruction and the data operands di and dii]
  15. CLASSIFICATION OF PARALLEL COMPUTERS (Cont...) In a SIMD (Single Instruction stream, Multiple Data streams) computer: • Only one CPU is present. This CPU has one instruction register but multiple Arithmetic Logic Units (ALUs), and it uses multiple data buses to fetch multiple operands simultaneously. • The memory is divided into multiple banks that are accessed independently. Operational behavior: • Only one instruction is fetched by the CPU at a time. • Some instructions, known as VECTOR INSTRUCTIONS (which fetch multiple operands simultaneously), operate on multiple data items at once. EXAMPLES OF SIMD COMPUTERS • Processor arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV • Vector pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10 • Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.
  16. CLASSIFICATION OF PARALLEL COMPUTERS (Cont...) MIMD: Multiple Instruction streams, Multiple Data streams • Multiple instructions are executed at ONCE, so there are multiple data streams. • Each instruction operates on its own data independently (multiple operations on the same data are rare). • There are two main types of MIMD computer: shared memory and message passing. Shared Memory MIMD • In shared-memory MIMD, all memory locations are accessible by all the processors (CPUs); this type of computer is a multiprocessor computer. • Most workstations and high-end PCs today are multiprocessor based. [Figure: CPU 1 ... CPU n connected by buses to a memory bank; each CPU fetches its own instruction (i, ii, ...) and data operand]
  17. CLASSIFICATION OF PARALLEL COMPUTERS (Cont...) Shared-memory MIMD computers are characterized by: • Multiple CPUs in one computer sharing the same memory. • Even though the CPUs are tightly coupled, each CPU fetches ONE INSTRUCTION at a time t, and different CPUs can fetch different instructions, thereby generating multiple instruction streams. • The memory must be designed with multiple access points (that is, organized into multiple independent memory banks) so that multiple instructions/operands can be transferred simultaneously. This structure helps avoid access conflicts between CPUs on the same memory bank. • Finally, an instruction operates on one data item at a time.
  18. CLASSIFICATION OF PARALLEL COMPUTERS (Cont...) Message Passing MIMD (CLUSTER computing) • In this architecture, multiple SISD computers are interconnected by a COMMUNICATION NETWORK. • Each CPU has its own private memory; memory is not shared among the CPUs. • Programs running on different CPUs can exchange information if required. The exchange is done through MESSAGES. • Message-passing computers are cheaper to manufacture than shared-memory MIMD computers. • A dedicated HIGH-SPEED NETWORK SWITCH is needed to interconnect the SISD computers. • Message-passing MIMD computers provide a message-passing API (Application Programming Interface) so that programmers can include statements in their programs that exchange messages. Example: the Message Passing Interface (MPI).
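The message-passing idea (private memory per CPU, information exchanged only through messages) can be imitated on one machine. The sketch below is not MPI: it assumes Python threads as stand-ins for the SISD nodes and a `queue.Queue` as a stand-in for the network switch, and the names `worker` and `run_cluster` are hypothetical.

```python
import queue
import threading


def worker(rank, private_data, outbox):
    """A simulated node: it only reads its own private data.

    The only way to share a result is to send a message (rank, partial sum).
    """
    partial = sum(private_data)
    outbox.put((rank, partial))


def run_cluster(chunks):
    """Start one simulated node per chunk and combine the received messages."""
    outbox = queue.Queue()  # stand-in for the switch / communication network
    nodes = [
        threading.Thread(target=worker, args=(rank, chunk, outbox))
        for rank, chunk in enumerate(chunks)
    ]
    for node in nodes:
        node.start()
    for node in nodes:
        node.join()
    # Exactly one message per node has been sent by now.
    messages = [outbox.get() for _ in nodes]
    return sum(partial for _rank, partial in messages)


if __name__ == "__main__":
    print(run_cluster([[1, 2, 3], [4, 5], [6]]))
```

On a real cluster the same pattern would be written with a message-passing library such as MPI, where the send and receive calls cross the network instead of a local queue.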
  19. CLASSIFICATION OF PARALLEL COMPUTERS (Cont...) Message Passing MIMD (CLUSTER computing): operational architecture. [Figure: SISD computers (SISD 1, ..., SISD n-1, SISD n, SISD n+1), each with its own CPU, bus and memory, exchanging data through a SWITCH]
  20. CLASSIFICATION OF PARALLEL COMPUTERS (Cont...) MISD: Multiple Instruction streams, Single Data stream. Assignment 1a: Research Multiple Instruction stream, Single Data stream (MISD) parallel computing. Emphasize: 1. Architecture design and modeling 2. Properties of such a design 3. Operational details 4. Practical example and specifications. Submission date: 28 October 2020. Time: 12 PM. Email: NOTE: Late submission = -50% of the assignment points, acceptable ONCE.
  21. END OF PART 1. What are we saying? VIRTUALLY all stand-alone computers today are parallel. From a hardware perspective, computers have: • Multiple functional units (floating point, integer, GPU, etc.) • Multiple execution units / cores • Multiple hardware threads
  22. CHECK YOUR PROGRESS ….. • Check Your Progress 1 1) What are various criteria for classification of parallel computers? ………………………………………………………………………………………… ………………………………………………………………………………………… 2) Define instruction and data streams. ………………………………………………………………………………………… ………………………………………………………………………………………… 3) State whether True or False for the following: If Is=Instruction Stream and Ds=Data Stream, a) SISD computers can be characterized as Is > 1 and Ds > 1 b) SIMD computers can be characterized as Is > 1 and Ds = 1 c) MISD computers can be characterized as Is = 1 and Ds = 1 d) MIMD computers can be characterized as Is > 1 and Ds > 1 4) Why do we need Parallel Computing ? ………………………………………………………………………………………… …………………………………………………………………………………………
  23. HIGH PERFORMANCE COMPUTING (HPC) • Why do we need HPC? 1. Save time and/or money: the more resources you allocate to a task, the sooner you expect it to complete, which can save money. Parallel clusters can also be built from cheap, commodity components. 2. Solve larger problems: there are many complex problems that cannot be solved on a single computer, especially given its limited memory. 3. Provide concurrency: a single compute resource can only do one thing at a time, while multiple compute resources can do many things simultaneously. 4. Use non-local resources: HPC provides the flexibility to use compute resources on a wide area network, or even the Internet, when local compute resources are scarce. SO, High Performance Computing (HPC) is the use of parallel processing to run advanced application programs efficiently, reliably and quickly.
  24. HIGH PERFORMANCE COMPUTING (HPC) (Cont...) MOORE'S LAW PREDICTION • Statement [1965]: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000." • Revision [1975]: "There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions." IT MEANS: - Prioritize minimum size and improve power by increasing the number of transistors. That is: - More transistors = more opportunities for exploiting parallelism. If one buys into Moore's Law, in its popular form that the transistor density of semiconductor chips doubles roughly every 18 months, the question still remains: how does one translate transistors into useful OPS (operations per second)? A tangible solution is to rely on parallelism, both implicit and explicit. TWO possible ways to implement parallelism: • Implicit parallelism: invisible to the programmer. Examples are pipelined execution of instructions, and source programs written sequentially in a conventional language such as C, Fortran or Pascal and translated into parallel object code by a parallelizing compiler, which detects the parallelism and assigns target machine resources. This is applied in programming shared-memory multiprocessors and requires less effort from the programmer. • Explicit parallelism: requires more effort from the programmer to develop the source program. A hardware example is Very Long Instruction Word (VLIW) processors, whose instructions are bundles of independent operations that can be issued together, reducing the burden on the hardware to detect parallelism at run time. Example: the Intel Itanium processor (2000-2017).
  25. HIGH PERFORMANCE COMPUTING (HPC) (Cont...) IS MOORE'S LAW STILL APPLICABLE? • Up to the early 2000s, transistor count was a valid indication of how much additional processing power could be packed into an area. • Moore's Law and Dennard scaling, when combined, held that more transistors could be packed into the same space and that those transistors would use less power while operating at a higher clock speed. ARGUMENTS • Because classic Dennard scaling no longer occurs at each smaller node, packing more transistors into a smaller space no longer guarantees lower total power consumption. Consequently, transistor count no longer correlates directly with higher performance. • The major limiting factor: hot spot formation. Possible way forward TODAY: - Focus on improving CPU cooling (one of the biggest barriers to higher CPU clock speeds is hot spots), either by improving the efficiency of the Thermal Interface Material (TIM), by improving lateral heat transfer within the CPU itself, or by making use of computational sprinting (short bursts of activity that temporarily exceed the sustainable thermal budget). HOWEVER: this will not improve compute performance over sustained periods of time, BUT it would speed up latency-sensitive applications such as web page loads or brief, intensive computations.
  26. HIGH PERFORMANCE COMPUTING (HPC) (Cont...) ILLUSTRATION: 42 Years of Microprocessor Trend Data. Orange: Moore's Law trend; Purple: Dennard scaling breakdown; Green & Red: immediate implications of the Dennard scaling breakdown; Blue: slowdown of the increase in single-thread (ST) performance; Black: the age of increasing parallelism. SOURCE: Karl Rupp. 42 Years of Microprocessor Trend Data. 2018/02/42-years-of-microprocessor-trend-data/, 2018. [Online].
  27. HIGH PERFORMANCE COMPUTING (HPC) (Cont...) OTHER HPC LIMITING FACTORS - Disparity between the growth of high-end processor clock rates and memory access times: clock rates grew by about 40% per year over the past decade, while DRAM access times improved by only about 10% per year over the same period. This is a significant performance bottleneck. Parallel computing addresses this issue by: • providing increased bandwidth to the memory system • offering higher aggregate caches. This explains why some of the fastest-growing applications of parallel computing exploit not their raw computational speed but their ability to pump data to memory and disk faster. Source: How Multithreading Addresses the Memory Wall - Scientific Figure on ResearchGate. Available from: [accessed 22 Oct, 2020]
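To see why the 40% versus 10% disparity matters, the two rates can be compounded over the decade. This is a back-of-the-envelope sketch that assumes both rates compound annually and stay constant; the function name `relative_gap` is hypothetical.

```python
def relative_gap(cpu_growth, dram_growth, years):
    """How much faster the processor gets relative to memory.

    Both rates are yearly fractions (0.40 means 40% per year) and are
    assumed to compound once per year for the given number of years.
    """
    return ((1 + cpu_growth) / (1 + dram_growth)) ** years


if __name__ == "__main__":
    gap = relative_gap(0.40, 0.10, 10)
    print(f"After 10 years the processor/memory speed ratio has grown {gap:.1f}x")
```

Under these assumptions the gap widens by roughly an order of magnitude in ten years, which is the bottleneck that higher aggregate bandwidth and caches are meant to relieve.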
  28. PROCESSORS EVOLUTION: CASE OF INTEL PROCESSORS From 2017, 9 generations of Processors have been developed. HIGHPERFORMANCECOMPUTING(HPC)(Cont…) Intel core i10 processor Source: Retrieve from
  29. HIGH PERFORMANCE COMPUTING (HPC) (Cont...) Differences between the Core i3, i5, i7 and i9 processor series (each row: series; number of cores; specifications; cache size):
      • Core i3; 2 physical cores; the cheapest processors. INTEL® HYPER-THREADING TECHNOLOGY adds 2 virtual cores to the 2 physical cores, so the operating system sees 4 cores; 3-4 MB
      • Core i5; 4 physical cores (some models have only 2 physical + 2 virtual cores); higher performance is achieved through the 4 physical cores and a larger cache; 4 or 8 MB
      • Core i7; 4 to 8 physical cores, with INTEL® HYPER-THREADING TECHNOLOGY; performance is increased by the virtual cores and a large cache. Processors for mobile devices can have 2 physical cores; 8 MB to 20 MB
      • Core i9; 6-8 physical cores; the i9 series was conceived as a competitor to AMD gaming processors. More cores and more speed, but not by much: since the i9 is only slightly better than the i7, there is practically no sense in developing this processor line; 10 MB to 20 MB
  30. SPEED: A NEED TO SOLVE COMPLEXITY. ACHIEVING GREATER SPEED WILL HELP IN UNDERSTANDING VARIOUS PHENOMENA APPLICABLE TO DIFFERENT DOMAINS OF LIFE. • Science - understanding matter from elementary particles to cosmology - storm forecasting and climate prediction - understanding biochemical processes of living organisms • Engineering - multi-scale simulations of metal additive manufacturing processes - understanding quantum properties of materials - understanding reaction dynamics of heterogeneous catalysts - earthquake and structural modeling - pollution modeling and remediation planning - molecular nanotechnology • Business - computational finance and high-frequency trading - information retrieval - data mining of "big data" • Defense - nuclear weapons stewardship • Computers - embedded systems increasingly rely on distributed control algorithms - network intrusion detection, cryptography, etc. - optimizing the performance of modern automobiles - networks, mail servers, search engines - visualization
  31. SOME CASE STUDIES • Parallelism finds applications in very diverse domains, for different motivating reasons. These range from improved application performance to cost considerations. CASE 1: Earthquake simulation in Japan. HOW DO WE PREVENT THIS FROM HAPPENING AGAIN? • We need computers that can pool their computational power to simulate and calculate high-level operations for better prediction of natural phenomena. SOURCE: Earthquake Research Institute, University of Tokyo, Tonankai-Tokai Earthquake Scenario. Video credit: The Earth Simulator Art Gallery, CD-ROM, March 2004
  32. SOME CASE STUDIES (Cont...) CASE 2: El Niño. El Niño is an anomalous, yet periodic, warming of the central and eastern equatorial Pacific Ocean. For reasons still not well understood, every 2-7 years this patch of ocean warms for 6 to 18 months. El Niño was strong through the Northern Hemisphere winter of 2015-16, with a transition to ENSO-neutral in May 2016. HOW DO WE EXPLAIN SUCH A PHENOMENON? Bringing together the processors of distributed computers (placed in different locations) may help provide an answer. Parallel programming must therefore be able to produce compatible software that contributes to the collection and analysis of data.
  33. CHALLENGES OF EXPLICIT PARALLELISM • Most of the parallelism concepts we shall study are explicit. The challenges of explicit parallelism are: • Algorithm development is harder - complexity of specifying and coordinating concurrent activities • Software development is much harder - lack of standardized and effective development tools and programming models - subtle program errors: race conditions • Rapid pace of change in computer system architecture - a great parallel algorithm for one machine may not be suitable for another (example: homogeneous multicore processors vs. GPUs)
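The race conditions mentioned above arise when several tasks update shared data without coordination. The sketch below shows one common remedy, a lock around the shared update, using Python threads; the function name `count_with_lock` and the counter scenario are assumptions chosen for illustration.

```python
import threading


def count_with_lock(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads.

    The read-modify-write 'counter += 1' is not atomic, so it is wrapped
    in a lock; without it, two threads could read the same old value and
    one of the updates would be lost (a race condition).
    """
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments):
            with lock:  # only one thread at a time may update the counter
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(count_with_lock())
```

The lock makes the result deterministic, but it also serializes the updates, which is the kind of synchronization cost the later slides on speedup and efficiency account for.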
  34. CHALLENGES OF PARALLELISM IN GENERAL • Parallel science applications are often very sophisticated - e.g. adaptive algorithms may require dynamic load balancing • Multilevel parallelism is difficult to manage • Extreme scale exacerbates inefficiencies - algorithmic scalability losses - serialization and load imbalance - communication or I/O bottlenecks - insufficient or inefficient parallelization • Top performance is hard to achieve even on individual nodes - contention for shared memory bandwidth - memory hierarchy utilization on multicore processors
  35. ACHIEVING HIGH PERFORMANCE ON PARALLEL SYSTEMS • IT IS NOT ALL ABOUT COMPUTATION. There is also a need to: • Improve memory latency and bandwidth, because - CPU rates are > 200x faster than memory - the speed gap is bridged using a memory hierarchy - more cores exacerbate demand • Improve interprocessor communication • Improve input/output - I/O bandwidth to disk typically needs to grow linearly with the number of processors
  36. HOW DO WE EXPRESS PARALLELISM IN A PROGRAM? • EXPLICITLY: define tasks, work decomposition, data decomposition, communication and synchronization. EXAMPLE: MPI is a library for fully explicit parallelization. "It is either all or nothing." • IMPLICITLY: define tasks only, with the rest implied; or define tasks and work decomposition, with the rest implied. EXAMPLE: OpenMP is a high-level parallel programming model, which is mostly an implicit model.
  37. QUICK VIEW ON THE STRUCTURE OF A PARALLEL PROGRAM. All parallel programs contain: - parallel sections, and - serial sections (sections where work is duplicated or no useful work is done, for example while waiting for others). We therefore need to build efficient algorithms by avoiding: - communication delay - idling - synchronization
  38. A WAY TO THINK: THE PARALLEL APPROACH. Parallel thinking is closer to us than we believe. Every day we try to do things simultaneously and avoid delaying any of them. Parallel computing thinking is not far from this. For a given task to be done by many, WE MAY ASK OURSELVES: • How many people are involved in the work? (Degree of parallelism) • What is needed to begin the work? (Initialization) • Who does what? (Work distribution) • How do we regulate access to each part of the work? (Data/IO access) • Do they need information from each other to finish their own job? (Communication) • When are they all done? (Synchronization) • What needs to be done to collate the result?
  39. EVALUATION METRICS • Developing parallel programs requires performance metrics and software tools to evaluate the performance of a parallel algorithm. • Factors to take into account: - the type of hardware used - the degree of parallelism of the problem - the type of parallel model used. The goal is to compare what is obtained (the parallel program) with what was there (the original sequential program). The analysis focuses on the number of threads and/or the number of processes used. Note: Amdahl's Law will introduce the limitations of parallel computation, and Gustafson's Law will evaluate the degree of efficiency of parallelizing a sequential algorithm.
  40. EVALUATION METRICS (Cont...) • Relation between execution time and speedup S: S(p, n) = T(1, n) / T(p, n), where T(p, n) is the execution time for a problem of size n on p processors. - Usually, S(p, n) < p - Sometimes, S(p, n) > p (super-linear speedup) • Efficiency E: E(p, n) = S(p, n) / p - Usually, E(p, n) < 1; sometimes greater than 1 • Scalability: limitations in parallel computing, in relation to n and p. SPEEDUP MEASUREMENT (S) • Speedup is a MEASURE. • It helps in appreciating the benefit of solving a problem in parallel. • It is the ratio of the time taken to solve a problem on a single processing element (Ts) to the time required to solve the same problem on p identical processing elements (Tp). • That is: S = Ts / Tp - If S = p (ideal condition): LINEAR SPEEDUP (the speed of execution grows in proportion to the number of processors) - If S < p: real speedup - If S > p: super-linear speedup
  41. EVALUATION METRICS (Cont...) EFFICIENCY (E) • Another performance metric. • It estimates how well the processors are used to solve a given task, compared with how much effort is wasted in communication and synchronization. • Ideal condition of a parallel system: S = p (the speedup equals the number of processing elements; VERY RARE). • Efficiency is given by: E = S / p = Ts / (p × Tp) - When E = 1, it is the LINEAR case - When E < 1, it is the REAL case - When E << 1, the problem is parallelizable only with low efficiency
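The two metrics S = Ts/Tp and E = S/p are simple enough to compute from measured run times. The sketch below assumes the timings are given in the same unit; the function names and the sample timings are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """S = Ts / Tp."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, p):
    """E = S / p = Ts / (p * Tp)."""
    return speedup(t_serial, t_parallel) / p


def classify(t_serial, t_parallel, p):
    """Name the case from the slides by comparing S with p."""
    s = speedup(t_serial, t_parallel)
    if s > p:
        return "super-linear"
    if s == p:
        return "linear"
    return "real (sub-linear)"


if __name__ == "__main__":
    # Hypothetical measurement: 100 s on one processor, 30 s on four.
    print(speedup(100, 30), efficiency(100, 30, 4), classify(100, 30, 4))
```

With these sample numbers the speedup is a little over 3 on 4 processors, so the efficiency is below 1: the usual "real" case.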
  42. EVALUATION METRICS (Cont...) SCALABILITY • RULE: efficiency decreases with increasing p and increases with increasing n. The fundamental questions are: 1. How effectively can the parallel algorithm use an increasing number of processors? 2. How must the amount of computation performed scale with p to keep E constant? • SCALING is simply the ability to remain efficient on a parallel machine. - It relates the computing power (how fast tasks are executed) to the number of processors. - If we increase the problem size (n) and the number of processors (p) at the same time, performance need not be lost. - It all depends on how the increments are chosen, so that efficiency is maintained or improved.
  43. APPRECIATING SPEED AND EFFICIENCY Note: Serial sections limit the parallel effectiveness. REMEMBER: If you have a lot of serial computation, you will not get good speedup, BECAUSE: - Only the absence of serial work allows perfect speedup. - Refer to Amdahl’s Law to appreciate this truth.
  44. THE AMDAHL’S LAW • How many processors can we really use? Suppose we have a legacy code in which it is only feasible to convert half of the heavily used routines to parallel. • Amdahl’s Law is widely used in designing processors and parallel algorithms. • Statement: the maximum speedup that can be achieved is limited by the serial component of the program: S = 1/(1 - p), where p is the parallelized fraction and (1 - p) is the serial fraction (the part not parallelized). Example: A program has 90% of its code parallelized and 10% that must remain serial. What is the maximum achievable speedup? Answer: with (1 - p) = 0.10, S = 1/0.10 = 10.
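The bound S = 1/(1 - p) can be read as the limit of the speedup on a finite machine. A brief worked sketch for the 90%/10% example follows; the symbol N (number of processors) is introduced here only for illustration, since this slide uses p for the parallel fraction.

```latex
S(N) = \frac{1}{(1-p) + \dfrac{p}{N}}, \qquad
S_{\max} = \lim_{N \to \infty} S(N) = \frac{1}{1-p}
         = \frac{1}{1-0.90} = \frac{1}{0.10} = 10
```

However many processors are added, the 10% serial part caps the speedup at 10.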
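  45. THE AMDAHL’S LAW (Cont…) If we run this on a parallel machine with five processors, our code now takes about 60 s. These figures are consistent with a serial run of about 100 s, half of which is parallelized: 50 + 50/5 = 60. The run time has dropped by about 40%. What if we use a thousand processors? The run time falls to just over 50 s, so we have sped up our code by only about a factor of two. Is this a big enough win?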
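The arithmetic behind these figures can be reproduced with a short sketch. The 100 s serial baseline is an assumption inferred from the numbers quoted (60 s on five processors with half of the work parallelized); the function name is hypothetical.

```python
def parallel_run_time(serial_time: float, parallel_fraction: float,
                      processors: int) -> float:
    """Run time when only `parallel_fraction` of the work is spread over processors."""
    serial_part = serial_time * (1.0 - parallel_fraction)          # cannot be sped up
    parallel_part = serial_time * parallel_fraction / processors   # shared evenly
    return serial_part + parallel_part


if __name__ == "__main__":
    baseline = 100.0  # assumed serial run time in seconds
    for procs in (1, 5, 1000):
        t = parallel_run_time(baseline, 0.5, procs)
        print(f"{procs:5d} processors: {t:7.2f} s, speedup {baseline / t:.2f}x")
```

Going from 5 to 1000 processors only moves the run time from 60 s to about 50 s, because the serial half of the code is untouched.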
  49. THE AMDAHL’S LAW (Cont…) • Amdahl’s Law expresses the most fundamental limitation on parallel speedup: if a fraction s of the execution time is serial, then speedup < 1/s. Is this realistic? Even if inherently parallel code could be executed in “no time”, inherently sequential code still needs fraction s of the time. Example: if s is 0.2, then the speedup cannot exceed 1/0.2 = 5. - Using p processors, we can find the speedup: - Total sequential execution time on a single processor is normalized to 1. - Serial code on p processors requires fraction s of the time. - Parallel code on p processors requires fraction (1 – s)/p of the time. - Hence Speedup(p) = 1 / (s + (1 – s)/p).
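Combining the two time fractions listed above gives a parallel time of s + (1 - s)/p against a normalized serial time of 1. A minimal Python sketch of the resulting speedup, with a hypothetical function name, shows how it approaches but never reaches 1/s:

```python
def amdahl_speedup(s: float, p: int) -> float:
    """Speedup on p processors when a fraction s of the run time is serial."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    # Serial time is normalized to 1; parallel time is s + (1 - s) / p.
    return 1.0 / (s + (1.0 - s) / p)


if __name__ == "__main__":
    s = 0.2  # serial fraction from the slide's example
    for p in (1, 2, 4, 16, 256, 65536):
        print(f"p = {p:6d}: speedup = {amdahl_speedup(s, p):.3f}")
    print(f"upper bound 1/s = {1.0 / s:.1f}")
```

Note that p here counts processors and s is the serial fraction, following this slide's notation.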
  51. A PRACTICAL APPLICATION OF AMDAHL’S LAW Example: 2-phase calculation * Sweep over an n-by-n grid and do some independent computation. * Sweep again and add each value to a global sum. - Sequential time for the two phases = n² + n² = 2n². - Time for the first phase on p parallel processors = n²/p. - The second phase is serialized at the global variable, so its time = n². - Speedup <= 2n² / (n²/p + n²) = 2p/(p + 1), or at most 2 for large p. Improvement: divide the second phase into two: - Accumulate p private sums during the first sweep. - Add the per-process private sums into the global sum. - Parallel time is n²/p + n²/p + p, and speedup <= 2n² / (2n²/p + p) = 2n²p / (2n² + p²).
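The two variants of this example can be compared numerically. The sketch below assumes a sequential cost of 2n² (one unit of work per grid point in each sweep) and uses the parallel times stated on the slide; the function names are hypothetical.

```python
def speedup_serialized_sum(n: int, p: int) -> float:
    """Phase 1 in parallel (n^2/p), phase 2 serialized on the global sum (n^2)."""
    sequential = 2 * n * n
    parallel = n * n / p + n * n
    return sequential / parallel


def speedup_private_sums(n: int, p: int) -> float:
    """Both sweeps in parallel (2 * n^2/p), then p additions into the global sum."""
    sequential = 2 * n * n
    parallel = 2 * n * n / p + p
    return sequential / parallel


if __name__ == "__main__":
    n = 1000
    for p in (1, 10, 100, 1000):
        print(f"p = {p:4d}: serialized sum {speedup_serialized_sum(n, p):6.2f}x, "
              f"private sums {speedup_private_sums(n, p):7.2f}x")
```

The first variant never exceeds a speedup of 2, while the second stays close to p as long as p is small compared with n.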
  52. FIXED SIZE VS SCALE SIZE • Amdahl's Law is based on a fixed workload, or fixed problem size. - It implies that the sequential part of a program does not change with machine size (i.e., the number of processors). - The parallel part is evenly distributed over P processors. • Gustafson's approach was to select or reformulate problems so as to minimize the sequential part of a program, so that a larger problem could be solved in the same amount of time. This Law therefore considers that: - While the dimension of a problem increases, its sequential part remains constant. - While the number of processors increases, the work required of each of them remains the same. Mathematically: S(P) = P - α(P - 1), where P is the number of processors, S is the speedup and α is the non-parallelized fraction of the parallel process. NOTE: This expression contrasts with Amdahl's Law, which treats the single-process execution time as a fixed quantity and compares it to a shrinking per-process parallel execution time. Amdahl assumes a fixed problem size, on the view that the overall workload of a program does not change with machine size (number of processors). Gustafson's Law therefore addresses a deficiency of Amdahl's Law, which does not consider the total amount of computing resources involved in solving a task. Gustafson suggests considering all computing resources if we intend to achieve efficient parallelism.
  53. WHAT ABOUT GUSTAFSON’S LAW Let n be a measure of the problem size. • Execution time of the program on a parallel computer, normalized to 1: a(n) + b(n) = 1, where a is the sequential fraction and b is the parallel fraction. • Execution time of the same program on a sequential computer: a(n) + p·b(n), where p is the number of processors in the parallel case. • Therefore, Speedup = (a(n) + p·b(n)) / (a(n) + b(n)) = a(n) + p·(1 - a(n)). Assume the serial fraction a(n) diminishes with problem size n; then the speedup approaches p as n approaches infinity, as desired. WHAT DO WE MEAN?
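The scaled speedup can be sketched next to the fixed-size (Amdahl) speedup to make the contrast concrete. This is an illustration under stated assumptions: the function names are hypothetical, and the same numeric fraction (0.1) is plugged into both laws even though Gustafson measures it on the parallel run and Amdahl on the serial run.

```python
def gustafson_speedup(a: float, p: int) -> float:
    """Scaled speedup a + p * (1 - a), which equals p - a * (p - 1)."""
    return a + p * (1.0 - a)


def amdahl_speedup(s: float, p: int) -> float:
    """Fixed-size speedup 1 / (s + (1 - s) / p), shown for comparison."""
    return 1.0 / (s + (1.0 - s) / p)


if __name__ == "__main__":
    fraction = 0.1  # assumed sequential fraction, used for both laws
    for p in (1, 10, 100, 1000):
        print(f"p = {p:4d}: Gustafson {gustafson_speedup(fraction, p):7.1f}x, "
              f"Amdahl {amdahl_speedup(fraction, p):5.2f}x")
```

The scaled speedup keeps growing with p because the problem grows with the machine, whereas the fixed-size speedup levels off near 1/0.1 = 10.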
  55. OPEN DEBATE?
  56. CONCLUSION • Parallel and Distributed Computing aims at satisfying the requirements of next-generation computing systems by: - Providing a platform for fast processing. - Providing a platform where the management of large and complex amounts of data no longer constitutes a major bottleneck to the understanding of complex phenomena. The domain intends to provide a far better user experience, provided that the software development field succeeds in satisfying the requirements of such a design and that the technology finally solves the issues related to the noticeable disparity between the growth in clock rates of high-end processors and memory access time.
  57. ASSIGNMENT 1b • Kindly look at the diagram below and answer the following questions: 1- How do you classify such a design: serialization or parallelism? Justify your answer. 2- Kindly explain what M1, P1 and D1 represent. 3- What are the functions of: - Processor-Memory Interconnection Network (PMIN) - Input-Output-Processor Interconnection Network (IOPIN) - Interrupt Signal Interconnection Network (ISIN) 4- Explain in your own terms the concept of a Shared Memory System / Tightly Coupled System. 5- Flynn’s classification of computers is based on the multiplicity of instruction streams and data streams observed by the CPU during program execution. Can you identify another way computers can be classified? Elaborate on the concept according to its author. Submission Date: 4 October 2020 Time: 12 PM Email: NOTE: Late submission = -50% of the assignment points. Acceptable ONCE.
  59. Further Reading • Recommended reading: "Designing and Building Parallel Programs". Ian Foster. • "Introduction to Parallel Computing". Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. • "Overview of Recent Supercomputers". A.J. van der Steen, Jack Dongarra. OverviewRecentSupercomputers.2008.pdf
  60. REFERENCES 1. K. Hwang, Z. Xu, “Scalable Parallel Computing”, Boston: WCB/McGraw-Hill, c1998. 2. I. Foster, “Designing and Building Parallel Programs”, Reading, Mass: Addison-Wesley, c1995. 3. D. J. Evans, “Parallel SOR Iterative Methods”, Parallel Computing, Vol. 1, pp. 3-8, 1984. 4. L. Adams, “Reordering Computations for Parallel Execution”, Commun. Appl. Numer. Methods, Vol. 2, pp. 263-271, 1985. 5. K. P. Wang and J. C. Bruch, Jr., “A SOR Iterative Algorithm for the Finite Difference and Finite Element Methods that is Efficient and Parallelizable”, Advances in Engineering Software, 21(1), pp. 37-48, 1994. 6. Stefan Boeriu, Kai-Ping Wang and John C. Bruch Jr., Lecture Notes on Parallel Computation, Office of Information Technology and Department of Mechanical and Environmental Engineering, University of California, Santa Barbara, CA. 7. John Mellor-Crummey, COMP 422/534 Parallel Computing: An Introduction, Department of Computer Science, Rice University, January 2020. 8. Roshan Karunarathna, Introduction to Parallel Computing, 2020. 9. Safwat Hamad, Distributed Computing, Lecture 1 - Introduction, FCIS Science Department - FCIS SC, 2020.

Editor's notes

  1. Moore’s Law: the number of transistors on a microchip doubles about every two years, so we can expect the speed and capability of our computers to increase every couple of years.