This chapter provides a background review of parallel and distributed computing, with a focus on the concepts of SISD, SIMD, MISD and MIMD.
It also gives an understanding of the notion of HPC (High-Performance Computing). A survey is made using some case studies to show why parallelism is needed. The chapter discusses Amdahl's Law and its limitations; Gustafson's Law is also discussed.
2. CONTENT
Part 1- Introducing Parallel and Distributed Computing
• Background Review of Parallel and Distributed Computing
• INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING
• Some key terminology
• Why parallel Computing?
• Parallel Computing: the Facts
• Basic Design Computer Architecture: the von Neumann Architecture
• Classification of Parallel Computers (SISD,SIMD,MISD,MIMD)
• Assignment 1a
Part 2- Initiation to Parallel Programming Principles
• High Performance Computing (HPC)
• Speed: a need to solve Complexity
• Some Case Studies Showing the need of Parallel Computing
• Challenge of explicit Parallelism
• General Structure of Parallel Programs
• Introduction to Amdahl's LAW
• GUSTAFSON's LAW
• SCALABILITY
• Fixed Size Versus Scale Size
• Assignment 1b
• Conclusion
3. BACKGROUND
• Interest in PARALLEL COMPUTING dates back to the late
1950s.
• Supercomputers surfaced throughout the 1960s and 1970s,
introducing shared-memory multiprocessors, with
multiple processors working side by side on shared data.
• By the 1980s, a new kind of parallel computing was launched:
- Introduction of a supercomputer for scientific applications built from 64
Intel 8086/8087 processors, designed by the Caltech Concurrent Computation project.
- Assurance that great performance could be achieved with Massively
Parallel Processors (MPP).
• By 1997, the ASCI Red supercomputer broke the barrier
of one trillion floating-point operations per second (FLOPS).
• The 1980s also introduced the concept of CLUSTERS (many computers
operating as a single unit, performing the same task under the
supervision of software), the main systems that competed with and
displaced MPPs from various applications.
4. BACKGROUND
• TODAY:
Parallel computing is becoming mainstream based on multi-core
processors
Chips manufacturers are increasing overall processing performance by
adding additional CPU cores (Dual Core, Quad Core, etc.).
WHY?
• Increasing performance through parallel processing appears to be far more
energy-efficient than increasing microprocessor clock frequencies.
• Better performance is predicted by Moore's LAW, which anticipates the continued
ability of transistor growth to empower systems.
Consequence: we shall be going from a few cores to many….
• Besides, software development has been very active in the evolution of parallel
computing. The field must keep pace if parallel and distributed systems are to expand!
• BUT, parallel programs have been harder to write than sequential ones.
Why? - Difficulties in synchronizing the multiple tasks run by those
programs at the same time.
• For MPPs and clusters: Application Programming Interfaces (APIs) converged to a single
standard called MPI (Message Passing Interface) that handles parallel computing
architectures.
• For shared-memory multiprocessor computing, convergence is towards two
standards: pthreads and OpenMP.
KEY CHALLENGE: Ensure effective transition of the software industry to parallel
programming so that a new generation of systems can take place and offer a more
powerful user-experience of Digital technologies, solutions and applications.
5. INTRODUCTION TO PARALLEL COMPUTING
WHAT IS PARALLEL COMPUTING (Parallel Execution)?
Traditionally, software is written to operate following serial
computation. That is:
– RUN on a single computer having a single
Central Processing Unit (CPU);
– Break any given problem into a discrete series of
instructions;
– Those instructions are then executed one after another;
– Important: only one instruction may execute at any moment in
time.
Generally, serial processing compared to parallel processing is as
follows:
6. CASE OF SERIAL COMPUTATION
What about executing these micro-programs simultaneously?
PARALLEL COMPUTATION
• Here, the problem is broken into discrete parts that can be solved at the same
time (concurrently)
• The discrete parts are then broken down into a series of instructions
• These instructions from each part execute simultaneously on different
processors, under the supervision of an overall control/coordination
mechanism
INTRODUCTION TO PARALLEL COMPUTING (Cont...)
[Figure: the payroll problem solved serially, running its micro-programs one after another]
7. [Figure: the same payroll problem broken into discrete parts, solved in parallel]
INTRODUCTION TO PARALLEL COMPUTING (Cont...)
It means that to compute in parallel:
the problem must be broken apart into discrete pieces of work that can be solved
simultaneously;
at a given time t, instructions from multiple parts of the program should be able to execute;
the time taken to solve the problem should be far shorter than with serial
computation (a single compute resource).
WHO DOES THE COMPUTING?
• A single computer with multiple processors/cores
• Can also be an arbitrary number of such computers connected by a network
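As a sketch of these three requirements, the following hypothetical Python fragment (the function names are our own, not from the slides) breaks a summation problem into discrete parts and solves them concurrently. Because of Python's global interpreter lock, threads here illustrate the structure of the decomposition rather than true CPU-level parallelism; a process pool or MPI would be used for real CPU-bound speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """Solve one discrete part of the problem: sum one sub-range."""
    lo, hi = bounds
    return sum(range(lo, hi))

def parallel_sum(n, workers=4):
    """Break the problem into discrete parts, solve them concurrently,
    then combine the partial results (the coordination mechanism)."""
    step = max(1, n // workers)
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

For example, `parallel_sum(1_000_000)` splits the range into four chunks, hands each to a worker, and combines the partial sums into the final answer.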
8. There is a “jargon” used in the area of PARALLEL COMPUTING. Some key terms
are:
PARALLELISM: the ability to execute parts of a computation concurrently.
Supercomputing / High Performance Computing (HPC): refers to the world's fastest and largest computers, with the
ability to solve large problems.
Node: a standalone «single» computer; many nodes networked together form the supercomputer.
Thread: a unit of execution consisting of a sequence of instructions that is managed by either the operating
system or a runtime system.
CPU / Socket / Processor / Core: basically a singular execution component of a computer. Individual CPUs are
subdivided into multiple cores, each constituting an individual execution unit; SOCKET refers to a CPU package with
multiple cores. The terminology can be confusing, but this is the center of computing operations.
Task: a program or program-like set of instructions executed by a processor. Parallelism involves multiple
tasks running on multiple processors.
Shared Memory: an architecture where all computer processors have direct (usually bus-based) access to common
physical memory.
Symmetric Multi-Processor (SMP): a shared-memory hardware architecture where multiple processors share a
single address space and have equal access to all resources.
Granularity (grain size): for a given task, a measure of the amount of work (or
computation) performed by that task. When a program is split into large tasks that generate a large
amount of computation per processor, this is called coarse-grained parallelism; when the splitting
generates small tasks with minimal processing each, it is called FINE-grained.
Massively Parallel: describes hardware of parallel systems with many processors (“many” = hundreds of
thousands).
Pleasantly Parallel: solving many similar but independent tasks simultaneously; requires very little
communication.
Scalability: a proportionate increase in parallel speedup with the addition of more processors.
CONCEPT AND TERMINOLOGY
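To make the “thread” and “task” terminology concrete, here is a minimal Python sketch (identifiers are illustrative only): two threads, each a unit of execution running its own sequence of instructions, carry out two independent tasks concurrently and then synchronize.

```python
import threading

results = {}

def worker(name, data):
    # A thread is a unit of execution: a sequence of instructions
    # scheduled by the operating system / runtime system.
    results[name] = sum(data)

t1 = threading.Thread(target=worker, args=("task-1", [1, 2, 3]))
t2 = threading.Thread(target=worker, args=("task-2", [4, 5, 6]))
t1.start(); t2.start()   # both tasks now run concurrently
t1.join(); t2.join()     # synchronization point: wait for completion
```

Writing to distinct keys keeps the two tasks independent, which is exactly what makes them “pleasantly parallel” in the sense defined above.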
9. • To address the limitations of serial computing:
It is expensive to make a single processor ever faster.
Serial speed is directly dependent upon how fast data can move through hardware
(the transmission bus); the distance between processing elements must be minimized to improve speed.
Serial computing does not satisfy the constraints of reality, where events often happen concurrently. There is a need for
a solution suitable for modeling and simulating complex real-world phenomena (examples:
modeling the assembly of cars or jets, or traffic during rush hour).
Also
o Physical limitations of hardware components
o Economic reasons – more complex = more expensive
o Performance limits – doubling the frequency <> doubling the performance
o Large applications – demand too much memory & time
SO….
We need to:
Save time - wall-clock time
Solve larger problems in the most efficient way
Provide concurrency (do multiple
things at the same time)
IT MEANS…
with more parallelism,
– We solve larger problems in the same time
– AND solve a fixed-size problem in a shorter time
NOTE: if we agree that most stand-alone computers have multiple functional units (L1 cache, L2
cache, branch, prefetch, decode, floating-point, graphics processing (GPU), integer, etc.),
multiple execution units or cores, and multiple hardware threads, THEN ALL STAND-ALONE
COMPUTERS today can be characterized as implementing PARALLEL COMPUTING.
WHY PARALLEL COMPUTING ?
10. Future of computing cannot be conceived
without parallel processing .
Continuous development and expansion of the Internet
and the improvement in network management schemes.
Having better means available for a group of computers
to cooperate in solving a computational problem will
inevitably translate into a higher profile of clusters and
grids in the landscape of parallel and distributed
computing. Akl S.G., Nagy M. (2009)
The continued boost in computer power will provide
more SCALING ability to parallel computing programs.
PARALLEL COMPUTING: THE FACTS
11. Basic Design Architecture of Parallel Computing:
the von Neumann Architecture
From "hard-wired" computers (programmed through wiring) to
"stored-program" computers, where both program instructions and data are kept in
electronic memory, all computers basically share the same design, comprising:
• Four main components: Memory, Control Unit, Arithmetic Logic Unit and Input /
Output
Read/write random-access memory is used to store both program instructions (coded
data which tell the computer to do something) and data (information to be used by the
program).
The Control Unit fetches instructions/data from memory, decodes the instructions and
then sequentially coordinates operations to accomplish the programmed task.
The Arithmetic Logic Unit performs basic arithmetic operations.
Input / Output is the interface to the human operator.
NOTE: Parallel computers still follow this basic design; the units are simply multiplied. The
basic, fundamental architecture remains the same.
12. • VARIOUS ways to classify parallel computers
• THE MOST WIDELY USED CLASSIFICATION: Flynn's Taxonomy.
- Classification here is made along two independent dimensions:
Instruction Stream and Data Stream.
- Each dimension can take only one of two possible states:
Single or Multiple instruction(s)/data stream(s).
CLASSIFICATION OF PARALLEL COMPUTERS
Flynn's taxonomy (instruction streams x data streams):
SISD: Single Instruction Stream, Single Data Stream ((one CPU) + memory)
SIMD: Single Instruction Stream, Multiple Data Streams ((one CPU) + multiple memories)
MISD: Multiple Instruction Streams, Single Data Stream ((multiple CPUs) + memory)
MIMD: Multiple Instruction Streams, Multiple Data Streams ((multiple CPUs) + multiple memories)
13. The SISD: Single Instruction (Only one instruction stream is being acted on by
the CPU during any one clock cycle) stream, Single Data (Only one data stream is
being used as input during any one clock cycle) stream.
- This is the most popular (Common) Computer produced and
used. Example: Workstations, PCs, etc.
- Here, only one CPU is present and instruction operates on 1
data item at a time.
- Execution is Deterministic
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
[Diagram: a single CPU connected to MEMORY over a bus, carrying one instruction stream and one data (operand) stream]
14. SIMD: Single Instruction stream, Multiple Data Stream
• Computers here perform parallel computing and are known as VECTOR COMPUTERS
• This was the first type of computer to field systems with a massive number of processors, with
computational power above the gigaFLOP range.
• Machines execute ONE instruction stream but on MULTIPLE (different) data items,
considered as multiple data streams.
• It means all processing units execute the same instruction at any given clock cycle
(instruction stream), with the flexibility that each processing unit can operate on a
different data element (data stream)
• This type of processing is suitable for problems requiring a high degree of regularity.
Example: graphics/image processing
• The processing method is characterized as synchronous and deterministic
• There are two varieties of such processing: processor arrays and vector pipelines
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
[Diagram: one CPU fetching a single instruction stream, with Memory Bank 1 and Memory Bank 2 supplying distinct data operands (di, dii) simultaneously]
15. With the Single Instruction stream, Multiple Data Stream (SIMD),
• Only one CPU is present in the computer. This CPU has:
- One instruction register but multiple Arithmetic Logic Units (ALUs), and it uses multiple
data buses to fetch multiple operands simultaneously.
- The memory is divided into multiple banks that are accessed independently; there are
also multiple data buses so the CPU can access data simultaneously.
Operational behaviour
only 1 instruction is fetched by the CPU at a time
Some instructions, known as VECTOR INSTRUCTIONS (which fetch multiple
operands simultaneously), operate on multiple data items at once.
EXAMPLES OF SIMD COMPUTERS
• Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
• Most modern computer integrating graphics processor units (GPUs) employ SIMD instructions and execution units.
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
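As a conceptual sketch only (plain Python, not actual hardware vector code), SIMD execution can be pictured as ONE instruction applied in lockstep across MULTIPLE data elements; on a real SIMD machine this loop would be a single vector instruction executed by the processing units.

```python
def simd_add(vec_a, vec_b):
    """Conceptual SIMD model: one instruction (addition) is applied,
    in lockstep, to multiple data elements at once."""
    # each position i plays the role of a processing unit operating
    # on its own element of the two data streams
    return [x + y for x, y in zip(vec_a, vec_b)]
```

The regularity requirement mentioned above is visible here: every element undergoes exactly the same operation, which is why image processing maps so well onto SIMD hardware.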
16. MIMD: Multiple Instruction Streams, Multiple Data Streams
• Here, multiple instructions are executed at ONCE, so there are multiple data
streams.
• Each instruction operates on its own data independently (multiple
operations on the same data are rare!!)
• Two main types of MIMD computers: Shared Memory and Message Passing
Shared Memory MIMD
In shared-memory MIMD, memory locations are all accessible by all the
processors (CPUs): this type of computer is a multiprocessor computer.
Most workstations and high-end PCs are multiprocessor-based today.
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
[Diagram: CPUs 1 … n connected by buses to a shared memory bank, each fetching its own instruction and operand (instruction i / operand i, instruction ii / operand k) simultaneously]
17. Shared-memory MIMD computers are characterized
by:
Multiple CPUs in a computer sharing the same memory
Even though the CPUs are tightly coupled, each CPU fetches ONE
INSTRUCTION at a time t, and different CPUs can fetch
different instructions, thereby generating multiple instruction
streams
The memory must be designed with multiple access points
(that is, organized into multiple independent
memory banks) so that multiple instructions/operands can be
transferred simultaneously. This structure helps avoid
conflicting access by CPUs to the same memory bank.
Finally, an instruction operates on one data item at a time!
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
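A minimal Python sketch of the shared-memory idea (names are illustrative, not from the slides): several threads stand in for CPUs sharing one memory location, and a lock plays the role of the conflict-avoidance mechanism described above.

```python
import threading

counter = 0                      # memory location shared by all "CPUs"
lock = threading.Lock()          # avoids conflicting access to shared data

def cpu_work(increments):
    global counter
    for _ in range(increments):
        with lock:               # one CPU at a time touches the shared bank
            counter += 1

# four "CPUs", each running its own instruction stream on shared memory
threads = [threading.Thread(target=cpu_work, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

Without the lock, the read-modify-write of `counter` could interleave and lose updates, which is exactly the access conflict the banked-memory design tries to avoid in hardware.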
18. Message Passing MIMD (CLUSTER computing)
• In this architecture, multiple SISD computers are interconnected by
a COMMUNICATION NETWORK
• Each CPU has its own private memory; there is no
sharing of memory among the various CPUs
• It is still possible for programs running on different CPUs to exchange
information if required. In that case, the exchange is done through
MESSAGES.
• Message-passing computers are cheaper to manufacture than
shared-memory MIMD
• A dedicated HIGH-SPEED NETWORK SWITCH is needed to
interconnect the SISD computers
• MIMD message-passing computers always provide a Message
Passing API (Application Programming Interface) so that
programmers can include, in their programs, statements
that exchange messages. Example: the Message Passing
Interface (MPI)
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
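The message-passing idea can be sketched in plain Python as a stand-in for MPI (all names here are our own): each "node" keeps private data that no other node can see, and the nodes communicate only by sending and receiving messages over explicit channels.

```python
import threading
import queue

results = {}

def node(rank, inbox, outbox):
    """One 'SISD node': private memory, communicates only via messages."""
    local = rank * 10                 # private data, invisible to other nodes
    outbox.put(("value", local))      # send a message (think MPI_Send)
    tag, received = inbox.get()       # receive a message (think MPI_Recv)
    results[rank] = local + received

chan_0_to_1 = queue.Queue()           # message channel: node 0 -> node 1
chan_1_to_0 = queue.Queue()           # message channel: node 1 -> node 0
t0 = threading.Thread(target=node, args=(0, chan_1_to_0, chan_0_to_1))
t1 = threading.Thread(target=node, args=(1, chan_0_to_1, chan_1_to_0))
t0.start(); t1.start(); t0.join(); t1.join()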
19. Message Passing MIMD (CLUSTER computing): operational architecture
[Diagram: SISD nodes 1 … n+1, each a CPU with its own memory and bus carrying its own instruction and data (operand) streams, interconnected through a SWITCH over which all data exchange takes place]
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
20. MISD: Multiple Instruction Streams, Single Data Stream
Assignment 1a:
Research Multiple Instruction stream, Single Data
stream (MISD) parallel computing.
Emphasize:
1. Architecture design and modeling
2. Properties of such a design
3. Operational details
4. Practical examples and specifications.
Submission date: 28 October 2020, Time: 12 PM
Email: malobecyrille.marcel@ictuniversity.org
NOTE: Late submission = -50% of the assignment points.
Acceptable ONCE.
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
21. END OF PART 1
What are we saying?
Looking at it virtually, all stand-alone computers today are parallel.
From a hardware perspective, computers have:
Multiple functional units (floating point, integer, GPU, etc.)
Multiple execution units / cores
Multiple hardware threads
22. CHECK YOUR PROGRESS …..
• Check Your Progress 1
1) What are various criteria for classification of parallel computers?
…………………………………………………………………………………………
…………………………………………………………………………………………
2) Define instruction and data streams.
…………………………………………………………………………………………
…………………………………………………………………………………………
3) State whether True or False for the following:
If Is=Instruction Stream and Ds=Data Stream,
a) SISD computers can be characterized as Is > 1 and Ds > 1
b) SIMD computers can be characterized as Is > 1 and Ds = 1
c) MISD computers can be characterized as Is = 1 and Ds = 1
d) MIMD computers can be characterized as Is > 1 and Ds > 1
4) Why do we need Parallel Computing ?
…………………………………………………………………………………………
…………………………………………………………………………………………
23. • Why do we need HPC ?
1. Save time and/or money: the more resources you allocate to a given task, the
faster you expect it to be completed, saving money. Consider that parallel
clusters can be built from cheap, commodity components.
2. Solve larger problems: there are many complex problems that can't be solved
on a single computer, especially given its limited memory.
3. Provide concurrency: A single compute resource can only do one thing at a time.
Multiple computing resources can be doing many things simultaneously.
4. Use of non-local resources: HPC will provide the flexibility to use compute
resources on a wide area network, or even the Internet when local compute
resources are scarce.
SO….
High-performance computing (HPC) is the use of parallel processing
for running advanced application programs efficiently, reliably and quickly.
HIGH PERFORMANCE COMPUTING (HPC)
CLOSE TO REALITY
24. THE MOORE’S LAW PREDICTION
Statement [1965]:
`The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly
over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase
is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.
That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.''
REVISION in [1975]: `There is no room left to squeeze anything out by being clever. Going forward from here we have to
depend on the two size factors - bigger dies and finer dimensions.''
IT MEANS:
- Prioritize minimum size and improve power by increasing transistor counts. That is….
- More transistors = ↑ opportunities for exploiting parallelism
If one is to buy into Moore's law (the transistor density of semiconductor chips doubling roughly every 18
months), the question still remains:
• how does one translate transistors into useful OPS (operations per second)?
A tangible solution is to rely on parallelism, both implicit and explicit.
TWO possible ways to implement parallelism:
Implicit parallelism: invisible to the programmer
– pipelined execution of instructions, using a conventional language such as C, Fortran or Pascal to write the source program
– the source program is sequential and is translated into parallel object code by a parallelizing compiler that detects parallelism and
assigns target machine resources. This applies to programming shared multiprocessors and requires less effort from the
programmer.
Explicit parallelism
– very long instruction words (VLIW) require more effort by the programmer to develop a source program
– made of bundles of independent instructions that can be issued together, reducing the burden on the compiler to detect
parallelism. Example: the Intel Itanium processor (2000-2017)
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
25. IS MOORE’S LAW STILL APPLICABLE ?
• Up to early 2000s, transistor count was a valid indication of how much additional processing
power could be packed into an area.
• Moore’s law and Dennard scaling, when combined, held that more transistors could be
packed into the same space and that those transistors would use less power while operating
at a higher clock speed.
ARGUMENTS
Because classic Dennard scaling no longer occurs at each smaller node, packing more
transistors into a smaller space no longer guarantees lower total power
consumption; consequently, transistor count no longer correlates directly with higher performance.
• The major limiting factor: hot-spot formation
Possible ways forward TODAY:
- Focus on improving CPU cooling (one of the biggest barriers to
higher CPU clock speeds is hot spots), by improving the
efficiency of the thermal interface material (TIM), improving lateral
heat transfer within the CPU itself, or making use of computational sprinting
to increase thermal dissipation.
HOWEVER: this won't improve compute performance over
sustained periods of time,
BUT it would speed up latency-sensitive applications like web page
loads or brief, intensive computations.
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
26. 42 Years of Microprocessor Trend Data.
Orange: Moore's Law trend;
Purple: Dennard scaling breakdown;
Green & Red: immediate implications of the Dennard scaling breakdown;
Blue: slowdown of single-thread performance growth;
Black: the age of increasing parallelism
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
SOURCE: Karl Rupp. 42 Years of Microprocessor Trend Data. https://www.karlrupp.net/ 2018/02/42-years-of-microprocessor-trend-data/, 2018. [Online].
ILLUSTRATION
27. OTHER HPC LIMITING FACTORS
- Disparity between the clock-rate growth of high-end processors and memory access time:
clock rates grew ~40%/year over the past decade while DRAM speed grew ~10%/year over the same
period. This is a significant performance bottleneck.
This issue is addressed by parallel computing by:
• providing increased bandwidth to the memory system
• offering higher aggregate caches.
This explains why some of the fastest-growing applications of parallel computing exploit not
its raw computational speed but rather its ability to pump data to memory and disk faster.
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
Source: How Multithreading Addresses the Memory Wall - Scientific Figure on ResearchGate. Available from:
https://www.researchgate.net/figure/The-growing-CPU-DRAM-speed-gap-expressed-as-relative-speed-over-time-log-scale-The_fig1_37616249
[accessed 22 Oct, 2020]
28. PROCESSORS EVOLUTION: CASE OF INTEL PROCESSORS
From 2017, 9 generations of Processors have been developed.
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
Intel Core i10 processor
Source: Retrieved from https://strackit.com/tech-news/intel-core-i10-processor/
29. • Differences between the Core i3, i5, i7 and i9 processor series:

Series | Number of cores | Specifications | Cache size
Core i3 | 2 physical cores | Cheapest processors; INTEL® HYPER-THREADING TECHNOLOGY adds 2 virtual cores to the 2 physical ones, so the operating system determines that the processor has 4 cores | 3-4 MB
Core i5 | 4 physical cores (some models have only 2 physical cores + 2 virtual) | Higher performance is achieved by the presence of 4 physical cores and an increased volume of cache memory | 4 or 8 MB
Core i7 | 4 to 8 physical cores; uses INTEL® HYPER-THREADING TECHNOLOGY | Performance increased by virtual cores and a large volume of cache memory; processors for mobile devices can have 2 physical cores | 8 MB to 20 MB
Core i9 | 6-8 physical cores | The i9 series was conceived as a competitor to AMD gaming processors; more cores and more speed, but not by much: since i9 is only slightly better than i7, there is practically no sense in developing this processor line further | 10-20 MB

HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
30. ACHIEVING GREATER SPEED WILL HELP IN UNDERSTANDING VARIOUS
PHENOMENA APPLICABLE TO DIFFERENT DOMAINS OF LIFE.
• Science
—understanding matter from elementary particles to cosmology
—storm forecasting and climate prediction
—understanding biochemical processes of living organisms
• Engineering
—multi-scale simulations of metal additive manufacturing processes
—understanding quantum properties of materials
—understanding reaction dynamics of heterogeneous catalysts
—earthquake and structural modeling
—pollution modeling and remediation planning
—molecular nanotechnology
• Business
—computational finance - high frequency trading
—information retrieval
—data mining “big data”
• Defense
—nuclear weapons stewardship
• Computers:
— Embedded systems increasingly rely on
distributed control algorithms.
— Network intrusion detection, cryptography, etc.
— Optimizing performance of modern automobile.
— Networks, mail-servers, search engines…
— Visualization
SPEED: A NEED TO SOLVE COMPLEXITY
31. • Parallelism finds applications in very diverse domains for different
motivating reasons. These range from improved application
performance to cost considerations.
CASE 1: Earthquake Simulation in Japan
HOW DO WE PREVENT SUCH DISASTERS FROM HAPPENING AGAIN?
• We need computers with the ability to pool computational power in order to
simulate and calculate high-level operations for better prediction of natural phenomena
SOME CASE STUDIES
SOURCE: Earthquake Research Institute, University of Tokyo Tonankai-Tokai Earthquake Scenario. Video Credit: The Earth Simulator Art Gallery, CD-ROM, March 2004
32. SOME CASE STUDIES (Cont…)
CASE 2: El Niño
El Niño is an anomalous, yet periodic, warming of the central and eastern equatorial Pacific Ocean.
For reasons still not well understood, every 2-7 years, this patch of ocean warms for six to 18 months
El Niño was strong through the Northern Hemisphere winter 2015-16, with a transition to ENSO-
neutral in May 2016.
HOW DO WE EXPLAIN SUCH PHENOMENON ?
Perhaps bringing the processors of distributed computers
(placed in different locations) into collaboration can help provide an answer. Parallel programming must
therefore be able to produce compatible software that contributes to the collection
and analysis of data.
33. • Most of the parallelism concepts we shall study take the
explicit orientation.
Challenges related to explicit parallelism:
Algorithm development is harder
—complexity of specifying and coordinating concurrent
activities
Software development is much harder
—lack of standardized & effective development tools and
programming models
—subtle program errors: race conditions
Rapid pace of change in computer system architecture
—a great parallel algorithm for one machine may not be
suitable for another
– example: homogeneous multicore processors vs. GPUs
Challenges of EXPLICIT PARALLELISM
34. Parallel science applications are often very sophisticated
—e.g. adaptive algorithms may require dynamic load
balancing
• Multilevel parallelism is difficult to manage
• Extreme scale exacerbates inefficiencies
—algorithmic scalability losses
—serialization and load imbalance
—communication or I/O bottlenecks
—insufficient or inefficient parallelization
• Hard to achieve top performance even on individual
nodes
—contention for shared memory bandwidth
—memory hierarchy utilization on multicore processors
Challenges of PARALLELISM IN GENERAL
35. • IT IS NOT ALL ABOUT COMPUTATION.
There is also a need to:
Improve memory latency and bandwidth, because
— CPU rates are > 200x faster than memory
— we bridge the speed gap using the memory hierarchy
— more cores exacerbate the demand
Improve interprocessor communication
Improve input/output performance
— I/O bandwidth to disk typically needs to grow linearly
with the number of processors
ACHIEVING HIGH PERFORMANCE ON PARALLEL SYSTEMS
36. EXPLICITLY
Define tasks, work decomposition, data decomposition,
communication and synchronization.
EXAMPLE: MPI is a library for fully
explicit parallelization.
“It is either all or nothing.”
IMPLICITLY
Define tasks only, with the rest implied; or define tasks and
work decomposition, with the rest implied.
EXAMPLE: OpenMP is a high-level
parallel programming
model, which is mostly an
implicit model.
HOW DO WE EXPRESS PARALLELISM IN A PROGRAM?
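The contrast between the two styles can be sketched in Python (illustrative names only; MPI and OpenMP themselves are C/Fortran-oriented). In the explicit style the programmer decomposes the work and synchronizes by hand; in the implicit style only the per-item task is defined and the runtime handles decomposition and scheduling.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))

# EXPLICIT style (in spirit, like MPI or raw threads): the programmer
# defines the tasks, the work decomposition, and the synchronization.
partials, lock = [], threading.Lock()

def explicit_task(chunk):
    local = sum(x * x for x in chunk)
    with lock:                        # explicit synchronization
        partials.append(local)

threads = [threading.Thread(target=explicit_task, args=(data[i::2],))
           for i in range(2)]         # explicit work decomposition
for t in threads: t.start()
for t in threads: t.join()
explicit_result = sum(partials)

# IMPLICIT style (in spirit, like OpenMP): define the task only; the
# runtime decomposes, schedules, and synchronizes.
with ThreadPoolExecutor() as pool:
    implicit_result = sum(pool.map(lambda x: x * x, data))
```

Both compute the same sum of squares; the difference is who carries the burden of decomposition and synchronization, which is exactly the trade-off the slide describes.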
37. All parallel programs contain:
- Parallel sections
And,
- Serial sections (serial sections are where work is being duplicated or no useful work is being done,
e.g. waiting for others)
We therefore need to build efficient algorithms by avoiding:
- Communication delay
- Idling
- Synchronization overhead
QUICK VIEW ON THE STRUCTURE OF A PARALLEL
PROGRAM
38. Generally, parallel thinking is closer to us than we believe: daily, we try to do things
simultaneously, avoiding delay in any of them. Parallel computing thinking is not far from this….
For a given task to be done by many, WE MAY ASK OURSELVES:
How many people are involved in the work?
(Degree of parallelism)
What is needed to begin the work?
(Initialization)
Who does what?
(Work distribution)
How do we regulate access to parts of the work?
(Data/IO access)
Do they need information from each other to finish their
own jobs?
(Communication)
When are they all done?
(Synchronization)
What needs to be done to collate the results?
A WAY TO THINK: THE PARALLEL APPROACH
39. • The development of parallel programming imposes the need for
performance metrics and software tools in order to evaluate the
performance of parallel algorithms.
• Some factors can help in achieving this goal:
- the type of hardware used
- the degree of parallelism of the problem
- the type of parallel model used
The goal is to compare what is obtained (the parallel program) with
what was there (the original sequential version).
Analysis focuses on the number of threads and/or the number of
processes used.
Note:
Amdahl's Law will introduce the limitations related to parallel
computation.
And,
Gustafson's Law will evaluate the degree of efficiency of the
parallelization of a sequential algorithm.
EVALUATION METRICS
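Gustafson's Law is only named in these slides; for reference, its standard statement (scaled speedup for a serial fraction s of the time on the parallel machine) can be sketched as follows. The function name is our own.

```python
def gustafson_speedup(s, p):
    """Gustafson's Law (standard form): scaled speedup S = p - s*(p - 1),
    where s is the serial fraction of the parallel execution time
    and p is the number of processors."""
    return p - s * (p - 1)
```

With no serial part (s = 0) the scaled speedup equals p, and with s = 1 it is 1; unlike Amdahl's fixed-size view, the achievable speedup keeps growing with p because the problem size is assumed to scale with the machine.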
40. Relation between execution time T and speedup S:
S(p, n) = T(1, n) / T(p, n)
- Usually, S(p, n) < p
- Sometimes S(p, n) > p (superlinear speedup)
Efficiency, E:
E(p, n) = S(p, n) / p
- Usually, E(p, n) < 1; sometimes greater than 1
Scalability: limitations in parallel computing, in relation to n and p.
EVALUATION METRICS (Cont…)
SPEEDUP measurement (S)
• Speedup is a MEASURE
• It helps in appreciating the benefit of solving a problem in parallel
• It is given by the ratio of the time taken to solve a problem on a single
processing element (Ts) to the time required to solve the same problem
on p identical processing elements (Tp).
• That is: S = Ts/Tp
- IF S = p (ideal condition): LINEAR SPEEDUP (speed of execution scales with the number of processors)
- IF S < p: real speedup
- IF S > p: superlinear speedup
41. EFFICIENCY (E)
• Another performance metric
• It estimates how well the processors solve a given
task, compared with how much effort is wasted in
communication and synchronization.
• Ideal condition of a parallel system: S = p (speedup equal to the number of
processing elements -- VERY RARE!!!)
• Efficiency (E) is given by: E = S/p = Ts/(p Tp)
- When E = 1: the LINEAR case
- When E < 1: the REAL case
- When E << 1: the problem is parallelizable with low efficiency
EVALUATION METRICS (Cont…)
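The two metrics above can be written directly as code; this small sketch (helper names are our own) computes S and E from measured serial and parallel times.

```python
def speedup(t_serial, t_parallel):
    """S = Ts / Tp: the benefit of solving the problem in parallel."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E = S / p = Ts / (p * Tp): useful work per processing element."""
    return speedup(t_serial, t_parallel) / p
```

For example, a run that takes 100 s serially and 25 s on 4 processors gives S = 4 and E = 1 (the linear case); the same 25 s on 8 processors gives E = 0.5, a real case where half the capacity is lost to communication, idling or synchronization.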
42. SCALABILITY
• RULE: Efficiency decreases with increasing P; it increases
with increasing N.
But here are the fundamental questions:
1- How effectively can the parallel algorithm use an
increasing number of processors?
2- How must the amount of computation performed scale
with P to keep E constant?
• SCALING is simply the ability to remain efficient on a parallel machine.
- It relates the computing power (how fast tasks are executed)
to the number of processors
- IF we increase the problem size (n) and the number of processors (p) at
the same time, THERE WILL BE NO LOSS IN TERMS OF PERFORMANCE.
- It all depends on how the increase is done, so that efficiency is
maintained or improved.
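This trade-off can be illustrated with a deliberately simple, hypothetical cost model T(p, n) = n/p + log2(p), where the log2(p) term stands in for coordination overhead; the model and its numbers are assumptions for illustration only, not taken from the slides:

```python
import math

def efficiency(n, p):
    """Efficiency E = (T1 / Tp) / p under the assumed cost model
    Tp = n/p + log2(p): useful work n/p plus a log2(p) coordination overhead."""
    t1 = n                      # serial time under the same model
    tp = n / p + math.log2(p)   # parallel time
    return (t1 / tp) / p

# Fixed problem size: efficiency drops as p grows.
print(round(efficiency(10_000, 16), 3), round(efficiency(10_000, 256), 3))
# Scaling n together with p restores efficiency.
print(round(efficiency(160_000, 256), 3))
```

Under this toy model, going from 16 to 256 processors at fixed n costs noticeable efficiency, while growing n along with p recovers it, which is exactly the rule stated above.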
SCALABILITY: EVALUATION METRICS (Cont…)
43. APPRECIATING SPEED AND EFFICIENCY
Note: Serial sections limit the parallel effectiveness.
REMEMBER: If you have a lot of serial computation, then you
will not get good speedup, BECAUSE
- Only the absence of serial work “allows” perfect speedup
- REFER TO Amdahl’s Law to appreciate this truth
44. THE AMDAHL’S LAW
• How many processors can we really use?
Let’s say we have a legacy code such that it is only
feasible to convert half of the heavily used routines
to parallel.
• Amdahl’s Law is widely used to design processors and parallel
algorithms
• Statement: the maximum speedup that can be achieved is limited
by the serial component of the program: S = 1/(1-p), with
(1-p) being the serial component (the part not parallelized) of a program.
Example: A program has 90% of its code parallelized and 10% that must remain
serial. What is the maximum achievable speedup?
Answer: S = 10, since (1-p) = 0.1 and S = 1/0.1 = 10.
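A one-line sketch of the statement above, using the slide’s 90%-parallel example (illustrative code, not from the slides):

```python
# Sketch of Amdahl's statement: with unlimited processors the speedup
# is capped by the serial part, S_max = 1 / (1 - p).

def amdahl_max_speedup(parallel_fraction):
    """Maximum achievable speedup when `parallel_fraction` of the
    program is parallelized and the rest stays serial."""
    return 1.0 / (1.0 - parallel_fraction)

print(amdahl_max_speedup(0.9))  # ≈ 10: the 90%-parallel example
```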
THE AMDAHL’S LAW
45. If we run this on a parallel machine with five processors
(the numbers imply a serial run of about 100 s, half of it serial):
Our code now takes about 60 s.
We have sped it up by about 40%.
Let’s say we use a thousand processors:
We have now sped our code up by about a factor of two. Is this a
big enough win?
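The arithmetic behind these claims can be checked with a small sketch; the 100 s baseline and 50% serial fraction are what the slide’s numbers imply:

```python
# Checking the slide's arithmetic: a 100 s job whose serial half cannot
# be parallelized, run on p processors.

def parallel_time(total, serial_frac, p):
    """Run time when `serial_frac` of the work stays serial and the rest
    is divided evenly over p processors."""
    return total * serial_frac + total * (1 - serial_frac) / p

t5 = parallel_time(100, 0.5, 5)        # 60.0 s  -> speedup ~1.7
t1000 = parallel_time(100, 0.5, 1000)  # 50.05 s -> speedup ~2
print(t5, 100 / t5, t1000, 100 / t1000)
```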
THE AMDAHL’S LAW (Cont…)
49. • The most fundamental limitation on parallel speedup:
if a fraction s of the execution time is serial, then speedup < 1/s.
Is this realistic?
We know that inherently parallel code can be executed in “no time”,
but inherently sequential code still needs the fraction s of time.
Example: if s is 0.2, then the speedup cannot exceed 1/0.2 = 5.
- Using p processors, we can find the speedup:
- The total sequential execution time on a single processor is
normalized to 1
- Serial code on p processors still requires the fraction s of time
- Parallel code on p processors requires the fraction (1 – s)/p of time
- Hence: Speedup = 1 / (s + (1 – s)/p), which stays below 1/s for every p.
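Combining the serial and parallel fractions gives the speedup on p processors; a short illustrative sketch (the processor counts are arbitrary):

```python
# Sketch of Amdahl's formula with a finite processor count:
# S(p) = 1 / (s + (1 - s)/p), where s is the serial fraction.

def amdahl_speedup(s, p):
    """Speedup on p processors when fraction s of the time is serial."""
    return 1.0 / (s + (1.0 - s) / p)

# With s = 0.2 the speedup never exceeds 1/0.2 = 5, however large p gets.
for p in (2, 10, 100, 10_000):
    print(p, round(amdahl_speedup(0.2, p), 3))
```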
THE AMDAHL’S LAW (Cont…)
51. Example: a 2-phase calculation
* Sweep over an n-by-n grid and do some independent computation
* Sweep again and add each value to a global sum
- Time for the first phase on p parallel processors = n²/p
- The second phase is serialized at the global variable, so its time = n²
Speedup <= 2n² / (n²/p + n²) = 2p / (p + 1),
or at most 2 for large p
Improvement: divide the second phase into two steps
- Accumulate p private sums during the first sweep
- Add the per-process private sums into the global sum
- Parallel time is: n²/p + n²/p + p,
and
speedup <= 2n² / (2n²/p + p), which approaches p for large n
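A quick numeric check of the two bounds; n = 1000 and p = 64 are arbitrary illustrative values, and the two time formulas are the ones derived above:

```python
# Numeric check of the two speedup bounds for the 2-phase grid example.

def speedup_serialized(n, p):
    """Sequential time 2n² over parallel time n²/p + n² (second phase serialized)."""
    return 2 * n * n / (n * n / p + n * n)

def speedup_private_sums(n, p):
    """Sequential time 2n² over improved parallel time 2n²/p + p (p private sums)."""
    return 2 * n * n / (2 * n * n / p + p)

print(speedup_serialized(1000, 64))    # just under 2, as predicted
print(speedup_private_sums(1000, 64))  # close to p = 64
```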
A PRACTICAL APPLICATION OF AMDAHL’S LAW
52. Amdahl's law is based on a fixed workload or fixed problem size.
It implies that the sequential part of a program does
not change with respect to machine size (i.e., the
number of processors), and that
the parallel part is evenly distributed over P processors.
Gustafson's idea was to select or reformulate problems in order to minimize the sequential
part of a program, so that solving a larger problem in the same amount of time becomes
possible.
This Law therefore considers that:
- While increasing the dimension of a problem, its sequential part remains constant
- While increasing the number of processors, the work required on each of them remains the
same.
Mathematically: S(P) = P - α(P - 1), with P the number of processors, S the speedup and α
the non-parallelizable fraction of any parallel process.
NOTE: This expression contrasts with Amdahl's Law, which considers the single-process execution
time as a fixed quantity and compares it to a shrinking per-process parallel execution time.
Amdahl assumes a fixed problem size, because he believes that the overall workload of a
program does not change with the machine size (number of processors).
Gustafson's Law therefore addresses the deficiency of Amdahl's Law, which does not consider
the total number of computing resources involved in solving a task. Gustafson suggests
considering all computing resources if we intend to achieve efficient parallelism.
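Gustafson’s expression can be sketched in a few lines; the values α = 0.05 and P = 64 below are illustrative, not from the slides:

```python
# Sketch of Gustafson's scaled speedup S(P) = P - α(P - 1),
# where α is the serial fraction of the scaled run.

def gustafson_speedup(p, alpha):
    """Scaled speedup for P processors with serial fraction alpha."""
    return p - alpha * (p - 1)

print(gustafson_speedup(64, 0.05))  # ≈ 60.85: near-linear despite 5% serial work
```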
FIXED SIZE VS SCALED SIZE
53. Let n be a measure of the problem size.
The (normalized) execution time of the program on a parallel computer is:
a(n) + b(n) = 1
where a is the sequential fraction and b is the parallel
fraction.
On a sequential computer, the same work takes:
a(n) + p.b(n), where p is the number of processors in
the parallel case.
And, Speedup = a(n) + p.b(n) = a(n) + p.(1 - a(n))
Assume the serial fraction a(n) diminishes with the problem size n;
then the speedup approaches p as n approaches infinity, as
desired.
WHAT DO WE MEAN ?
WHAT ABOUT GUSTAFSON’S LAW
56. • Parallel and Distributed Computing aims at satisfying the
requirements of next-generation computing systems by:
- Providing a platform for fast processing
- Providing a platform where the management of large and
complex amounts of data no longer constitutes a
major bottleneck to the understanding of complex
phenomena.
The domain intends to provide a far better user
experience, provided the software development field succeeds in
satisfying the requirements of such a design, and the technology
finally solves the issues related to the noticeable disparity
between the clock-rate growth of high-end processors and
memory access time.
CONCLUSION
57. • Kindly look at the diagram below and answer the following questions:
1- How do you classify such a design: serialization or parallelism? Justify your answer.
2- Kindly explain what M1, P1 and D1 represent.
3- What are the functions of:
- Processor-Memory Interconnection Network (PMIN)
- Input-Output-Processor Interconnection Network (IOPIN)
- Interrupt Signal Interconnection Network (ISIN)
4- Explain, in your own terms, the concept of a Shared Memory System / Tightly Coupled System.
5- Flynn’s classification of computers is based on the multiplicity of instruction streams and data
streams observed by the CPU during program execution.
Can you identify another way computers can be classified? Elaborate the concept according to the
author.
Submission Date: 4 October 2020, Time: 12 PM
Email: malobecyrille.marcel@ictuniversity.org
NOTE: Late submission = - 50% of the assignment Points. Acceptable ONCE.
ASSIGNMENT 1b
59. Further Reading
• Recommended reading:"Designing and Building Parallel
Programs". Ian Foster.
https://www.mcs.anl.gov/~itf/dbpp/
• "Introduction to Parallel Computing". Ananth Grama, Anshul
Gupta, George Karypis, Vipin Kumar.
http://www-users.cs.umn.edu/~karypis/parbook/
• "Overview of Recent Supercomputers". A.J. van der Steen,
Jack Dongarra.
OverviewRecentSupercomputers.2008.pdf
60. REFERENCES
1. K. Hwang, Z. Xu, “Scalable Parallel Computing”, Boston:
WCB/McGraw-Hill, c1998.
2. I. Foster, “Designing and Building Parallel Programs”, Reading,
Mass: Addison-Wesley, c1995.
3. D. J. Evans, “Parallel SOR Iterative Methods”, Parallel Computing,
Vol. 1, pp. 3-8, 1984.
4. L. Adams, “Reordering Computations for Parallel Execution”,
Commun. Appl. Numer. Methods, Vol. 2, pp. 263-271, 1985.
5. K. P. Wang and J. C. Bruch, Jr., “A SOR Iterative Algorithm for the
Finite Difference and Finite Element Methods that is Efficient and
Parallelizable”, Advances in Engineering Software, 21(1), pp. 37-48,
1994.
6. Stefan Boeriu, Kai-Ping Wang and John C. Bruch Jr., Lecture Notes on
Parallel Computation, Office of Information Technology and
Department of Mechanical and Environmental Engineering, University of California,
Santa Barbara, CA.
7. John Mellor-Crummey, COMP 422/534 Parallel Computing: An Introduction,
Department of Computer Science, Rice University, johnmc@rice.edu, January 2020.
8. Roshan Karunarathna, Introduction to Parallel Computing, 2020.
9. Safwat Hamad, Distributed Computing, Lecture 1: Introduction, FCIS Science
Department, 2020.
Editor’s note:
Moore’s Law: the number of transistors on a microchip doubles every two years, so we can expect the speed and capability of our computers to increase every couple of years.