UBa/NAHPI-2020
Department of Computer Engineering
PARALLEL AND DISTRIBUTED
COMPUTING
By
Malobe LOTTIN Cyrille .M
Network and Telecoms Engineer
PhD Student- ICT–U USA/CAMEROON
Contact
Email: malobecyrille.marcel@ictuniversity.org
Phone: 243004411/695654002
CONTENT
 Part 1- Introducing Parallel and Distributed Computing
• Background Review of Parallel and Distributed Computing
• INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING
• Some key terminologies
• Why Parallel Computing?
• Parallel Computing: the Facts
• Basic Design Computer Architecture: the von Neumann Architecture
• Classification of Parallel Computers (SISD,SIMD,MISD,MIMD)
• Assignment 1a
 Part 2- Initiation to Parallel Programming Principles
• High Performance Computing (HPC)
• Speed: a need to solve Complexity
• Some Case Studies Showing the Need for Parallel Computing
• Challenge of explicit Parallelism
• General Structure of Parallel Programs
• Introduction to the Amdahl's LAW
• The GUSTAFSON’s LAW
• SCALABILITY
• Fixed Size Versus Scale Size
• Assignment 1b
• Conclusion
BACKGROUND
• Interest in PARALLEL COMPUTING dates back to the late 1950s.
• Supercomputers surfaced throughout the 60s and 70s, introducing shared-memory
multiprocessors, with multiple processors working side by side on shared data.
• By the 1980s, a new kind of parallel computing was launched:
- Introduction of a supercomputer for scientific applications built from 64
Intel 8086/8087 processors, designed by the Caltech Concurrent Computation project.
- Demonstration that great performance could be achieved with Massively
Parallel Processors (MPP).
• By 1997, the ASCI Red supercomputer broke the barrier
of one trillion floating point operations per second (FLOPS).
• The 1980s also introduced the concept of CLUSTERS (many computers
operating as a single unit, performing the same task under the
supervision of software), the main systems that competed with and
displaced MPPs in many applications.
BACKGROUND
• TODAY:
Parallel computing is becoming mainstream, based on multi-core processors.
 Chip manufacturers are increasing overall processing performance by
adding additional CPU cores (Dual Core, Quad Core, etc.).
WHY?
• Increasing performance through parallel processing appears to be far more
energy-efficient than increasing microprocessor clock frequencies.
• Better performance is predicted by Moore's Law, which ties system capability
to the growing number of transistors.
Consequence: we shall be going from a few cores to many….
• Besides, software development has been very active in the evolution of parallel
computing. The field must keep pace if parallel and distributed systems are to expand!
• BUT parallel programs have been harder to write than sequential ones.
Why? - Difficulty in synchronizing the multiple tasks run by those
programs at the same time.
• For MPP and clusters: Application Programming Interfaces (APIs) converged to a single
standard called MPI (Message Passing Interface) that handles parallel computing
architectures.
• For shared-memory multiprocessor computing, convergence is towards two
standards: pthreads and OpenMP.
KEY CHALLENGE: ensure an effective transition of the software industry to parallel
programming so that a new generation of systems can emerge and offer a more
powerful user experience of digital technologies, solutions and applications.
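To picture what the MPI standard named above looks like in practice, here is a hedged, minimal sketch of an MPI "hello world" in C (the file name and the compile/run commands shown in the comments are typical but may differ per installation):

```c
/* hello_mpi.c - minimal MPI example (illustrative sketch).
   Typical build:  mpicc hello_mpi.c -o hello_mpi
   Typical run:    mpirun -np 4 ./hello_mpi                      */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime          */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id (0..size-1)  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes      */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut the MPI runtime down      */
    return 0;
}
```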
INTRODUCTION TO PARALLEL COMPUTING
WHAT IS PARALLEL COMPUTING( Parallel Execution) ?
Traditionally, software is written for serial computation. That is, it will:
– RUN on a single computer having a single
Central Processing Unit (CPU);
– Break any given problem into a discrete series of
instructions.
– Those instructions are then executed one after another.
– Important: only one instruction may execute at any moment in
time.
Generally, serial processing compared to parallel processing is as
follows:
 CASE OF SERIAL COMPUTATION
What about executing these micro-programs simultaneously?
 PARALLEL COMPUTATION
• Here, the problem is broken into discrete parts that can be solved at the same
time (concurrently)
• Discrete parts are then broken down into a series of instructions
• These instructions from each part execute simultaneously on different
processors under the supervision of an overall control/coordination
mechanism
INTRODUCTION TO PARALLEL COMPUTING (Cont...)
[Figures: a payroll problem solved serially (micro-programs run one after another) versus the same problem broken into discrete parts solved in parallel]
INTRODUCTION TO PARALLEL COMPUTING (Cont..)
It means that to compute in parallel:
the problem must be broken apart into discrete pieces of work that can be solved
simultaneously;
at a given time t, instructions from multiple parts of the program should be able to execute;
 the time taken to solve the problem should be far shorter than with serial
computation (a single compute resource).
WHO DOES THE COMPUTING?
• A single computer with multiple processors/cores
• Can also be an arbitrary number of such computers connected by a network
There is a “jargon” used in the area of PARALLEL COMPUTING. Some key terms
are:
 PARALLELISM: the ability to execute parts of a computation concurrently.
 Supercomputing / High Performance Computing (HPC): refers to the world's fastest and largest computers, with the
ability to solve large problems.
 Node: a standalone «single» computer that will form the supercomputer once networked together.
 Thread: a unit of execution consisting of a sequence of instructions that is managed by either the operating
system or a runtime system.
 CPU / Socket / Processor / Core: basically a singular execution component of a computer. Individual CPUs are
subdivided into multiple cores, which constitute individual execution units. A SOCKET refers to a CPU package with multiple
cores. The terminology can be confusing, but this is the center of computing operations.
 Task: a program or program-like set of instructions that is executed by a processor. Parallelism involves multiple
tasks running on multiple processors.
 Shared Memory: an architecture where all computer processors have direct (usually bus-based) access to common
physical memory.
 Symmetric Multi-Processor (SMP): a shared-memory hardware architecture where multiple processors share a
single address space and have equal access to all resources.
 Granularity (Grain Size): refers to a given task and is a measure of the amount of work (or
computation) performed by that task. When a program is split into large tasks, each generating a large
amount of computation, this is called coarse-grained parallelism. When the split produces
small tasks with little computation each, it is called fine-grained.
 Massively Parallel: refers to the hardware of parallel systems with many processors (“many” = hundreds of
thousands).
 Pleasantly Parallel: solving many similar but independent tasks simultaneously; requires very little
communication.
 Scalability: a proportionate increase in parallel speedup with the addition of more processors.
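The coarse- versus fine-grained distinction above can be made concrete with a hedged OpenMP sketch in C (the array size and chunk values are arbitrary choices, not values from the slides): the chunk argument of schedule() sets how much work each thread grabs at a time, i.e., the grain size.

```c
#include <omp.h>

#define N 1000000
double a[N];

void scale_and_shift(double factor) {
    /* Coarse grain: each thread receives a few large chunks of iterations. */
    #pragma omp parallel for schedule(static, N / 8)
    for (int i = 0; i < N; i++) a[i] *= factor;

    /* Fine grain: threads repeatedly grab tiny chunks (more scheduling overhead). */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < N; i++) a[i] += 1.0;
}
```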
CONCEPT AND TERMINOLOGY
• To address the limitations of serial computing:
 It is increasingly expensive to make a single processor faster.
 Serial speed is directly dependent upon how fast data can move through hardware
(the transmission bus); distances between processing elements must be minimized to improve speed.
 Serial computing does not satisfy the constraints of reality, where many events happen concurrently. There is a need for
a solution suitable for modeling and simulating complex real-world phenomena (examples:
modeling the assembly of cars or jets, or traffic during rush hours).
Also
o Physical limitations of hardware components
o Economical reasons – more complex = more expensive
o Performance limits – doubling the frequency does not double performance
o Large applications – demand too much memory and time
SO….
We need to :
 Save time - wall clock time
 Solve larger problems in the most efficient way
 Provide concurrency (do multiple
things at the same time)
IT MEANS…
with more parallelism,
– We solve larger problems in the same time
– AND, solve a fixed size problem in shorter time
NOTE: if we agree that most stand-alone computers have multiple functional units (L1 cache, L2
cache, branch, prefetch, decode, floating-point, graphics processing (GPU), integer, etc.),
multiple execution units or cores, and multiple hardware threads, THEN all stand-alone
computers today can be characterized as implementing PARALLEL COMPUTING.
WHY PARALLEL COMPUTING ?
Future of computing cannot be conceived
without parallel processing .
 Continuous development and expansion of the Internet
and the improvement in network management schemes.
Having better means available for a group of computers
to cooperate in solving a computational problem will
inevitably translate into a higher profile of clusters and
grids in the landscape of parallel and distributed
computing. Akl S.G., Nagy M. (2009)
The continuing increase in computer power will provide
more SCALING ability to parallel computing programs.
PARALLEL COMPUTING: THE FACTS
Basic Design Architecture of Parallel Computing:
the von Neumann Architecture
From "hard wired" computers (computers were programmed through wiring) to
"stored-program computers" where both program instructions and data are kept in
electronic memory, all computers basically share the same design comprising:
• Four main components: Memory, Control Unit, Arithmetic Logic Unit and Input /
Output
 Read/write, Random Access Memory is used to store both program instructions (coded
data which tell the computer to do something) and data (information to be used by the
program)
 The Control Unit fetches instructions/data from memory, decodes the instructions and
then sequentially coordinates operations to accomplish the programmed task.
 The Arithmetic Logic Unit performs basic arithmetic operations
 Input / Output is the interface to the human operator
NOTE: parallel computers still follow this basic design; only the units are multiplied. The
basic, fundamental architecture remains the same.
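The fetch-decode-execute cycle coordinated by the control unit can be sketched as a toy simulator. The mini instruction set below (LOAD/ADD/STORE/HALT) is hypothetical, invented only to show a stored program driven by one control loop; it is not any real machine's encoding:

```c
#include <stdio.h>

enum { LOAD, ADD, STORE, HALT };           /* toy opcodes (hypothetical)       */

int main(void) {
    /* "Stored program": instructions and data share the same memory.          */
    int memory[16] = { LOAD, 10, ADD, 11, STORE, 12, HALT, 0,
                       0, 0, /* data: */ 5, 7, 0 };
    int pc = 0, acc = 0, running = 1;

    while (running) {
        int op = memory[pc++];             /* fetch                            */
        switch (op) {                      /* decode ...                       */
        case LOAD:  acc  = memory[memory[pc++]]; break;   /* ... and execute   */
        case ADD:   acc += memory[memory[pc++]]; break;
        case STORE: memory[memory[pc++]] = acc;  break;
        case HALT:  running = 0;                 break;
        }
    }
    printf("result = %d\n", memory[12]);   /* 5 + 7 = 12                       */
    return 0;
}
```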
• VARIOUS ways to classify parallel computers
• THE MOST WIDELY USED CLASSIFICATION: Flynn's Taxonomy.
- Classification here is made along two independent dimensions:
Instruction Stream and Data Stream.
- Each dimension can take only one of two possible states:
Single or Multiple (instruction/data streams).
CLASSIFICATION OF PARALLEL COMPUTERS
Flynn's quadrants (instruction streams × data streams, each SINGLE or MULTIPLE):
SISD: Single Instruction Stream, Single Data Stream (one CPU + one memory)
SIMD: Single Instruction Stream, Multiple Data Streams (one CPU + multiple memory banks)
MISD: Multiple Instruction Streams, Single Data Stream (multiple CPUs + one memory)
MIMD: Multiple Instruction Streams, Multiple Data Streams (multiple CPUs + multiple memory banks)
The SISD: Single Instruction stream (only one instruction stream is acted on by
the CPU during any one clock cycle), Single Data stream (only one data stream is
used as input during any one clock cycle).
- This is the most common computer produced and
used. Example: workstations, PCs, etc.
- Here, only one CPU is present and each instruction operates on one
data item at a time.
- Execution is deterministic.
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
[Figure: SISD organization: a single CPU connected to memory by a bus, carrying one instruction and one data operand at a time]
SIMD: Single Instruction stream, Multiple Data Stream
• Computers here perform parallel computing and are known as VECTOR COMPUTERS
• They were the first type of computer to assemble a massive number of processors, with
computational power above the gigaFLOP range.
• Machines execute ONE instruction stream but on MULTIPLE (different) data items,
considered as multiple data streams.
• It means all processing units execute the same instruction at any given clock cycle
(instruction stream), with the flexibility that each processing unit can operate on a
different data element (data stream).
• This type of processing is suitable for problems exhibiting a high degree of regularity.
Example: graphics/image processing
• The processing method is characterized as synchronous and deterministic
• There are two varieties of such processing: processor arrays and vector pipelines
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
[Figure: SIMD organization: one CPU fetching a single instruction while multiple memory banks (Memory Bank 1, Memory Bank 2) each supply a different data operand (di, dii)]
With the Single Instruction stream, Multiple Data Stream (SIMD) design,
• only one CPU is present in the computer. This CPU has:
- one instruction register but multiple Arithmetic Logic Units (ALUs), and it uses multiple
data buses to fetch multiple operands simultaneously.
- The memory is divided into multiple banks that are accessed independently, with
multiple data buses so that the CPU can access data simultaneously.
Operational behavior
 only 1 instruction is fetched by the CPU at a time;
 some instructions, known as VECTOR INSTRUCTIONS (which fetch multiple
operands simultaneously), operate on multiple data items at once.
EXAMPLES OF SIMD COMPUTERS
• Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
• Most modern computers integrating graphics processing units (GPUs) employ SIMD instructions and execution units.
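As a hedged illustration of the SIMD idea (one instruction acting on several data elements at once), here is a small C sketch using x86 SSE intrinsics; the assumption is an SSE-capable processor, and the loop bound is kept simple for clarity:

```c
#include <xmmintrin.h>   /* SSE intrinsics: __m128 holds four floats */

/* c[i] = a[i] + b[i]; n is assumed to be a multiple of 4 for brevity. */
void add_simd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats from a        */
        __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 floats from b        */
        __m128 vc = _mm_add_ps(va, vb);    /* ONE add instruction, 4 data */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results             */
    }
}
```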
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
MIMD: Multiple Instructions streams, Multiple Data Streams
• Here, multiple instructions are executed at ONCE, so there are multiple data
streams.
• Each instruction operates on its own data independently (multiple
operations on the same data are rare!)
• Two main types of MIMD computers: Shared Memory and Message Passing
Shared Memory MIMD
 In shared-memory MIMD, memory locations are all accessible by all the
processors (CPUs): this type of computer is a multiprocessor computer
 Most workstations and high-end PCs are multiprocessor based today.
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
[Figure: Shared Memory MIMD: CPU 1, CPU 2, …, CPU n connected by buses to a shared memory bank, each fetching its own instruction (instruction i, instruction ii) and operand (operand i, operand k)]
Shared Memory MIMD computers are characterized
by:
 Multiple CPUs in a computer sharing the same memory.
 Even though the CPUs are tightly coupled, each CPU fetches ONE
INSTRUCTION at a time t, and different CPUs can fetch
different instructions, thereby generating multiple instruction
streams.
 The memory must be designed with multiple access points
(that is, organized into multiple independent
memory banks) so that multiple instructions/operands can be
transferred simultaneously. This structure helps avoid
access conflicts between CPUs on the same memory bank.
Finally, each instruction operates on one data item at a time!
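A hedged C sketch of this shared-memory MIMD style using POSIX threads (pthreads, one of the shared-memory standards named earlier): several threads run their own instruction streams while sharing one array in memory. The array size and thread count are arbitrary illustration values.

```c
#include <pthread.h>
#include <stdio.h>

#define N        8000000
#define NTHREADS 4

static double data[N];                     /* shared memory, visible to all threads */

static void *fill_part(void *arg) {
    long t = (long)arg;                    /* thread id 0..NTHREADS-1               */
    long lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
    for (long i = lo; i < hi; i++)         /* each thread works on its own slice    */
        data[i] = (double)i * 0.5;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, fill_part, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);        /* wait for all threads to finish        */
    printf("data[N-1] = %f\n", data[N - 1]);
    return 0;
}
```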
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
 Message Passing MIMD (CLUSTER computing)
• In this architecture, multiple SISD computers are interconnected by
a COMMUNICATION NETWORK.
• Each CPU has its own private memory; there is no
sharing of memory among the various CPUs.
• It is possible for programs running on different CPUs to exchange
information if required. In that case, the exchange is done through
MESSAGES.
• Message passing computers are cheaper to manufacture than
shared-memory MIMD.
• A dedicated HIGH SPEED NETWORK SWITCH is needed to
perform the interconnection of the SISD computers.
• MIMD message passing computers always provide a Message
Passing API (Application Programming Interface) so that
programmers can include in their programs statements
that exchange messages. Example: the Message Passing
Interface (MPI)
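A hedged sketch of such a message exchange between two processes using MPI (the tag value and the message content 42 are arbitrary; the program assumes it is launched with at least two processes, e.g. mpirun -np 2):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* node 0: send a value to node 1 */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* node 1: receive it             */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }
    MPI_Finalize();
    return 0;
}
```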
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
 Message Passing MIMD (CLUSTER computing): operational architecture
[Figure: Message Passing MIMD: several SISD nodes (SISD 1, …, SISD n-1, SISD n, SISD n+1), each with its own CPU and private memory on a local bus, interconnected through a SWITCH and exchanging data via messages]
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
MISD: Multiple Instruction Streams, Single Data Stream
Assignment 1a:
Research on Multiple Instruction stream, Single Data
stream (MISD) Parallel Computing.
You will emphasize:
1. Architecture design and modeling
2. Properties of such a design
3. Operational details
4. Practical example and Specifications.
Submission Date: 28 October 2020 Time: 12 Pm
Email: malobecyrille.marcel@ictuniversity.org
NOTE: Late submission = - 50% of the assignment Points.
Acceptable ONCE.
CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
END OF PART 1
What are we saying?
Viewed this way, virtually all stand-alone computers today are parallel.
From a hardware perspective, computers have:
 Multiple functional units (floating point, integer, GPU, etc.)
 Multiple execution units / cores
 Multiple hardware threads
CHECK YOUR PROGRESS …..
• Check Your Progress 1
1) What are various criteria for classification of parallel computers?
…………………………………………………………………………………………
…………………………………………………………………………………………
2) Define instruction and data streams.
…………………………………………………………………………………………
…………………………………………………………………………………………
3) State whether True or False for the following:
If Is=Instruction Stream and Ds=Data Stream,
a) SISD computers can be characterized as Is > 1 and Ds > 1
b) SIMD computers can be characterized as Is > 1 and Ds = 1
c) MISD computers can be characterized as Is = 1 and Ds = 1
d) MIMD computers can be characterized as Is > 1 and Ds > 1
4) Why do we need Parallel Computing ?
…………………………………………………………………………………………
…………………………………………………………………………………………
• Why do we need HPC ?
1. Save time and/or money: the more resources you allocate to a given task, the
faster you expect it to complete, saving time and money. Also, parallel
clusters can be built from cheap, commodity components.
2. Solve larger problems: there are many complex problems that cannot be solved
on a single computer, especially given its limited memory.
3. Provide concurrency: a single compute resource can only do one thing at a time.
Multiple computing resources can be doing many things simultaneously.
4. Use of non-local resources: HPC provides the flexibility to use compute
resources on a wide area network, or even the Internet, when local compute
resources are scarce.
SO….
High-performance computing (HPC) is the use of parallel processing
for running advanced application programs efficiently, reliably and quickly.
HIGH PERFORMANCE COMPUTING (HPC)
CLOSE TO REALITY
THE MOORE’S LAW PREDICTION
 Statement [1965]:
`The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly
over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase
is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.
That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.''
REVISION in [1975]: `There is no room left to squeeze anything out by being clever. Going forward from here we have to
depend on the two size factors - bigger dies and finer dimensions.''
IT MEANS:
- Prioritize minimum component size and improve power by increasing the number of transistors. That is…
- More transistors = ↑ opportunities for exploiting parallelism
If one is to buy into Moore's Law (the prediction that the transistor density of semiconductor
chips doubles roughly every 18 months), the question still remains:
• how does one translate transistors into useful OPS (operations per second)?
A tangible solution is to rely on parallelism, both implicit and explicit.
TWO possible ways to implement parallelism:
 Implicit parallelism: invisible to the programmer
– pipelined execution of instructions, using a conventional language such as C, Fortran or Pascal to write the source program;
– the source program is sequential and is translated into parallel object code by a parallelizing compiler that detects
parallelism and assigns target machine resources. This is applied in programming shared multiprocessors and requires less
effort from the programmer.
 Explicit parallelism
– Very Long Instruction Words (VLIW) require more effort by the programmer to develop a source program;
– they are made of bundles of independent instructions that can be issued together, reducing the burden on the compiler to
detect parallelism. Example: the Intel Itanium processor (2000-2017).
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
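As a hedged illustration of the implicit parallelism described above: the loop below is plain sequential C, and a parallelizing/vectorizing compiler (for instance GCC or Clang at a high optimization level such as -O3) may turn it into vector or parallel object code with no change from the programmer. Whether it actually does so depends on the compiler and target, so this is a sketch of the idea, not a guarantee.

```c
/* saxpy.c - the programmer writes ordinary sequential code;              */
/* an optimizing compiler may vectorize this loop automatically.          */
void saxpy(int n, float alpha, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];   /* independent iterations: parallelizable */
}
```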
IS MOORE’S LAW STILL APPLICABLE ?
• Up to early 2000s, transistor count was a valid indication of how much additional processing
power could be packed into an area.
• Moore’s law and Dennard scaling, when combined, held that more transistors could be
packed into the same space and that those transistors would use less power while operating
at a higher clock speed.
ARGUMENTS
 Because classic Dennard scaling no longer occurs at each lower node, packing more
transistors into a smaller space no longer guarantees lower total power
consumption; consequently, transistor count no longer correlates directly with higher performance.
• The major limiting factor: hot spot formation.
Possible way forward TODAY:
- Focus on improving CPU cooling (one of the biggest barriers to
higher CPU clock speeds is hot spots) by either improving the
efficiency of the Thermal Interface Material (TIM), improving lateral
heat transfer within the CPU itself, or making use of computational sprinting
to increase thermal dissipation.
HOWEVER: this won't improve compute performance over
sustained periods of time,
BUT it would speed up latency-sensitive applications such as web page
loads or brief, intensive computations.
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
42 Years of Microprocessor Trend Data.
Orange: Moore's Law trend;
Purple: Dennard scaling breakdown;
Green & Red: immediate implications of the Dennard scaling breakdown;
Blue: slowdown of the increase in single-thread performance;
Black: the age of increasing parallelism
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
SOURCE: Karl Rupp. 42 Years of Microprocessor Trend Data. https://www.karlrupp.net/ 2018/02/42-years-of-microprocessor-trend-data/, 2018. [Online].
ILLUSTRATION
OTHER HPC LIMITING FACTORS
- Disparity between the growth of high-end processor clock rates and memory access time:
clock rates grew about 40% per year over the past decade while DRAM speed grew about 10% per year over the same
period. This is a significant performance bottleneck.
This issue is addressed by parallel computing by:
• providing increased bandwidth to the memory system
• offering higher aggregate caches.
This explains why some of the fastest growing applications of parallel computing exploit not
their raw computational speed, but rather their ability to pump data to memory and disk faster.
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
Source: How Multithreading Addresses the Memory Wall - Scientific Figure on ResearchGate. Available from:
https://www.researchgate.net/figure/The-growing-CPU-DRAM-speed-gap-expressed-as-relative-speed-over-time-log-scale-The_fig1_37616249
[accessed 22 Oct, 2020]
PROCESSORS EVOLUTION: CASE OF INTEL PROCESSORS
From 2017, 9 generations of Processors have been developed.
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
Intel Core i10 processor
Source: Retrieved from https://strackit.com/tech-news/intel-core-i10-processor/
• Differences between the Core i3, i5, i7 and i9 processor series:

Generation | Number of cores | Specifications | Cache size (MB)
Core i3 | 2 physical cores | Cheapest processors; use INTEL® HYPER-THREADING TECHNOLOGY, which adds 2 virtual cores to the 2 physical ones, so the operating system determines that the processor has 4 cores | 3-4 MB
Core i5 | 4 physical cores (some models have only 2 physical cores + 2 virtual) | Higher performance is achieved through the 4 physical cores and an increased volume of cache memory | 4 or 8 MB
Core i7 | 4 to 8 physical cores; use INTEL® HYPER-THREADING TECHNOLOGY | Performance increased by virtual cores and a large volume of cache memory; processors for mobile devices can have 2 physical cores | 8 MB to 20 MB
Core i9 | 6-8 physical cores | The i9 series was conceived as a competitor to AMD gaming processors; more cores and slightly more speed, but since the i9 is only slightly better than the i7 there is practically no sense in this processor line | 10-20 MB
HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
 ACHIEVING GREATER SPEED WILL HELP IN UNDERSTANDING VARIOUS
PHENOMENA APPLICABLE TO DIFFERENT DOMAINS OF LIFE.
• Science
—understanding matter from elementary particles to cosmology
—storm forecasting and climate prediction
—understanding biochemical processes of living organisms
• Engineering
—multi-scale simulations of metal additive manufacturing processes
—understanding quantum properties of materials
—understanding reaction dynamics of heterogeneous catalysts
—earthquake and structural modeling
—pollution modeling and remediation planning
—molecular nanotechnology
• Business
—computational finance - high frequency trading
—information retrieval
—data mining “big data”
• Defense
—nuclear weapons stewardship
• Computers:
— Embedded systems increasingly rely on
distributed control algorithms.
— Network intrusion detection, cryptography, etc.
— Optimizing performance of modern automobile.
— Networks, mail-servers, search engines…
— Visualization
SPEED: A NEED TO SOLVE COMPLEXITY
• Parallelism finds applications in very diverse domains for different
motivating reasons. These range from improved application
performance to cost considerations.
CASE 1: Earthquake Simulation in Japan
HOW DO WE PREVENT SUCH EVENTS FROM HAPPENING AGAIN?
• We need computers with the ability to pool computational power in order to
simulate and calculate high-level operations for better prediction of natural phenomena.
SOME CASE STUDIES
SOURCE: Earthquake Research Institute, University of Tokyo Tonankai-Tokai Earthquake Scenario. Video Credit: The Earth Simulator Art Gallery, CD-ROM, March 2004
SOME CASE STUDIES (Cont…)
CASE 2: El Niño
El Niño is an anomalous, yet periodic, warming of the central and eastern equatorial Pacific Ocean.
For reasons still not well understood, every 2-7 years, this patch of ocean warms for six to 18 months
El Niño was strong through the Northern Hemisphere winter 2015-16, with a transition to ENSO-
neutral in May 2016.
HOW DO WE EXPLAIN SUCH A PHENOMENON?
Perhaps bringing the processors of distributed computers
(placed in different locations) into collaboration can help provide an answer. Parallel programming must
therefore be able to deliver compatible software that can contribute to the collection
and analysis of data.
• Most of the parallelism concepts we shall study are of the
explicit kind.
Challenges related to explicit parallelism are:
 Algorithm development is harder
—complexity of specifying and coordinating concurrent
activities
 Software development is much harder
—lack of standardized & effective development tools and
programming models
—subtle program errors: race conditions
 Rapid pace of change in computer system architecture
—a great parallel algorithm for one machine may not be
suitable for another
– example: homogeneous multicore processors vs. GPUs
Challenges of EXPLICIT PARALLELISM
Parallel science applications are often very sophisticated
—e.g. adaptive algorithms may require dynamic load
balancing
• Multilevel parallelism is difficult to manage
• Extreme scale exacerbates inefficiencies
—algorithmic scalability losses
—serialization and load imbalance
—communication or I/O bottlenecks
—insufficient or inefficient parallelization
• Hard to achieve top performance even on individual
nodes
—contention for shared memory bandwidth
—memory hierarchy utilization on multicore processors
Challenges of PARALLELISM IN GENERAL
• IT IS NOT ALL ABOUT COMPUTATION.
There is also a need to:
Improve on Memory latency and bandwidth. Because,
—CPU rates are > 200x faster than
memory
—bridge speed gap using memory
hierarchy
—more cores exacerbate demand
 Improve interprocessor communication
 Improve input/output (I/O)
— I/O bandwidth to disk typically needs to grow linearly
with the number of processors
ACHIEVING HIGH PERFORMANCE ON PARALLEL SYSTEMS
EXPLICITLY
define tasks, work decomposition, data decomposition,
communication, synchronization.
EXAMPLE: MPI is a library for fully
explicit parallelization.
“It is either All or nothing”.
IMPLICITLY
define tasks only, with the rest implied; or define tasks and
work decomposition, with the rest implied.
EXAMPLE: OpenMP is a high-level
parallel programming
model, which is mostly an
implicit model.
HOW DO WE EXPRESS PARALLELISM IN A PROGRAM?
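A hedged C sketch of the implicit, OpenMP style mentioned above: the programmer only marks the loop as parallel and asks for a reduction; the work decomposition, communication and synchronization are implied and handled by the runtime. The loop body (a partial harmonic sum) is an arbitrary example.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* The directive is the whole parallelization: thread creation, work    */
    /* splitting and combining the partial sums are handled by OpenMP.      */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```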
All parallel programs contain:
- Parallel sections
And,
- Serial sections (serial sections are when work is duplicated or no useful work is being done, e.g., waiting
for others)
We therefore need to Build efficient algorithms by avoiding:
- Communication delay
- Idling
- Synchronization
QUICK VIEW ON THE STRUCTURE OF A PARALLEL PROGRAM
Generally, parallel thinking is closer to us than we believe. Daily, we try to do things
simultaneously, avoiding delay in any of them. Parallel computing thinking is not far from this….
For a given task to be done by many, WE MAY ASK OURSELVES:
How many people are involved in the work?
(Degree of parallelism)
 What is needed to begin the work?
(Initialization)
Who does what?
(Work distribution)
 How do we regulate access to parts of the work?
(Data/I/O access)
 Find out whether they need information from each other to finish their
own jobs.
(Communication)
 When are they all done?
(Synchronization)
 What needs to be done to collate the results?
A WAY TO THINK: THE PARALLEL APPROACH
• The development of parallel programming imposes the need for
performance metrics and software tools in order to evaluate the
performance of parallel algorithms.
• Some factors help in achieving this goal:
- the type of hardware used
- the degree of parallelism of the problem
- the type of parallel model used
The goal is to compare what is obtained (the parallel program) with
what was there (the original sequential one).
The analysis focuses on the number of threads and/or the number of
processes used.
Note:
Amdahl's Law will introduce the limitations related to parallel
computation.
And,
Gustafson's Law will evaluate the degree of efficiency of
parallelization of a sequential algorithm.
EVALUATION METRICS
 Relation between the Execution time (Tp) and Speedup, (S)
S(p, n) = T(1, n) / T(p, n)
- Usually, S(p, n) < p
- Sometimes S(p, n) > p (super linear speedup)
 Efficiency, E
E(p, n) = S(p, n)/p
- Usually, E(p, n) < 1, Sometimes, greater than 1
 Scalability – Limitations in parallel computing, relation to n and p.
EVALUATION METRICS (Cont…)
Speedup Measurement (S)
• Speedup is a MEASURE
• It helps in appreciating the benefit of solving a problem in parallel
• It is given by the ratio of the time taken to solve a problem on a single
processing element (Ts) to the time required to solve the same problem
on p identical processing elements (Tp).
• That is: S = Ts/Tp
- IF S = p (ideal condition): LINEAR SPEEDUP (the speed of execution scales linearly with the number of processors.)
- IF S < p:  real speedup
- IF S > p:  superlinear speedup.
EFFICIENCY (E)
• Another performance metric
• It estimates how well the processors are used to solve a given
task, compared with how much effort is wasted in
communication and synchronization.
• Ideal condition of a parallel system: S = p (speedup equal to the p
processing elements --- VERY RARE !!!)
• Efficiency (E) is given by: E = S/p = Ts/(p·Tp)
- When E = 1, it is the LINEAR case
- When E < 1, it is the REAL case
- When E << 1, the problem is parallelizable with low efficiency
EVALUATION METRICS (Cont…)
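A small worked example with hypothetical numbers (not taken from the slides): suppose a task takes Ts = 100 s on one processing element and Tp = 30 s on p = 4 identical processing elements. Then:

```latex
S = \frac{T_s}{T_p} = \frac{100}{30} \approx 3.33 < p \quad \text{(real speedup)}, \qquad
E = \frac{S}{p} = \frac{3.33}{4} \approx 0.83 < 1 \quad \text{(real case)}
```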
SCALABILITY
• RULE: Efficiency decreases with increasing P; increases
with increasing N.
But here are the fundamental questions:
1- How effectively the parallel algorithm can use an
increasing number of processors ?
2- How the amount of computation performed must scale
with P to keep E constant ?
• SCALING is simply the ability to remain efficient on a parallel machine.
- It relates computing power (how fast tasks are executed)
to the number of processors.
- IF we increase the problem size (n) and the number of processors (p) at
the same time, THERE WILL BE NO LOSS IN TERMS OF PERFORMANCE.
- It all depends on how the increases are made, so that efficiency is
maintained or improved.
SCALABILITY: EVALUATION METRICS (Cont…)
APPRECIATING SPEED AND EFFICIENCY
Note: Serial sections limit the parallel effectiveness
REMEMBER: if you have a lot of serial computation then you
will not get good speedup, BECAUSE
- only the absence of serial work “allows” perfect speedup
- REFER TO Amdahl's Law to appreciate this
THE AMDAHL'S LAW
• How many processors can we really use?
Let's say we have a legacy code such that it is only
feasible to convert half of the heavily used routines
to parallel.
• Amdahl's Law is widely used to design processors and parallel
algorithms
• Statement: the maximum speedup that can be achieved is limited
by the serial component of the program: S = 1/(1-p), with
(1-p) being the serial component (the part not parallelized) and p the parallelized fraction of the program.
Example: a program has 90% of its code parallelized and 10% that must remain
serial. What is the maximum achievable speedup?
Answer: with (1-p) = 0.1, S = 1/0.1 = 10.
THE AMDAHL'S LAW
If we run this on a parallel machine with five processors:
Our code now takes about 60 s (assuming the serial version took about 100 s).
 We have sped it up by about 40%.
Let's say we use a thousand processors:
We have now sped up our code by about a factor of two. Is this a
big enough win?
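A hedged C sketch that reproduces these numbers from Amdahl's Law under the stated assumptions (a 100 s serial run with serial fraction s = 0.5, matching the half-convertible legacy code above):

```c
#include <stdio.h>

/* Amdahl's Law: run time with p processors when a fraction s stays serial. */
static double parallel_time(double serial_time, double s, int p) {
    return serial_time * (s + (1.0 - s) / p);
}

int main(void) {
    const double t1 = 100.0, s = 0.5;      /* assumed: 100 s run, half serial */
    const int procs[] = { 5, 1000 };
    for (int i = 0; i < 2; i++) {
        double tp = parallel_time(t1, s, procs[i]);
        printf("p=%4d: time=%6.2f s, speedup=%.2f\n", procs[i], tp, t1 / tp);
    }
    return 0;   /* prints ~60 s (1.67x) for p=5 and ~50.05 s (~2x) for p=1000 */
}
```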
THE AMDAHL'S LAW (Cont…)
UNDERSTANDING THE AMDAHL'S LAW
UNDERSTANDING THE AMDAHL'S LAW (Cont…)
UNDERSTANDING THE AMDAHL'S LAW (Cont…)
• The most fundamental limitation on parallel speedup:
if a fraction s of the execution time is serial, then speedup < 1/s.
Is this realistic?
We know that inherently parallel code can be executed in “no time”,
but inherently sequential code still needs a fraction s of the time.
Example: if s is 0.2 then the speedup cannot exceed 1/0.2 = 5.
- Using p processors, we can find the speedup:
- the total sequential execution time on a single processor is
normalized to 1;
- the serial code on p processors still requires a fraction s of that time;
- the parallel code on p processors requires a fraction (1 – s)/p of that time.
THE AMDAHL'S LAW (Cont…)
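Putting the bullet points above together gives the usual closed form of Amdahl's Law:

```latex
T(p) = s + \frac{1-s}{p}, \qquad
S(p) = \frac{T(1)}{T(p)} = \frac{1}{s + \frac{1-s}{p}}, \qquad
\lim_{p \to \infty} S(p) = \frac{1}{s}
```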
THE AMDAHL'S LAW Vs REALITY
Example: 2-phase calculation
* Sweep over an n-by-n grid and do some independent computation
* Sweep again and add each value to a global sum
- Time for the first phase on p parallel processors = n²/p
- The second phase is serialized at the global variable, so its time = n²
Speedup <= 2n² / (n²/p + n²),
or at most 2 for large p
Improvement: divide the second phase into two steps
- Accumulate p private sums during the first sweep
- Add the per-process private sums into the global sum
- The parallel time is: n²/p + n²/p + p,
and
speedup <= 2n² / (2n²/p + p), which approaches p for large n
A PRACTICAL APPLICATION OF AMDAHL'S LAW
Amdahl's Law is based on a fixed workload or fixed problem size.
 It implies that the sequential part of a program does
not change with respect to machine size (i.e., the
number of processors).
 The parallel part is evenly distributed over P processors.
 Gustafson's idea was to select or reformulate problems in order to minimize the sequential
part of a program, so that solving a larger problem in the same amount of time becomes
possible.
This law therefore considers that:
- while increasing the size of a problem, its sequential part remains constant;
- while increasing the number of processors, the work required on each of them still remains the
same.
Mathematically: S(P) = P - α(P-1), with P the number of processors, S the speedup and α
the non-parallelizable (serial) fraction of the process.
NOTE: this expression contrasts with Amdahl's Law, which considers a single-process execution
time as a fixed quantity and compares it to a shrinking per-process parallel execution time.
Amdahl assumes a fixed problem size because he believes that the overall workload of a
program does not change with the machine size (number of processors).
Gustafson's Law therefore addresses the deficiency of Amdahl's Law, which does not consider
the total number of computing resources involved in solving a task. Gustafson suggests
considering all computing resources if we intend to achieve efficient parallelism.
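A hedged C sketch contrasting the two laws for the same serial fraction (the value α = 0.1 and the processor counts are arbitrary illustration values): Amdahl's fixed-size speedup saturates at 1/α, while Gustafson's scaled speedup keeps growing almost linearly with P.

```c
#include <stdio.h>

int main(void) {
    const double alpha = 0.10;             /* serial (non-parallelizable) fraction */
    const int    procs[] = { 2, 8, 64, 1024 };

    for (int i = 0; i < 4; i++) {
        int p = procs[i];
        double amdahl    = 1.0 / (alpha + (1.0 - alpha) / p);  /* fixed problem size  */
        double gustafson = p - alpha * (p - 1);                /* scaled problem size */
        printf("P=%5d  Amdahl=%7.2f  Gustafson=%8.2f\n", p, amdahl, gustafson);
    }
    return 0;   /* Amdahl tends to 1/alpha = 10; Gustafson grows roughly as 0.9*P */
}
```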
FIXED SIZE VS SCALE SIZE
Let n be a measure of the problem size.
 The (normalized) execution time of the program on a parallel computer is
a(n) + b(n) = 1,
where a is the sequential fraction and b is the parallel
fraction.
 On a sequential computer the same work takes
a(n) + p·b(n), where p is the number of processors used in
the parallel case.
So, Speedup = a(n) + p·b(n) = a(n) + p·(1 - a(n)).
If the serial fraction a(n) diminishes with problem size n,
then the speedup approaches p as n approaches infinity, as
desired.
WHAT DO WE MEAN ?
WHAT ABOUT GUSTAFSON'S LAW
WHAT ABOUT GUSTAFSON'S LAW - Illustration
OPEN DEBATE?
• Parallel and Distributed Computing aims at satisfying the
requirements of next-generation computing systems by:
- providing a platform for fast processing;
- providing a platform where the management of large and
complex amounts of data no longer constitutes a
major bottleneck to the understanding of complex
phenomena.
The domain intends to provide a far better user
experience, provided the software development field succeeds in
satisfying the requirements of such a design, and the technology
finally solves the issues related to the noticeable disparity
between the growth of high-end processor clock rates and
memory access time.
CONCLUSION
• Kindly look at the diagram below and answer the following questions:
1- How do you classify such a design: serialization or parallelism? Justify your answer.
2- Kindly explain what M1, P1 and D1 represent
3 – what are the functions of :
- Processor-Memory Interconnection Network (PMIN)
- Input-Output-Processor Interconnection Network (IOPIN)
- Interrupt Signal Interconnection Network (ISIN)
4- Explain in your terms, the concept of Shared Memory System / Tightly Coupled System.
5- Flynn’s classification of Computers is based on multiplicity of instruction streams and data
streams observed by the CPU during program execution.
Can you identify another way Computer can be Classified? Elaborate the Concept according to the
author.
Submission Date: 4 October 2020 Time: 12 Pm
Email: malobecyrille.marcel@ictuniversity.org
NOTE: Late submission = - 50% of the assignment Points. Acceptable ONCE.
ASSIGNMENT 1b
END OF CHAPTER 1 - GENERAL INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING
Further Reading
• Recommended reading:"Designing and Building Parallel
Programs". Ian Foster.
https://www.mcs.anl.gov/~itf/dbpp/
• "Introduction to Parallel Computing". Ananth Grama, Anshul
Gupta, George Karypis, Vipin Kumar.
http://www-users.cs.umn.edu/~karypis/parbook/
• "Overview of Recent Supercomputers". A.J. van der Steen,
Jack Dongarra.
OverviewRecentSupercomputers.2008.pdf
REFERENCES
1. K. Hwang, Z. Xu, “Scalable Parallel Computing”, Boston: WCB/McGraw-Hill, c1998.
2. I. Foster, “ Designing and Building Parallel Programs”, Reading,
Mass: Addison-Wesley, c1995.
3. D. J. Evans, “Parallel SOR Iterative Methods”, Parallel Computing,
Vol.1, pp. 3-8, 1984.
4. L. Adams, “Reordering Computations for Parallel Execution”,
Commun. Appl. Numer. Methods, Vol.2, pp 263-271, 1985.
5. K. P. Wang and J. C. Bruch, Jr., “A SOR Iterative Algorithm for the
Finite Difference and Finite Element Methods that is Efficient and
Parallelizable”, Advances in Engineering Software, 21(1), pp. 37-48,
1994.
6. Lecture Notes on Parallel Computation, Stefan Boeriu, Kai-Ping Wang and John C.
Bruch Jr. Office of Information Technology and
Department of Mechanical and Environmental Engineering, University of California,
Santa Barbara, CA
7. John Mellor-Crummey , COMP 422/534 Parallel Computing: An Introduction ,
Department of Computer Science Rice University, johnmc@rice.edu, January 2020
8. Roshan Karunarathna, Introduction to parallel Computing,2020
9. Safwat Hamad, Distributed Computing, Lecture 1 - Introduction, FCIS Science Department - FCIS SC, 2020.

Weitere ähnliche Inhalte

Was ist angesagt?

load balancing in public cloud ppt
load balancing in public cloud pptload balancing in public cloud ppt
load balancing in public cloud pptKrishna Kumar
 
Cluster computing ppt
Cluster computing pptCluster computing ppt
Cluster computing pptDC Graphics
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster ComputingNIKHIL NAIR
 
Load Balancing in Cloud
Load Balancing in CloudLoad Balancing in Cloud
Load Balancing in CloudMphasis
 
Cluster Computers
Cluster ComputersCluster Computers
Cluster Computersshopnil786
 
Simulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightningSimulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightningCloudLightning
 
Cloud Computing System models for Distributed and cloud computing & Performan...
Cloud Computing System models for Distributed and cloud computing & Performan...Cloud Computing System models for Distributed and cloud computing & Performan...
Cloud Computing System models for Distributed and cloud computing & Performan...hrmalik20
 
An Efficient Decentralized Load Balancing Algorithm in Cloud Computing
An Efficient Decentralized Load Balancing Algorithm in Cloud ComputingAn Efficient Decentralized Load Balancing Algorithm in Cloud Computing
An Efficient Decentralized Load Balancing Algorithm in Cloud ComputingAisha Kalsoom
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsYong Feng
 
Task Scheduling Using Firefly algorithm with cloudsim
Task Scheduling Using Firefly algorithm with cloudsimTask Scheduling Using Firefly algorithm with cloudsim
Task Scheduling Using Firefly algorithm with cloudsimAqilIzzuddin
 

Was ist angesagt? (20)

Cluster Computing
Cluster ComputingCluster Computing
Cluster Computing
 
load balancing in public cloud ppt
load balancing in public cloud pptload balancing in public cloud ppt
load balancing in public cloud ppt
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster Computing
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
Cluster computer
Cluster  computerCluster  computer
Cluster computer
 
Hpc 1
Hpc 1Hpc 1
Hpc 1
 
Cluster computing2
Cluster computing2Cluster computing2
Cluster computing2
 
Cluster computing ppt
Cluster computing pptCluster computing ppt
Cluster computing ppt
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster Computing
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
cluster computing
cluster computingcluster computing
cluster computing
 
Load Balancing in Cloud
Load Balancing in CloudLoad Balancing in Cloud
Load Balancing in Cloud
 
Computer cluster
Computer clusterComputer cluster
Computer cluster
 
Cluster Computers
Cluster ComputersCluster Computers
Cluster Computers
 
Beowulf cluster
Beowulf clusterBeowulf cluster
Beowulf cluster
 
Simulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightningSimulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightning
 
Cloud Computing System models for Distributed and cloud computing & Performan...
Cloud Computing System models for Distributed and cloud computing & Performan...Cloud Computing System models for Distributed and cloud computing & Performan...
Cloud Computing System models for Distributed and cloud computing & Performan...
 
An Efficient Decentralized Load Balancing Algorithm in Cloud Computing
An Efficient Decentralized Load Balancing Algorithm in Cloud ComputingAn Efficient Decentralized Load Balancing Algorithm in Cloud Computing
An Efficient Decentralized Load Balancing Algorithm in Cloud Computing
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
 
Task Scheduling Using Firefly algorithm with cloudsim
Task Scheduling Using Firefly algorithm with cloudsimTask Scheduling Using Firefly algorithm with cloudsim
Task Scheduling Using Firefly algorithm with cloudsim
 

Ähnlich wie Chap 1(one) general introduction

Parallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxParallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxkrnaween
 
5.7 Parallel Processing - Reactive Programming.pdf.pptx
5.7 Parallel Processing - Reactive Programming.pdf.pptx5.7 Parallel Processing - Reactive Programming.pdf.pptx
5.7 Parallel Processing - Reactive Programming.pdf.pptxMohamedBilal73
 
Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1AbdullahMunir32
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computingMehul Patel
 
Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Marcirio Chaves
 
E5 05 ijcite august 2014
E5 05 ijcite august 2014E5 05 ijcite august 2014
E5 05 ijcite august 2014ijcite
 
Dataintensive
DataintensiveDataintensive
Dataintensivesulfath
 
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : Notes
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : NotesIs Multicore Hardware For General-Purpose Parallel Processing Broken? : Notes
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : NotesSubhajit Sahu
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptRubenGabrielHernande
 
Concurrency and Parallelism, Asynchronous Programming, Network Programming
Concurrency and Parallelism, Asynchronous Programming, Network ProgrammingConcurrency and Parallelism, Asynchronous Programming, Network Programming
Concurrency and Parallelism, Asynchronous Programming, Network ProgrammingPrabu U
 
System models for distributed and cloud computing
System models for distributed and cloud computingSystem models for distributed and cloud computing
System models for distributed and cloud computingpurplesea
 
2 parallel processing presentation ph d 1st semester
2 parallel processing presentation ph d 1st semester2 parallel processing presentation ph d 1st semester
2 parallel processing presentation ph d 1st semesterRafi Ullah
 
parallel computing.ppt
parallel computing.pptparallel computing.ppt
parallel computing.pptssuser413a98
 

Ähnlich wie Chap 1(one) general introduction (20)

Parallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxParallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptx
 
5.7 Parallel Processing - Reactive Programming.pdf.pptx
5.7 Parallel Processing - Reactive Programming.pdf.pptx5.7 Parallel Processing - Reactive Programming.pdf.pptx
5.7 Parallel Processing - Reactive Programming.pdf.pptx
 
Parallel Computing
Parallel ComputingParallel Computing
Parallel Computing
 
Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computing
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
Seminar
SeminarSeminar
Seminar
 
Par com
Par comPar com
Par com
 
Parallel computing persentation
Parallel computing persentationParallel computing persentation
Parallel computing persentation
 
Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
 
E5 05 ijcite august 2014
E5 05 ijcite august 2014E5 05 ijcite august 2014
E5 05 ijcite august 2014
 
Dataintensive
DataintensiveDataintensive
Dataintensive
 
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : Notes
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : NotesIs Multicore Hardware For General-Purpose Parallel Processing Broken? : Notes
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : Notes
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.ppt
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
 
Concurrency and Parallelism, Asynchronous Programming, Network Programming
Concurrency and Parallelism, Asynchronous Programming, Network ProgrammingConcurrency and Parallelism, Asynchronous Programming, Network Programming
Concurrency and Parallelism, Asynchronous Programming, Network Programming
 
System models for distributed and cloud computing
System models for distributed and cloud computingSystem models for distributed and cloud computing
System models for distributed and cloud computing
 
2 parallel processing presentation ph d 1st semester
2 parallel processing presentation ph d 1st semester2 parallel processing presentation ph d 1st semester
2 parallel processing presentation ph d 1st semester
 
parallel computing.ppt
parallel computing.pptparallel computing.ppt
parallel computing.ppt
 

Kürzlich hochgeladen

result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdfSuman Jyoti
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
  • 5. INTRODUCTION TO PARALLEL COMPUTING
WHAT IS PARALLEL COMPUTING (Parallel Execution)?
Traditionally, software is written for serial computation. That is:
– RUN on a single computer having a single Central Processing Unit (CPU);
– Break any given problem into a discrete series of instructions;
– Those instructions are then executed one after another;
– Important: only one instruction may execute at any moment in time.
Generally, serial processing compares to parallel processing as follows:
  • 6. INTRODUCTION TO PARALLEL COMPUTING (Cont...)
 CASE OF SERIAL COMPUTATION (figure: a payroll problem is solved by running its micro-programs one after another). What about executing these micro-programs simultaneously?
 PARALLEL COMPUTATION
• Here, the problem is broken into discrete parts that can be solved at the same time (concurrently)
• Discrete parts are then broken down into a series of instructions
• These instructions from each part execute simultaneously on different processors, under the supervision of an overall control/coordination mechanism
  • 7. INTRODUCTION TO PARALLEL COMPUTING (Cont..)
(figure: the problem is split into discrete parts)
It means that to compute in parallel:
– the problem must be broken apart into discrete pieces of work that can be solved simultaneously;
– at a given time t, instructions from multiple parts of the program should be executable;
– the time taken to solve the problem should be far shorter than with serial computation (a single compute resource).
WHO DOES THE COMPUTING? (See the sketch below.)
• A single computer with multiple processors/cores
• It can also be an arbitrary number of such computers connected by a network
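A minimal sketch (not from the slides) of the decomposition idea in C with POSIX threads: an array sum is broken into discrete parts, each part is computed concurrently by its own thread, and a coordination step joins the threads and combines the partial results. The file name in the compile command is hypothetical.

    /* sum.c -- break a sum into PARTS discrete pieces, one thread each.
     * Compile: gcc sum.c -pthread */
    #include <pthread.h>
    #include <stdio.h>

    #define N     1000000
    #define PARTS 4

    static double data[N];
    static double partial[PARTS];

    static void *work(void *arg) {
        long id = (long)arg;                 /* which discrete part this thread owns */
        long lo = id * (N / PARTS);
        long hi = (id == PARTS - 1) ? N : lo + (N / PARTS);
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += data[i];
        partial[id] = s;                     /* each part writes its own slot: no conflict */
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) data[i] = 1.0;
        pthread_t t[PARTS];
        for (long id = 0; id < PARTS; id++)
            pthread_create(&t[id], NULL, work, (void *)id);  /* parts run concurrently */
        double total = 0.0;
        for (long id = 0; id < PARTS; id++) {
            pthread_join(t[id], NULL);       /* coordination: wait for every part */
            total += partial[id];
        }
        printf("total = %f\n", total);
        return 0;
    }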
  • 8. CONCEPT AND TERMINOLOGY
There is a “jargon” used in the area of PARALLEL COMPUTING. Some key terms are:
– PARALLELISM: the ability to execute parts of a computation concurrently.
– Supercomputing / High Performance Computing (HPC): refers to the world's fastest and largest computers, with the ability to solve large problems.
– Node: a standalone «single» computer that will form the supercomputer once networked together.
– Thread: a unit of execution consisting of a sequence of instructions that is managed by either the operating system or a runtime system.
– CPU / Socket / Processor / Core: basically a singular execution component of a computer. Individual CPUs are subdivided into multiple cores, each an individual execution unit; SOCKET denotes a CPU package with multiple cores. The terminology can be confusing, but this is the center of computing operations.
– Task: a program or program-like set of instructions that is executed by a processor. Parallelism involves multiple tasks running on multiple processors.
– Shared Memory: an architecture where all processors have direct (usually bus-based) access to common physical memory.
– Symmetric Multi-Processor (SMP): a shared-memory hardware architecture where multiple processors share a single address space and have equal access to all resources.
– Granularity (grain size): for a given task, a measure of the amount of work (or computation) performed by that task. When a program is split into large tasks that each generate a large amount of computation, this is coarse-grained parallelism; when the split produces small tasks with little computation each, it is fine-grained.
– Massively Parallel: hardware of parallel systems with many processors (“many” = hundreds of thousands).
– Pleasantly Parallel: solving many similar but independent tasks simultaneously; requires very little communication.
– Scalability: a proportionate increase in parallel speedup with the addition of more processors.
  • 9. WHY PARALLEL COMPUTING?
TO address the limitations of serial computing:
– It is expensive to keep making a single processor faster.
– Serial speed is directly dependent upon how fast data can move through the hardware (the transmission bus); the distance between processing elements must be minimized to improve speed.
– It does not fit the reality that many events happen at the same time; we need a solution suitable for modeling and simulating complex real-world phenomena (example: modeling car or jet assembly processes, or traffic during rush hours).
Also:
o Physical limitations of hardware components
o Economic reasons – more complex = more expensive
o Performance limits – doubling the frequency <> doubling the performance
o Large applications – demand too much memory and time
SO… we need to:
– Save time – wall-clock time
– Solve larger problems in the most efficient way
– Provide concurrency (do multiple things at the same time)
IT MEANS… with more parallelism,
– We solve larger problems in the same time
– AND solve a fixed-size problem in a shorter time
NOTE: if we agree that most stand-alone computers have multiple functional units (L1 cache, L2 cache, branch, prefetch, decode, floating-point, graphics processing (GPU), integer, etc.), multiple execution units or cores, and multiple hardware threads, THEN all stand-alone computers today can be characterized as implementing PARALLEL COMPUTING.
  • 10. PARALLEL COMPUTING: THE FACTS
The future of computing cannot be conceived without parallel processing.
– Continuous development and expansion of the Internet, and improvements in network management schemes: having better means for a group of computers to cooperate in solving a computational problem will inevitably translate into a higher profile for clusters and grids in the landscape of parallel and distributed computing. Akl S.G., Nagy M. (2009)
– The continued growth of computer power will give parallel programs more ability to SCALE.
  • 11. Basic Design Architecture of Parallel Computing: the von Neumann Architecture
From "hard-wired" computers (computers programmed through wiring) to "stored-program" computers where both program instructions and data are kept in electronic memory, all computers share basically the same design, comprising four main components: Memory, Control Unit, Arithmetic Logic Unit and Input/Output.
– Read/write, random-access memory stores both program instructions (coded data that tell the computer to do something) and data (information to be used by the program).
– The control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.
– The arithmetic logic unit performs basic arithmetic operations.
– Input/Output is the interface to the human operator.
NOTE: parallel computers still follow this basic design; the units are simply multiplied. The basic, fundamental architecture remains the same.
  • 12. CLASSIFICATION OF PARALLEL COMPUTERS
• There are VARIOUS ways to classify parallel computers.
• THE MOST WIDELY USED CLASSIFICATION: Flynn's Taxonomy.
- Classification here is made along two independent dimensions: Instruction Stream and Data Stream.
- Each dimension can take only one of two states: Single or Multiple operations (= instructions).
The four resulting classes are:
– SISD: Single Instruction Stream, Single Data Stream — one CPU + memory
– SIMD: Single Instruction Stream, Multiple Data Streams — one CPU + multiple memory banks
– MISD: Multiple Instruction Streams, Single Data Stream — multiple CPUs + memory
– MIMD: Multiple Instruction Streams, Multiple Data Streams — multiple CPUs + multiple memory banks
  • 13. CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
The SISD: Single Instruction stream (only one instruction stream is acted on by the CPU during any one clock cycle), Single Data stream (only one data stream is used as input during any one clock cycle).
- This is the most common type of computer produced and used. Examples: workstations, PCs, etc.
- Here, only one CPU is present and an instruction operates on one data item at a time.
- Execution is deterministic.
(figure: a single CPU connected to memory by a bus carrying one instruction and one data operand)
  • 14. CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
SIMD: Single Instruction stream, Multiple Data Streams
• Computers in this class perform parallel computing and are known as VECTOR COMPUTERS.
• They were the first type of computer to combine a massive number of processors, with computational power above the gigaFLOP range.
• The machine executes ONE instruction stream but on MULTIPLE (different) data items, treated as multiple data streams.
• It means all processing units execute the same instruction at any given clock cycle (instruction stream), with the flexibility that each processing unit can operate on a different data element (data stream).
• This type of processing is suited to problems with a high degree of regularity. Example: graphics/image processing.
• The processing is characterized as synchronous and deterministic.
• There are two varieties of such processing: processor arrays and vector pipelines.
(figure: one CPU fetching a single instruction while reading operands from two independent memory banks)
  • 15. CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
With the Single Instruction stream, Multiple Data Stream (SIMD) design:
• Only one CPU is present in the computer. This CPU has:
- One instruction register but multiple Arithmetic Logic Units (ALUs), and it uses multiple data buses to fetch multiple operands simultaneously.
- The memory is divided into multiple banks that are accessed independently, with multiple data buses so the CPU can access data simultaneously.
Operational behaviour:
– Only 1 instruction is fetched by the CPU at a time.
– Some instructions, known as VECTOR INSTRUCTIONS (which fetch multiple operands simultaneously), operate on multiple data items at once.
EXAMPLES OF SIMD COMPUTERS
• Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
• Most modern computers integrating graphics processing units (GPUs) employ SIMD instructions and execution units (see the sketch below).
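An illustrative sketch, not from the slides, of small-scale SIMD on an ordinary x86 PC: one SSE vector instruction adds four floats at once, i.e. a single instruction stream acting on multiple data elements. It assumes an x86 machine with SSE support; the file name is hypothetical.

    /* simd.c -- one instruction, four additions.  Compile: gcc -msse simd.c */
    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     /* load 4 floats (4 data elements) */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* ONE instruction performs FOUR additions */
        _mm_storeu_ps(c, vc);

        for (int i = 0; i < 4; i++) printf("%.1f ", c[i]);
        printf("\n");
        return 0;
    }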
  • 16. CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
MIMD: Multiple Instruction streams, Multiple Data Streams
• Here, multiple instructions are executed at once, so there are multiple data streams.
• Each instruction operates on its own data independently (multiple operations on the same data are rare!).
• Two main types of MIMD computers: Shared Memory and Message Passing.
Shared Memory MIMD
– In shared-memory MIMD, all memory locations are accessible by all the processors (CPUs): this type of computer is a multiprocessor computer.
– Most workstations and high-end PCs are multiprocessor-based today.
(figure: several CPUs connected by buses to a common memory bank, each fetching its own instruction and operand)
  • 17. CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
Shared-memory MIMD computers are characterized by:
– Multiple CPUs in a computer sharing the same memory.
– Even though the CPUs are tightly coupled, each CPU fetches ONE instruction at a time t, and different CPUs can fetch different instructions, thereby generating multiple instruction streams.
– The memory must be designed with multiple access points (that is, organized into multiple independent memory banks) so that multiple instructions/operands can be transferred simultaneously. This structure helps avoid access conflicts between CPUs on the same memory bank.
Finally, an instruction still operates on one data item at a time!
  • 18. CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
 Message Passing MIMD (CLUSTER computing)
• In this architecture, multiple SISD computers are interconnected by a COMMUNICATION NETWORK.
• Each CPU has its own private memory; there is no sharing of memory among the various CPUs.
• Programs running on different CPUs can still exchange information if required; in that case, the exchange is done through MESSAGES.
• Message-passing computers are cheaper to manufacture than shared-memory MIMD machines.
• A dedicated HIGH-SPEED NETWORK SWITCH is needed to interconnect the SISD computers.
• MIMD message-passing computers always provide a message-passing API (Application Programming Interface) so that programmers can include, in their programs, statements that exchange messages. Example: the Message Passing Interface (MPI) — a minimal sketch follows below.
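A minimal MPI sketch (illustrative, not the course's own code): each process keeps private data, and rank 1 sends a value to rank 0 as an explicit message. The file name and launch command are typical but assumed.

    /* msg.c -- private memories, explicit message exchange.
     * Build/run (typical): mpicc msg.c && mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            int value;
            if (rank == 1) {
                value = 42;
                MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* message out */
            } else if (rank == 0) {
                MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);                          /* message in */
                printf("rank 0 received %d from rank 1\n", value);
            }
        }
        MPI_Finalize();
        return 0;
    }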
  • 19. CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
 Message Passing MIMD (CLUSTER computing) — operational architecture
(figure: several SISD nodes, each with its own CPU, memory and bus, exchanging data through a central SWITCH)
  • 20. CLASSIFICATION OF PARALLEL COMPUTERS (Cont…)
MISD: Multiple Instruction streams, Single Data Stream
Assignment 1a: research Multiple Instruction stream, Single Data stream (MISD) parallel computing. Emphasize:
1. Architecture design and modeling
2. Properties of such a design
3. Operational details
4. Practical examples and specifications
Submission date: 28 October 2020. Time: 12 PM. Email: malobecyrille.marcel@ictuniversity.org
NOTE: late submission = −50% of the assignment points. Acceptable ONCE.
  • 21. END OF PART 1
What are we saying? Virtually all stand-alone computers today are parallel. From a hardware perspective, computers have:
– Multiple functional units (floating point, integer, GPU, etc.)
– Multiple execution units / cores
– Multiple hardware threads
  • 22. CHECK YOUR PROGRESS ….. • Check Your Progress 1 1) What are various criteria for classification of parallel computers? ………………………………………………………………………………………… ………………………………………………………………………………………… 2) Define instruction and data streams. ………………………………………………………………………………………… ………………………………………………………………………………………… 3) State whether True or False for the following: If Is=Instruction Stream and Ds=Data Stream, a) SISD computers can be characterized as Is > 1 and Ds > 1 b) SIMD computers can be characterized as Is > 1 and Ds = 1 c) MISD computers can be characterized as Is = 1 and Ds = 1 d) MIMD computers can be characterized as Is > 1 and Ds > 1 4) Why do we need Parallel Computing ? ………………………………………………………………………………………… …………………………………………………………………………………………
  • 23. HIGH PERFORMANCE COMPUTING (HPC)
• Why do we need HPC?
1. Save time and/or money: the more resources you allocate to a given task, the faster you expect it to complete, which saves money; parallel clusters can also be built from cheap, commodity components.
2. Solve larger problems: many complex problems cannot be solved on a single computer, especially given its limited memory.
3. Provide concurrency: a single compute resource can only do one thing at a time, whereas multiple computing resources can do many things simultaneously.
4. Use non-local resources: HPC provides the flexibility to use compute resources on a wide area network, or even the Internet, when local compute resources are scarce.
SO… High-performance computing (HPC) is the use of parallel processing to run advanced application programs efficiently, reliably and quickly — CLOSE TO REALITY.
  • 24. HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
THE MOORE'S LAW PREDICTION
 Statement [1965]: “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.”
REVISION in [1975]: “There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors – bigger dies and finer dimensions.”
IT MEANS:
- Prioritize minimum size and improve power by increasing the number of transistors. That is…
- More transistors = more opportunities for exploiting parallelism.
If one buys into Moore's law — the transistor density of semiconductor chips doubles roughly every 18 months — the question still remains: how does one translate transistors into useful OPS (operations per second)? A tangible answer is to rely on parallelism, both implicit and explicit.
TWO possible ways to implement parallelism:
– Implicit parallelism: invisible to the programmer — pipelined execution of instructions, using a conventional language such as C, Fortran or Pascal to write the source program; the sequential source program is translated into parallel object code by a parallelizing compiler that detects parallelism and assigns target machine resources. This applies to programming shared multiprocessors and requires less effort from the programmer.
– Explicit parallelism: very long instruction words (VLIW), requiring more effort from the programmer to develop the source program — bundles of independent instructions that can be issued together, reducing the burden on the compiler to detect parallelism. Example: Intel Itanium processors, 2000–2017.
  • 25. HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
IS MOORE'S LAW STILL APPLICABLE?
• Up to the early 2000s, transistor count was a valid indication of how much additional processing power could be packed into a given area.
• Moore's law and Dennard scaling, when combined, held that more transistors could be packed into the same space and that those transistors would use less power while operating at a higher clock speed.
ARGUMENTS
– Because classic Dennard scaling no longer occurs at each smaller node, packing more transistors into a smaller space no longer guarantees lower total power consumption, and consequently no longer correlates directly with higher performance.
• The major limiting factor: hot-spot formation.
Possible way forward TODAY:
- Focus on improving CPU cooling (one of the biggest barriers to higher CPU clock speeds is hot spots), either by improving the efficiency of the thermal interface material (TIM), improving lateral heat transfer within the CPU itself, or making use of computational sprinting to increase thermal dissipation.
HOWEVER: this won't improve compute performance over sustained periods of time, BUT it would speed up latency-sensitive applications such as web-page loads or brief, intensive computations.
  • 26. HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
ILLUSTRATION: 42 Years of Microprocessor Trend Data. Orange: Moore's Law trend; Purple: Dennard scaling breakdown; Green & Red: immediate implications of the Dennard scaling breakdown; Blue: slowdown of single-thread performance growth; Black: the age of increasing parallelism.
SOURCE: Karl Rupp. 42 Years of Microprocessor Trend Data. https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/, 2018. [Online].
  • 27. HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
OTHER HPC LIMITING FACTORS
- Disparity between the clock-rate growth of high-end processors and memory access time: clock rates grew about 40%/year over the past decade while DRAM access time improved only about 10%/year over the same period. This is a significant performance bottleneck.
This issue is addressed by parallel computing by:
• providing increased bandwidth to the memory system
• offering higher aggregate caches.
This explains why some of the fastest-growing applications of parallel computing exploit not their raw computational speed, but rather their ability to pump data to memory and disk faster.
Source: How Multithreading Addresses the Memory Wall – Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/The-growing-CPU-DRAM-speed-gap-expressed-as-relative-speed-over-time-log-scale-The_fig1_37616249 [accessed 22 Oct 2020]
  • 28. HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
PROCESSOR EVOLUTION: THE CASE OF INTEL PROCESSORS
From 2017, 9 generations of processors have been developed.
(figure: Intel Core i10 processor. Source: retrieved from https://strackit.com/tech-news/intel-core-i10-processor/)
  • 29. HIGH PERFORMANCE COMPUTING (HPC) (Cont…)
• Differences between the Core i3, i5, i7 and i9 processor series (number of cores / specifications / cache size):
– Core i3: 2 physical cores. The cheapest processors; INTEL® HYPER-THREADING TECHNOLOGY adds 2 virtual cores to the 2 physical ones, so the operating system sees 4 cores. Cache: 3–4 MB.
– Core i5: 4 physical cores (some models have only 2 physical + 2 virtual cores). Higher performance comes from the 4 physical cores and the increased cache. Cache: 4 or 8 MB.
– Core i7: 4 to 8 physical cores, with INTEL® HYPER-THREADING TECHNOLOGY. Performance is increased by the virtual cores and a large cache; processors for mobile devices can have 2 physical cores. Cache: 8–20 MB.
– Core i9: 6 to 8 physical cores. The i9 series was conceived as a competitor to AMD gaming processors; more cores and more speed, but not by much — since the i9 is only slightly better than the i7, there is little sense in extending this processor line. Cache: 10–20 MB.
  • 30. SPEED: A NEED TO SOLVE COMPLEXITY
 ACHIEVING GREATER SPEED WILL HELP IN UNDERSTANDING VARIOUS PHENOMENA IN MANY DOMAINS OF LIFE.
• Science — understanding matter from elementary particles to cosmology — storm forecasting and climate prediction — understanding biochemical processes of living organisms
• Engineering — multi-scale simulations of metal additive manufacturing processes — understanding quantum properties of materials — understanding reaction dynamics of heterogeneous catalysts — earthquake and structural modeling — pollution modeling and remediation planning — molecular nanotechnology
• Business — computational finance and high-frequency trading — information retrieval — data mining of “big data”
• Defense — nuclear weapons stewardship
• Computers — embedded systems increasingly rely on distributed control algorithms — network intrusion detection, cryptography, etc. — optimizing the performance of modern automobiles — networks, mail servers, search engines — visualization
  • 31. SOME CASE STUDIES
• Parallelism finds applications in very diverse domains, for different motivating reasons, ranging from improved application performance to cost considerations.
CASE 1: Earthquake simulation in Japan
HOW DO WE PREVENT SUCH AN EVENT FROM HAPPENING AGAIN?
• We need computers able to pool computational power in order to simulate and run high-level calculations for better prediction of natural phenomena.
SOURCE: Earthquake Research Institute, University of Tokyo, Tonankai-Tokai Earthquake Scenario. Video credit: The Earth Simulator Art Gallery, CD-ROM, March 2004
  • 32. SOME CASE STUDIES (Cont…)
CASE 2: El Niño
El Niño is an anomalous, yet periodic, warming of the central and eastern equatorial Pacific Ocean. For reasons still not well understood, every 2–7 years this patch of ocean warms for six to 18 months. El Niño was strong through the Northern Hemisphere winter of 2015–16, with a transition to ENSO-neutral in May 2016.
HOW DO WE EXPLAIN SUCH A PHENOMENON?
Perhaps bringing the processors of distributed computers (placed in different locations) into collaboration can help provide an answer. Parallel programming must therefore be able to produce compatible software that contributes to the collection and analysis of data.
  • 33. CHALLENGES OF EXPLICIT PARALLELISM
• Most of the parallelism concepts we shall study take the explicit approach. Challenges related to explicit parallelism are:
– Algorithm development is harder — complexity of specifying and coordinating concurrent activities.
– Software development is much harder — lack of standardized, effective development tools and programming models — subtle program errors: race conditions.
– Rapid pace of change in computer system architecture — a great parallel algorithm for one machine may not be suitable for another (example: homogeneous multicore processors vs. GPUs).
  • 34. CHALLENGES OF PARALLELISM IN GENERAL
• Parallel science applications are often very sophisticated — e.g. adaptive algorithms may require dynamic load balancing.
• Multilevel parallelism is difficult to manage.
• Extreme scale exacerbates inefficiencies — algorithmic scalability losses — serialization and load imbalance — communication or I/O bottlenecks — insufficient or inefficient parallelization.
• It is hard to achieve top performance even on individual nodes — contention for shared memory bandwidth — memory hierarchy utilization on multicore processors.
  • 35. ACHIEVING HIGH PERFORMANCE ON PARALLEL SYSTEMS
• IT IS NOT ALL ABOUT COMPUTATION. There is also a need to:
– Improve memory latency and bandwidth, because — CPU rates are > 200x faster than memory — the speed gap is bridged using the memory hierarchy — more cores exacerbate the demand.
– Improve interprocessor communication.
– Improve input/output — I/O bandwidth to disk typically needs to grow linearly with the number of processors.
  • 36. HOW DO WE EXPRESS PARALLELISM IN A PROGRAM?
EXPLICITLY: define the tasks, work decomposition, data decomposition, communication and synchronization yourself. EXAMPLE: MPI is a library for fully explicit parallelization — “it is either all or nothing”.
IMPLICITLY: define only the tasks, with the rest implied; or define the tasks and the work decomposition, with the rest implied. EXAMPLE: OpenMP is a high-level parallel programming model which is mostly implicit (see the sketch below).
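A sketch, under the usual OpenMP conventions, of the "mostly implicit" style: a single directive marks the loop as parallel, and the runtime decides the work decomposition and synchronization. The file name is hypothetical.

    /* axpy.c -- implicit-style parallelism with one OpenMP directive.
     * Compile: gcc -fopenmp axpy.c */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        #pragma omp parallel for          /* work decomposition is implied */
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[i] + y[i];

        printf("y[0] = %f, threads available = %d\n", y[0], omp_get_max_threads());
        return 0;
    }

Compare this with the MPI sketch earlier: there, every send and receive had to be written out by hand.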
  • 37. QUICK VIEW ON THE STRUCTURE OF A PARALLEL PROGRAM
All parallel programs contain:
- parallel sections, and
- serial sections (serial sections occur when work is being duplicated, or when no useful work is being done, e.g. waiting for others).
We therefore need to build efficient algorithms by avoiding:
- communication delay
- idling
- synchronization overhead
  • 38. A WAY TO THINK: THE PARALLEL APPROACH
Generally, parallel thinking is closer to us than we believe. Daily, we try to do several things simultaneously while avoiding delay in any of them. Parallel-computing thinking is not far from this… For a given task to be done by many people, WE MAY ASK OURSELVES:
– How many people are involved in the work? (Degree of parallelism)
– What is needed to begin the work? (Initialization)
– Who does what? (Work distribution)
– How do we regulate access to each part of the work? (Data/IO access)
– Do they need information from each other to finish their own job? (Communication)
– When are they all done? (Synchronization)
– What needs to be done to collate the result?
  • 39. EVALUATION METRICS
• The development of parallel programming imposes the need for performance metrics and software tools to evaluate the performance of a parallel algorithm.
• Some factors help in achieving this goal:
- the type of hardware used
- the degree of parallelism of the problem
- the type of parallel model used
The goal is to compare what is obtained (the parallel program) with what was there (the original sequential program). The analysis focuses on the number of threads and/or the number of processes used.
Note: Amdahl's law will introduce the limitations related to parallel computation, and Gustafson's law will evaluate the degree of efficiency of the parallelization of a sequential algorithm.
  • 40. EVALUATION METRICS (Cont…)
 Relation between execution time and speedup S:
S(p, n) = T(1, n) / T(p, n)
- Usually, S(p, n) < p
- Sometimes S(p, n) > p (super-linear speedup)
 Efficiency E:
E(p, n) = S(p, n) / p
- Usually E(p, n) < 1; sometimes greater than 1
 Scalability: limitations in parallel computing, in relation to n and p.
SPEEDUP MEASUREMENT (S)
• Speedup is a MEASURE.
• It helps quantify the benefit of solving a problem in parallel.
• It is given by the ratio of the time taken to solve a problem on a single processing element (Ts) to the time required to solve the same problem on p identical processing elements (Tp).
• That is: S = Ts/Tp
- IF S = p (ideal condition): LINEAR SPEEDUP (the speed of execution scales with the number of processors)
- IF S < p: real speedup
- IF S > p: super-linear speedup
A small worked sketch follows below.
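A tiny helper, with invented timings, that applies the two formulas above to measured run times. The numbers t1, tp and p are assumptions for illustration only.

    /* metrics.c -- speedup and efficiency from measured times. */
    #include <stdio.h>

    int main(void) {
        double t1 = 120.0;   /* hypothetical serial time T(1,n), in seconds */
        double tp = 20.0;    /* hypothetical parallel time T(p,n)           */
        int    p  = 8;

        double S = t1 / tp;  /* S(p,n) = T(1,n) / T(p,n) */
        double E = S / p;    /* E(p,n) = S(p,n) / p      */

        printf("speedup S = %.2f (linear would be %d)\n", S, p);
        printf("efficiency E = %.2f\n", E);
        return 0;
    }

With these made-up numbers, S = 6 on 8 processors (real, not linear, speedup) and E = 0.75.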
  • 41. EVALUATION METRICS (Cont…)
EFFICIENCY (E)
• Another performance metric.
• It estimates how well the processors are used to solve a given task, compared with how much effort is wasted in communication and synchronization.
• Ideal condition of a parallel system: S = p (the speedup equals the number of processing elements — VERY RARE!).
• Efficiency is given by: E = S/p = Ts/(p·Tp)
- When E = 1: the linear case
- When E < 1: the real case
- When E << 1: a problem that is parallelizable, but with low efficiency
  • 42. EVALUATION METRICS (Cont…)
SCALABILITY
• RULE: efficiency decreases with increasing p and increases with increasing n. The fundamental questions are:
1- How effectively can the parallel algorithm use an increasing number of processors?
2- How must the amount of computation performed scale with p to keep E constant?
• SCALING is simply the ability to remain efficient on a parallel machine.
- It relates the computing power (how fast tasks are executed) to the number of processors.
- IF we increase the problem size (n) and the number of processors (p) at the same time, there need be no loss of performance.
- It all depends on how the increases are made, so that efficiency is maintained or improved (the toy model below illustrates the contrast).
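A sketch under an assumed toy cost model, T(p, n) = s + n/p, where s is a fixed serial overhead and n the problem size: it is not taken from the slides, but it shows efficiency falling when p grows with n fixed, and holding up when n grows together with p.

    /* scaling.c -- fixed-size vs scaled-size behaviour of a toy model. */
    #include <stdio.h>

    static double T(int p, double n) { double s = 10.0; return s + n / p; }

    int main(void) {
        double n = 1000.0;
        printf("fixed problem size (n = %.0f):\n", n);
        for (int p = 1; p <= 64; p *= 2) {
            double S = T(1, n) / T(p, n);
            printf("  p=%2d  S=%6.2f  E=%.2f\n", p, S, S / p);
        }
        printf("scaled problem size (n grows with p):\n");
        for (int p = 1; p <= 64; p *= 2) {
            double np = n * p;                 /* keep work per processor fixed */
            double S = T(1, np) / T(p, np);
            printf("  p=%2d  S=%6.2f  E=%.2f\n", p, S, S / p);
        }
        return 0;
    }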
  • 43. APPRECIATING SPEED AND EFFICIENCY
Note: serial sections limit the parallel effectiveness.
REMEMBER: if you have a lot of serial computation then you will not get good speedup, BECAUSE only the absence of serial work “allows” perfect speedup.
- REFER TO Amdahl's law to appreciate this truth.
  • 44. THE AMDAHL'S LAW
• How many processors can we really use? Let's say we have a legacy code such that it is only feasible to convert half of the heavily used routines to parallel.
• Amdahl's law is widely used to design processors and parallel algorithms.
• Statement: the maximum speedup that can be achieved is limited by the serial component of the program: S = 1/(1 − p), with (1 − p) being the serial component (the part not parallelized) of the program.
Example: a program has 90% of its code parallelized and 10% that must remain serial. What is the maximum achievable speedup?
Answer: with the serial fraction (1 − p) = 0.1, S = 1/0.1 = 10, no matter how many processors are used (a small sketch follows below).
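A short sketch of Amdahl's law in its processor-count form, S(p) = 1 / (s + (1 − s)/p), applied to the 10%-serial example above; it prints how the speedup creeps toward, but never reaches, the 1/s limit.

    /* amdahl.c -- speedup vs processor count for a fixed serial fraction s. */
    #include <stdio.h>

    static double amdahl(double s, int p) { return 1.0 / (s + (1.0 - s) / p); }

    int main(void) {
        double s = 0.10;                       /* 10% of the code stays serial */
        int procs[] = {1, 2, 4, 8, 16, 64, 1024};
        for (int i = 0; i < 7; i++)
            printf("p = %4d  ->  S = %.2f\n", procs[i], amdahl(s, procs[i]));
        printf("limit as p -> infinity: %.1f\n", 1.0 / s);
        return 0;
    }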
  • 45. THE AMDAHL'S LAW (Cont…)
If we run this half-parallelizable code on a parallel machine with five processors (assuming the original serial run takes about 100 s), our code now takes about 50 + 50/5 = 60 s: we have sped it up by about 40%.
Let's say we use a thousand processors: 50 + 50/1000 ≈ 50 s. We have now sped our code up by about a factor of two. Is this a big enough win?
  • 49. THE AMDAHL'S LAW (Cont…)
• The most fundamental limitation on parallel speedup: if a fraction s of the execution time is serial, then the speedup is < 1/s.
Is this realistic? Inherently parallel code could, in the limit, be executed in “no time”, but the inherently sequential code still needs a fraction s of the time. Example: if s is 0.2, then the speedup cannot exceed 1/0.2 = 5.
Using p processors, we can derive the speedup:
- the total sequential execution time on a single processor is normalized to 1;
- the serial code still requires a fraction s of that time on p processors;
- the parallel code requires a fraction (1 − s)/p of that time on p processors;
- hence S(p) = 1 / (s + (1 − s)/p), which is bounded above by 1/s as p grows.
  • 51. A PRACTICAL APPLICATION OF AMDAHL'S LAW
Example: a 2-phase calculation
* Sweep over an n-by-n grid and do some independent computation;
* Sweep again and add each value to a global sum.
- Time for the first phase on p parallel processors = n²/p
- The second phase is serialized at the global variable, so its time = n²
- Speedup ≤ 2n² / (n² + n²/p), i.e. at most 2 for large p.
Improvement: divide the second phase into two steps
- accumulate p private sums during the first sweep;
- add the per-process private sums into the global sum.
- The parallel time is now n²/p + n²/p + p, and speedup ≤ 2n² / (2n²/p + p), which approaches p for large n (a sketch of this pattern follows below).
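An illustrative OpenMP sketch of the improved scheme: each thread accumulates a private partial sum over its share of the grid, and only the p partial sums are combined at the end — which is exactly what the reduction clause does. The file name is hypothetical.

    /* grid.c -- grid sum with private partial sums, combined once at the end.
     * Compile: gcc -fopenmp grid.c */
    #include <omp.h>
    #include <stdio.h>

    #define N 1024

    static double grid[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                grid[i][j] = 1.0;

        double sum = 0.0;
        /* reduction(+:sum): every thread gets a private copy of sum; the p
           private copies are added together once, when the region ends. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += grid[i][j];

        printf("sum = %.0f (expected %d)\n", sum, N * N);
        return 0;
    }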
  • 52. FIXED SIZE VS SCALED SIZE
Amdahl's law is based on a fixed workload or fixed problem size.
– It implies that the sequential part of a program does not change with the machine size (i.e., the number of processors).
– The parallel part is assumed to be evenly distributed over the p processors.
– Gustafson's law was to select or reformulate problems so as to minimize the sequential part of a program, so that solving a larger problem in the same amount of time becomes possible. This law therefore considers that:
- while increasing the size of a problem, its sequential part remains constant;
- while increasing the number of processors, the work required on each of them remains the same.
Mathematically: S(P) = P − α(P − 1), where P is the number of processors, S the speedup and α the non-parallelizable fraction of the process.
NOTE: this expression contrasts with Amdahl's law, which treats the single-process execution time as a fixed quantity and compares it to a shrinking per-process parallel execution time. Amdahl assumes a fixed problem size because he considers that the overall workload of a program does not change with the machine size (number of processors). Gustafson's law therefore addresses a deficiency of Amdahl's law, which does not consider the total computing resources involved in solving a task: Gustafson suggests considering all computing resources if we intend to achieve efficient parallelism.
  • 53. WHAT ABOUT GUSTAFSON'S LAW?
Let n be a measure of the problem size.
– On the parallel computer, the (normalized) execution time of the program is a(n) + b(n) = 1, where a is the sequential fraction and b is the parallel fraction.
– On a sequential computer, the same scaled problem would take a(n) + p·b(n), where p is the number of processors used in the parallel case.
Hence the scaled speedup = a(n) + p·b(n) = a(n) + p·(1 − a(n)).
If the serial fraction a(n) diminishes with the problem size n, then the speedup approaches p as n approaches infinity, as desired. WHAT DO WE MEAN? (See the sketch below.)
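A small comparison sketch, with an assumed serial fraction α = 0.1, showing how the two laws diverge: Amdahl fixes the problem size, so the speedup saturates near 1/α, while Gustafson scales the problem with the machine, so S(P) = P − α(P − 1) keeps growing.

    /* laws.c -- Amdahl (fixed size) vs Gustafson (scaled size) speedups. */
    #include <stdio.h>

    int main(void) {
        double alpha = 0.10;                        /* non-parallelizable fraction */
        for (int P = 1; P <= 1024; P *= 4) {
            double amdahl    = 1.0 / (alpha + (1.0 - alpha) / P);
            double gustafson = P - alpha * (P - 1); /* S(P) = P - alpha(P - 1) */
            printf("P = %4d   Amdahl S = %6.2f   Gustafson S = %7.2f\n",
                   P, amdahl, gustafson);
        }
        return 0;
    }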
  • 56. CONCLUSION
• Parallel and distributed computing aims at satisfying the requirements of next-generation computing systems by:
- providing a platform for fast processing;
- providing a platform where the management of large and complex amounts of data no longer constitutes a major bottleneck to the understanding of complex phenomena.
The domain intends to provide a far better user experience, provided the software-development field succeeds in satisfying the requirements of such designs, and the technology finally solves the issues related to the noticeable disparity between the clock-rate growth of high-end processors and memory access time.
  • 57. ASSIGNMENT 1b
• Kindly look at the diagram below and answer the following questions:
1- How do you classify such a design: serialization or parallelism? Justify your answer.
2- Kindly explain what M1, P1 and D1 represent.
3- What are the functions of:
- the Processor-Memory Interconnection Network (PMIN)
- the Input-Output-Processor Interconnection Network (IOPIN)
- the Interrupt Signal Interconnection Network (ISIN)
4- Explain, in your own terms, the concept of a Shared Memory System / Tightly Coupled System.
5- Flynn's classification of computers is based on the multiplicity of instruction streams and data streams observed by the CPU during program execution. Can you identify another way computers can be classified? Elaborate the concept according to its author.
Submission date: 4 October 2020. Time: 12 PM. Email: malobecyrille.marcel@ictuniversity.org
NOTE: late submission = −50% of the assignment points. Acceptable ONCE.
  • 59. Further Reading
Recommended reading:
• "Designing and Building Parallel Programs". Ian Foster. https://www.mcs.anl.gov/~itf/dbpp/
• "Introduction to Parallel Computing". Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/
• "Overview of Recent Supercomputers". A.J. van der Steen, Jack Dongarra. OverviewRecentSupercomputers.2008.pdf
  • 60. REFERENCES
1. K. Hwang, Z. Xu, “Scalable Parallel Computing”, Boston: WCB/McGraw-Hill, c1998.
2. I. Foster, “Designing and Building Parallel Programs”, Reading, Mass.: Addison-Wesley, c1995.
3. D. J. Evans, “Parallel SOR Iterative Methods”, Parallel Computing, Vol. 1, pp. 3-8, 1984.
4. L. Adams, “Reordering Computations for Parallel Execution”, Commun. Appl. Numer. Methods, Vol. 2, pp. 263-271, 1985.
5. K. P. Wang and J. C. Bruch, Jr., “A SOR Iterative Algorithm for the Finite Difference and Finite Element Methods that is Efficient and Parallelizable”, Advances in Engineering Software, 21(1), pp. 37-48, 1994.
6. Lecture Notes on Parallel Computation, Stefan Boeriu, Kai-Ping Wang and John C. Bruch Jr., Office of Information Technology and Department of Mechanical and Environmental Engineering, University of California, Santa Barbara, CA.
7. John Mellor-Crummey, COMP 422/534 Parallel Computing: An Introduction, Department of Computer Science, Rice University, johnmc@rice.edu, January 2020.
8. Roshan Karunarathna, Introduction to Parallel Computing, 2020.
9. Safwat Hamad, Distributed Computing, Lecture 1 – Introduction, FCIS Science Department – FCIS SC, 2020.

Editor's notes

  1. Moore’s Law: the number of Transistors on a Microchip doubles every two years. So we can expect the speed and capability of our computers to increase every couple of years