UBa/NAHPI-2020
Department of Computer Engineering
PARALLEL AND DISTRIBUTED
COMPUTING
By
Malobe LOTTIN Cyrille .M
Network and Telecoms Engineer
PhD Student- ICT–U USA/CAMEROON
Contact
Email: malobecyrille.marcel@ictuniversity.org
Phone: 243004411/695654002
CHAPTER 2
Parallel and Distributed Computer
Architectures, Performance Metrics
And Parallel Programming Models
Previous … Chap 1: General Introduction (Parallel and Distributed Computing)
CONTENTS
• INTRODUCTION
• Why Parallel Architecture?
• Modern Classification of Parallel Computers
• Structural Classification of Parallel Computers
• Parallel Computers Memory Architectures
• Hardware Classification
• Performance of Parallel Computers architectures
- Peak and Sustained Performance
• Measuring Performance of Parallel Computers
• Other Common Benchmarks
• Parallel Programming Models
- Shared Memory Programming Model
- Thread Model
- Distributed Memory
- Data Parallel
- SPMD/MPMD
• Conclusion
Exercises (Check your Progress, Further Reading and Evaluation)
Previously on Chap 1
 Part 1- Introducing Parallel and Distributed Computing
• Background Review of Parallel and Distributed Computing
• INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING
• Some keys terminologies
• Why parallel Computing?
• Parallel Computing: the Facts
• Basic Design Computer Architecture: the von Neumann Architecture
• Classification of Parallel Computers (SISD,SIMD,MISD,MIMD)
• Assignment 1a
 Part 2- Initiation to Parallel Programming Principles
• High Performance Computing (HPC)
• Speed: a need to solve Complexity
• Some Case Studies Showing the need of Parallel Computing
• Challenge of explicit Parallelism
• General Structure of Parallel Programs
• Introduction to Amdahl's LAW
• The GUSTAFSON’s LAW
• SCALABILITY
• Fixed Size Versus Scaled Size
• Assignment 1b
• Conclusion
INTRODUCTION
• Parallel Computer Architecture is the method of organizing and
maximizing computer resources to achieve maximum
performance.
- Performance, at any instant in time, is achievable only within the limits
set by the technology.
- The same system may be characterized both as "parallel" and
"distributed"; the processors in a typical distributed system run
concurrently in parallel.
• The use of more processors to compute tasks simultaneously
contributes additional capability to computer systems.
• In a parallel architecture, processors may have access to a shared
memory during computation to exchange information between them.
Image source: Wikipedia, Distributed Computing, 2020
• In a Distributed architecture, each processor makes use of its own
private memory (distributed memory) during computation. In this
case, information is exchanged by passing messages between the
processors.
• Significant characteristics of distributed systems are: concurrency of
components, lack of a global clock (clock synchronization), and
independent failure of components.
• The use of distributed systems to solve computational problems is
called Distributed Computing (divide the problem into many tasks; each task is handled by one or
more computers, which communicate with each other via message passing).
• High-performance parallel computation on a shared-memory
multiprocessor uses parallel algorithms, while the coordination of
a large-scale distributed system uses distributed algorithms.
INTRODUCTION
Image source: Wikipedia, Distributed Computing, 2020
• Parallelism is nowadays in all levels of computer architectures.
• It is the enhancement of processors that largely explains the success in the
development of parallelism.
• Today, processors are superscalar (they execute several instructions in parallel each clock cycle).
- Besides, the underlying Very Large-Scale Integration (VLSI) technology has advanced,
allowing larger and larger numbers of components to fit on a chip and clock rates to increase.
• Three main elements define structure and performance of Multiprocessor:
- Processors
- Memory Hierarchies (registers, cache, main memory, magnetic discs, magnetic tapes)
- Interconnection Network
• But the performance gap between the processor and the memory is still
increasing ….
• Parallelism is used by computer architecture to translate the raw potential of
the technology into greater performance and expanded capability of the
computer system.
• Diversity in parallel computer architecture makes the field challenging to learn
and challenging to present.
INTRODUCTION ( Cont…)
Remember that:
A parallel computer is a collection of processing elements that
cooperate and communicate to solve large problems fast.
• The attempt to solve these large problems raises some fundamental
questions whose answers can only be found by understanding:
- the various components of Parallel and Distributed systems (design
and operation),
- how large a problem a given Parallel and Distributed system can
solve,
- how processors cooperate and communicate / transmit data between
them,
- the primitive abstractions that the hardware and software provide
to the programmer for better control,
- and how to ensure a proper translation into performance once these
elements are under control.
INTRODUCTION (Cont…)
Why Parallel Architecture ?
• No matter the performance of a single processor at a given time, we can
in principle achieve higher performance by utilizing many such processors,
as long as we are ready to pay the price (cost).
Parallel Architecture is needed To:
 Respond to Applications Trends
• Advances in hardware capability enable new application functionality, which
drives parallel architecture harder, since parallel architecture focuses on the
most demanding of these applications.
• At the low end, we have the largest volume of machines and greatest
number of users; at the high end, the most demanding applications.
• Consequence: pressure for increased performance  the most demanding
applications must be written as parallel programs to respond to this
demand generated from the high end.
 Satisfy the need of High Computing in the field of computational science
and engineering
- A response to the need to simulate physical phenomena that are impossible or very
costly to observe through empirical means (modeling global climate change
over long periods, the evolution of galaxies, the atomic structure of materials,
etc.)
 Respond to Technology Trends
• Can’t “wait for the single processor to get fast enough ”
Respond to Architectural Trends
• Advances in technology determine what is possible; architecture
translates the potential of the technology into performance and
capability .
• Four generations of computer architectures (tubes, transistors,
integrated circuits, and VLSI), where the strong distinction is a function of
the type of parallelism implemented (bit-level parallelism  4 bits
to 64 bits; 128 bits is the future).
• There have been tremendous architectural advances over this period:
bit-level parallelism, instruction-level parallelism, thread-level
parallelism.
All these forces driving the development of parallel architectures are
summed up in one main quest: achieve absolute maximum
performance (Supercomputing)
Why Parallel Architecture ? (Cont …)
Modern Classification
According to (Sima, Fountain, Kacsuk)
Before modern classification,
Recall Flynn’s taxonomy classification of Computers
- based on the number of instructions that can be executed and how they operate on data.
Four Main Types:
• SISD: traditional sequential architecture
• SIMD: processor arrays, vector processor
• Parallel computing on a budget – reduced control unit cost
• Many early supercomputers
• MIMD: most general purpose parallel computer today
• Clusters, MPP, data centers
• MISD: not a general purpose architecture
Note: Globally, four types of parallelism are implemented:
- Bit-Level Parallelism: performance of processors based on word size (bits)
- Instruction-Level Parallelism: gives processors the ability to execute more than one instruction
per clock cycle
- Task Parallelism: characterizes parallel programs
- Superword-Level Parallelism: based on vectorization techniques
Computer Architectures: SISD, SIMD, MIMD, MISD
• Classification here is based on how parallelism is achieved
• by operating on multiple data: Data parallelism
• by performing many functions in parallel: Task parallelism (function)
• Control parallelism or task parallelism, depending on the level of the functional
parallelism.
Modern Classification
According to (Sima, Fountain, Kacsuk)
Parallel architectures divide into Data-parallel architectures and
Function-parallel architectures.
Function-parallel architectures:
- Different operations are performed on the same or different data
- Asynchronous computation
- Speedup is lower, as each processor executes a different thread or
process on the same or a different set of data
- The amount of parallelization is proportional to the number of
independent tasks to be performed
- Load balancing depends on the availability of the hardware and on
scheduling algorithms such as static and dynamic scheduling
- Applicability: pipelining
Data-parallel architectures:
- The same operations are performed on different subsets of the same data
- Synchronous computation
- Speedup is higher, as there is only one execution thread operating
on all sets of data
- The amount of parallelization is proportional to the input data size
- Designed for optimum load balance on multiprocessor systems
- Applicability: arrays, matrices
• Flynn’s classification focuses on the behavioral aspect of computers.
• Looking at the structure, parallel computers can be classified based on
how processors communicate with the memory.
 When multiprocessors communicate through global shared memory modules,
the organization is called a Shared memory computer or Tightly coupled system.
 When every processor in a multiprocessor system has its own local memory and
the processors communicate via messages transmitted between their local memories,
the organization is called a Distributed memory computer or Loosely coupled system.
Structural Classification of Parallel Computers
Parallel Computer Memory Architectures
Shared Memory Parallel Computer Architecture
- Processors can access all memory as global
address space
- Multi-processors can operate independently but
share the same memory resources
- Changes in a memory location effected by one
processor are visible to all other processors
Based on memory access time, we can
classify Shared memory Parallel Computers into
two:
 Uniform Memory Access (UMA)
 Non-Uniform Memory Access (NUMA)
Parallel Computer Memory Architectures (Cont…)
 Uniform Memory Access (UMA) (also known as Cache-Coherent
UMA, CC-UMA)
• Commonly represented today by Symmetric
Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
Note: Cache coherence is a hardware mechanism whereby any update of a
location in shared memory by one processor is announced to all the
other processors.
Source: Images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
Non-Uniform Memory Access (NUMA)
• The architecture often links two or more SMPs
such that:
- One SMP can directly access the memory of another SMP
- Not all processors have equal access time to all memories
- Memory access across the link is slower
Note: if cache coherence is implemented, then we can also call it
Cache-Coherent NUMA (CC-NUMA)
• The proximity of memory to CPUs on a Shared Memory parallel computer
makes data sharing between tasks fast and uniform.
• But there is a lack of scalability between memory and CPUs.
Parallel Computer Memory Architectures (Cont…)
Source: Images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
Bruce Jacob, ... David T. Wang, in Memory Systems, 2008
Parallel Computer Memory Architectures (Cont…)
 Distributed Memory Parallel Computer Architecture
• Comes in as many varieties as Shared Memory Computers.
• Requires a communication network to connect inter-processor memory.
- Each processor operates independently with its own local memory
- Changes made by an individual processor do not affect the memory of other
processors.
- Cache coherency does not apply here!
• Access to data in another processor is usually the task of the
programmer (explicitly define how and when data is communicated)
• This architecture is cost effective (it can use commodity, off-the-shelf
processors and networking).
• But the programmer carries more responsibility for data
communication between processors.
Source: Retrieved from
https://www.futurelearn.com/courses/supercomputing/0/steps/24022
Parallel Computer Memory Architectures (Cont…)
Source: Nikolaos Ploskas, Nikolaos Samaras, in GPU Programming in MATLAB, 2016
Parallel Computer Memory Architectures (Cont…)
Overview of Parallel Memory Architecture
Note: - The largest and fastest computers in the world today employ both
shared and distributed memory architectures (Hybrid Memory)
- In a hybrid design, the shared memory component can be a shared
memory machine and/or graphics processing units (GPUs)
- And the distributed memory component is the networking of multiple
shared memory/GPU machines
- This type of memory architecture will continue to prevail and increase
• Parallel computers can be roughly classified according to the level
at which the hardware in the parallel architecture supports
parallelism.
 Multicore Computing
 Symmetric multiprocessing (tightly coupled multiprocessing)
Hardware Classification
- Consists of a computer system with multiple
identical processors that share memory
and connect via a bus
- Usually comprises no more than 32 processors,
to minimize bus contention
- Symmetric multiprocessors are extremely
cost-effective
Retrieved from https://en.wikipedia.org/wiki/Parallel_computing#Bit-level_parallelism, 2020
- The processor includes multiple processing units (called "cores") on the
same chip.
- Issues multiple instructions per clock cycle from multiple instruction
streams
- Differs from a superscalar processor; but each core in a multi-core
processor can potentially be superscalar as well.
Superscalar: issues multiple instructions per clock cycle from one instruction stream
(thread).
- Example: IBM's Cell microprocessor in Sony PlayStation 3
 Distributed Computing (distributed memory multiprocessor)
Cluster Computing
Hardware Classification (Cont…)
• Not to be confused with Decentralized computing
- the allocation of resources (hardware + software) to individual
workstations
• Components are located on different networked computers,
which communicate and coordinate their actions by passing
messages to one another
• Components interact to achieve a common goal
• Characterized by concurrency of components, lack of a global
clock, and independent failure of components.
• Can include heterogeneous computations where some nodes
may perform a lot more computation, some perform very
little computation and a few others may perform specialized
functionality
• Example: Multiplayer Online games
• Loosely coupled computers that work together closely
• In some respects they can be regarded as a single computer
• Multiple standalone machines constitute a cluster and
are connected by a network.
• Computer clusters have each node set to perform the same
task, controlled and scheduled by software.
• Computer clustering relies on a centralized management
approach which makes the nodes available as orchestrated
shared servers.
• Example: IBM's Sequoia
Sources: Dinkar Sitaram, Geetha Manjunath, in Moving to the Cloud, 2012
Cisco Systems, 2003
PERFORMANCE METRICS
Performance of parallel architectures
 There are various ways to measure the performance of a parallel algorithm running
on a parallel processor.
 The most commonly used measurements are:
- Speed-up
- Efficiency / Isoefficiency
- Elapsed time (a very important factor)
- Price/performance (elapsed time for a program divided by the cost of the machine that ran the job)
Note: none of these metrics should be used independently of the run time of the parallel system.
The usual definitions are recalled below.
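As a reminder (standard textbook definitions, assumed here rather than taken from the slides):
- Speed-up: S(p) = T1 / Tp, where T1 is the run time on one processor and Tp the run time on p processors
- Efficiency: E(p) = S(p) / p
- Isoefficiency: how fast the problem size must grow with p so that the efficiency E(p) stays constant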
 Common metrics of Performance
• FLOPS and MIPS are units of measure for the numerical computing performance of a
computer
• Distributed computing uses the Internet to link personal computers to achieve more
FLOPS
- MIPS: million instructions per second
MIPS = instruction count / (execution time x 10^6)
- MFLOPS: million floating-point operations per second
MFLOPS = FP operations in program / (execution time x 10^6)
• Which metric is better?
• FLOPS is more closely related to the time of a task in numerical code;
the number of FP operations per program is determined by, e.g., the matrix size
(a worked example follows below).
See Chapter 1
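A quick worked illustration with hypothetical numbers (not taken from the course): a program that performs
500 x 10^6 floating-point operations and retires 4 x 10^9 instructions in 2 seconds achieves
MFLOPS = (500 x 10^6) / (2 x 10^6) = 250 MFLOPS
MIPS = (4 x 10^9) / (2 x 10^6) = 2000 MIPS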
“In June 2020, Fugaku turned in a High Performance Linpack (HPL) result
of 415.5 petaFLOPS, besting the now second-place Summit system by a
factor of 2.8x. Fugaku is powered by Fujitsu’s 48-core A64FX SoC,
becoming the first number one system on the list to be powered by ARM
processors. In single or further reduced precision, used in machine learning
and AI applications, Fugaku’s peak performance is over 1,000 petaflops (1
exaflops). The new system is installed at RIKEN Center for Computational
Science (R-CCS) in Kobe, Japan ” (wikipedia Flops, 2020).
Performance of parallel architectures
(Figure: performance trend of single-CPU versus parallel systems — "Here we are!", "Single CPU Performance", "The future")
Peak and sustained performance
Peak performance
• Measured in MFLOPS
• Highest possible MFLOPS when the system does nothing but
numerical computation
• Rough hardware measure
• Gives little indication of how the system will perform in practice.
Peak Theoretical Performance
• Node performance in GFlops = (CPU speed in GHz) x (number of
CPU cores) x (CPU instruction per cycle) x (number of CPUs per
node)
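Worked example with hypothetical figures (for illustration only): a node with 2 CPUs, each having 8 cores
running at 2.5 GHz and issuing 4 floating-point instructions per cycle, has
Peak = 2.5 GHz x 8 cores x 4 instructions/cycle x 2 CPUs = 160 GFlops per node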
Peak and sustained performance
• Sustained performance
• The MFLOPS rate that a program achieves over the entire run.
• Measuring sustained performance
• Using benchmarks
• Peak MFLOPS is usually much larger than sustained MFLOPS
• Efficiency rate = sustained MFLOPS / peak MFLOPS
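Continuing the hypothetical node above: if a program sustains 40 GFlops on that 160 GFlops node,
its efficiency rate is 40 / 160 = 25%.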
Measuring the performance of
parallel computers
• Benchmarks: programs that are used to measure the
performance.
• LINPACK benchmark: a measure of a system’s floating point
computing power
• Solving a dense N by N system of linear equations Ax=b
• Used to rank supercomputers in the TOP500 list.
No. 1 since June 2020
Fugaku is powered by Fujitsu’s 48-core A64FX SoC, becoming the first
number one system on the list to be powered by ARM processors.
Other common benchmarks
• Micro benchmark suites
• Numerical computing
• LAPACK
• ScaLAPACK
• Memory bandwidth
• STREAM
• Kernel benchmarks
• NPB (NAS parallel benchmark)
• PARKBENCH
• SPEC
• Splash
PARALLEL PROGRAMMING MODELS
A programming perspective of Parallelism implementation in parallel
and distributed Computer architectures
Parallel Programming Models
Parallel programming models exist as an abstraction above hardware
and memory architectures.
 There are commonly several parallel programming models used
• Shared Memory (without threads)
• Threads
• Distributed Memory / Message Passing
• Data Parallel
• Hybrid
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
 These models are NOT specific to a particular type of machine or
memory architecture (a given model can be implemented on any
underlying hardware).
Example: - SHARED memory model on a DISTRIBUTED memory
machine (machine memory is physically distributed across networked
machines, but appears at the user level as a single shared-memory global address
space --- Kendall Square Research (KSR) ALLCACHE ---)
Which Model to USE ??
There is no "best" model
However, there are certainly better implementations of some models than of others
Parallel Programming Models
Shared Memory Programming Model
(Without Threads)
• A thread is the basic unit to which the operating system allocates
processor time. Threads are the smallest sequences of programmed
instructions.
• In a Shared Memory programming model,
- Processes/tasks share a common address space, which they
read and write to asynchronously.
- They make use of mechanisms such as locks / semaphores to control
access to the shared memory, resolve contentions, and prevent race
conditions and deadlocks.
• This may be considered the simplest parallel programming model.
• Note: Locks, mutexes and semaphores are types of
synchronization objects in a shared-resource
environment. They are abstract concepts (a minimal code sketch follows below).
- A lock protects access to some kind of shared resource, and gives the
right to access the protected shared resource when owned.
For example, if you have a lockable object ABC you may:
- acquire the lock on ABC,
- take the lock on ABC,
- lock ABC,
- take ownership of ABC, or relinquish ownership of ABC if not needed
- Mutex (MUTual EXclusion): a lockable object that can be owned by
exactly one thread at a time
• Example: in C++, std::mutex, std::timed_mutex, std::recursive_mutex
- Semaphore: a very relaxed type of lockable object,
with a predefined maximum count and a current count.
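A minimal sketch of these synchronization objects in C, assuming POSIX threads and POSIX semaphores
are available (compile with -pthread); the thread count, slot count and variable names are illustrative
choices, not part of the course material:

/* A mutex serializes updates to a shared counter; a counting semaphore
   (maximum count 2) limits how many threads may enter the section at once. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t slots;                 /* semaphore with a predefined maximum count */
static long shared_counter = 0;     /* the protected shared resource             */

static void *worker(void *arg)
{
    sem_wait(&slots);               /* acquire one of the limited slots */
    pthread_mutex_lock(&lock);      /* take ownership of the lock       */
    shared_counter++;               /* safe update of the shared data   */
    pthread_mutex_unlock(&lock);    /* relinquish ownership             */
    sem_post(&slots);               /* release the slot                 */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    sem_init(&slots, 0, 2);         /* at most 2 threads inside at once */
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(&t[i], NULL);
    printf("counter = %ld\n", shared_counter);
    sem_destroy(&slots);
    return 0;
}

The mutex guarantees mutual exclusion on the counter, while the semaphore only bounds concurrency;
this is the lock / mutex / semaphore distinction described above.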
Shared Memory Programming Model (Cont..)
Advantages:
• No need to specify explicitly the communication of data between tasks,
so no need to implement “ownership”. Very advantageous for a programmer.
• All processes see and have equal access to shared memory.
• Open to simplification during the development of the program.
Disadvantages:
• It becomes more difficult to understand and manage data locality.
• Keeping data local to a given process conserves memory accesses,
cache refreshes and bus traffic, but controlling data locality is hard to
understand and may be beyond the control of the average user.
Shared Memory Programming Model (Cont..)
During Implementation,
• Case: stand-alone shared memory machines
- native operating systems, compilers and/or hardware provide support for
shared memory programming. E.g. the POSIX standard provides an API for using shared memory (see the sketch below).
• Case: distributed memory machines:
- memory is physically distributed across a network of machines, but made
global through specialized hardware and software
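A minimal sketch of the POSIX shared-memory API mentioned above, assuming a POSIX system with
shm_open/mmap (older Linux systems may need -lrt when linking); the segment name "/demo_shm" and the
stored value are hypothetical:

/* One process creates and maps a named shared-memory segment; any cooperating
   process that opens and maps the same name sees the same data. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    const char *name = "/demo_shm";               /* hypothetical segment name */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, sizeof(int));                    /* size the segment          */
    int *value = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);          /* map it into this process  */
    *value = 42;                                   /* visible to other processes
                                                      that map the same name    */
    printf("wrote %d to shared memory\n", *value);
    munmap(value, sizeof(int));
    close(fd);
    shm_unlink(name);                              /* remove the segment        */
    return 0;
}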
• This is a type of shared memory programming.
• Here, a single "heavy weight" process can have multiple "light weight",
concurrent execution paths.
• To understand this model, let us consider the execution of a main
program a.out , scheduled to run by the native operating system.
Thread Model
 a.out starts by loading and acquiring all of the necessary system and user resources
to run. This constitutes the "heavy weight" process.
 a.out performs some serial work, and then creates a number of tasks (threads) that
can be scheduled and run by the operating system concurrently.
 Each thread has local data, but also shares the entire resources of a.out ("light
weight") and benefits from a global memory view because it shares the memory
space of a.out.
 Synchronization and coordination are needed to ensure that no more than one thread is
updating the same global address at any time.
• During implementation, thread implementations commonly comprise:
 A library of subroutines that are called from within parallel source code
 A set of compiler directives embedded in either serial or parallel source
code.
Note: Often, the programmer is responsible for determining the parallelism.
• Unrelated standardization efforts have resulted in two very different
implementations of threads:
- POSIX Threads
* Specified by the IEEE POSIX 1003.1c standard (1995). C language only, part of Unix/Linux operating systems, and
very explicit parallelism -- requires significant programmer attention to detail.
- OpenMP (used for tutorials in the context of this course).
* Industry standard, compiler-directive based, portable / multi-platform (including Unix and Windows
platforms), available in C/C++ and Fortran implementations. Can be very easy and simple to use - provides for
"incremental parallelism"; can begin with serial code (a short sketch follows below).
Others include: - Microsoft threads
- Java, Python threads
- CUDA threads for GPUs
Thread Model (Cont…)
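A minimal OpenMP sketch in C of the thread model described above, assuming a compiler with OpenMP
support (e.g. compiled with -fopenmp); the printed messages are illustrative only:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("serial part: one heavy-weight process\n");

    #pragma omp parallel            /* create a team of light-weight threads */
    {
        int id = omp_get_thread_num();          /* thread-local data          */
        #pragma omp critical        /* synchronize access to shared stdout    */
        printf("hello from thread %d of %d\n", id, omp_get_num_threads());
    }                               /* implicit barrier: the threads join     */

    printf("serial part resumes\n");
    return 0;
}

The serial parts correspond to the "heavy weight" process a.out; the parallel region corresponds to the
"light weight" threads sharing its memory space.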
• In this model,
 A set of tasks use their own local memory during computation.
 Multiple tasks can reside on the same physical machine and/or across an arbitrary
number of machines.
 Tasks exchange data through communication (sending / receiving
messages).
 But there must be a certain process cooperation during data transfer.
During Implementation,
• The programmer is responsible for determining all parallelism
• Message passing implementations usually comprise a library of subroutines that
are embedded in source code (a short sketch follows below).
• MPI is the "de facto" industry standard for message passing.
- Message Passing Interface (MPI), specification available at http://www.mpi-
forum.org/docs/.
Distributed Memory / Message Passing Model
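A minimal MPI sketch in C of the message passing model, assuming an MPI implementation with mpicc is
available and the program is launched with at least two processes (e.g. mpirun -np 2); the transferred
value is hypothetical:

/* Each task has its own private memory; data moves only through explicit
   send/receive messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* task 0 owns the data ...    */
        value = 123;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* ... task 1 receives a copy  */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}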
Can also be referred to as the Partitioned Global Address Space (PGAS) model.
Here,
 Address space is treated globally
 Most of the parallel work focuses on performing operations on a data set
typically organized into a common structure, such as an array or cube
 A set of tasks work collectively on the same data structure, however, each task
works on a different partition of the same data structure.
 Tasks perform the same operation on their partition of work, for example, "add 4
to every array element“
 Can be implemented on shared memory architectures (the data structure is accessed through
global memory) and on distributed memory architectures (the global data structure
can be logically and/or physically split across tasks). A short sketch follows below.
Data Parallel Model
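A minimal data-parallel sketch in C using OpenMP (an assumption chosen for illustration; the PGAS
languages listed for the implementation, such as UPC or Coarray Fortran, express the same idea natively).
Every task applies the same operation -- "add 4 to every array element" -- to its own partition of the array:

#include <omp.h>
#include <stdio.h>

#define N 16

int main(void)
{
    int a[N];
    for (int i = 0; i < N; i++) a[i] = i;   /* the common data structure          */

    #pragma omp parallel for                /* partition the iterations among the
                                               threads: same operation applied to
                                               different subsets of the data      */
    for (int i = 0; i < N; i++)
        a[i] += 4;

    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}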
For the Implementation,
• There are various popular, and sometimes developmental, parallel
programming implementations based on the Data Parallel / PGAS model:
- Coarray Fortran, compiler dependent
* further reading: https://en.wikipedia.org/wiki/Coarray_Fortran
- Unified Parallel C (UPC), an extension to the C programming
language for SPMD parallel programming.
* further reading: http://upc.lbl.gov/
- Global Arrays, a shared-memory-style programming environment in the context of
distributed array data structures.
* further reading: https://en.wikipedia.org/wiki/Global_Arrays
Data Parallel Model ( Cont…)
Single Program Multiple Data (SPMD) and Multiple Program Multiple Data (MPMD)
are "high level" programming models (they can be built on top of any parallel programming
model).
SPMD (see the sketch below):
- Why SINGLE PROGRAM? All tasks execute their copy of the same
program (threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE DATA? All tasks may use different data.
- Intelligent enough: tasks do not necessarily have to execute the entire program.
MPMD:
- Why MULTIPLE PROGRAM? Tasks may execute different programs
(threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE DATA? All tasks may use different data.
- Not as intelligent as SPMD, but may be better suited for certain types
of problems (functional decomposition problems).
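A minimal SPMD sketch in C with MPI (an assumption: SPMD can equally be built on threads, data parallel
or hybrid models). Every process runs a copy of the same program but branches on its rank, so no task has
to execute the entire program; the split of roles is illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)                                          /* one task takes one role ...  */
        printf("rank 0 of %d: coordination work\n", size);
    else                                                    /* ... the others take another  */
        printf("rank %d of %d: computation work\n", rank, size);

    MPI_Finalize();
    return 0;
}

Under MPMD, by contrast, the different roles would be separate executables launched together
(e.g. with Open MPI/MPICH: mpirun -np 1 ./master : -np 3 ./worker, where the executable names are hypothetical).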
Conclusion
• Parallel computer architectures contribute to achieving maximum performance within the limits
given by the technology.
• Diversity in parallel computer architecture makes the field challenging to learn and challenging to
present
• Classification can be based on the number of instructions that can be executed and how they
operate on data- Flynn (SISD,SIMD,MISD,MIMD)
• Also, classification can be based on how parallelism is achieved (Data parallel architectures,
Function-parallel architectures)
• Classification can as well focus on how processors communicate with the memory (Shared
memory computer or tightly coupled system, Distributed memory computer or loosely coupled system)
• There must be a way to appreciate the performance of the parallel architecture
• FLOPS and MIPS are units of measure for the numerical computing performance of a computer.
• Parallelism is made possible by the implementation of adequate parallel programming models.
• The simplest model appears to be the Shared Memory Programming Model.
• SPMD and MPMD programming require mastery of the previous programming models for
proper implementation.
• How do we then design a Parallel Program for effective parallelism?
See Next Chapter: Designing Parallel Programs and understanding notion of
Concurrency and Decomposition.
Challenge your understanding
1- What is the difference between a Parallel Computer and Parallel Computing?
2- What do you understand by true data dependency and resource dependency?
3- Illustrate the notions of Vertical Waste and Horizontal Waste.
4- According to you, which of the design architectures can provide better performance? Use
performance metrics to justify your arguments.
5- What is a Concurrent-Read, Concurrent-Write (CRCW) PRAM?
6-
On this figure, we have an illustration of (a) a bus-based interconnect with no local caches and (b)
a bus-based interconnect with local memory/caches.
Explain the difference, focusing on:
- The design architecture
- The operation
- The pros and cons
7- Discuss HANDLER’S CLASSIFICATION of computer architectures compared to Flynn’s and other classifications.
Class Work Group and Presentation
• Purpose: Demonstrate the conditions for detecting potential
parallelism.
“Parallel computing requires that the segments to be executed
in parallel must be independent of each other. So, before
executing parallelism, all the conditions of parallelism between
the segments must be analyzed”.
Use Bernstein’s Conditions for the Detection of Parallelism to demonstrate when
instructions i1, i2, …, in can be said to be “parallelizable”.
REFERENCES
1. Xin Yuan, CIS4930/CDA5125: Parallel and Distributed Systems,
retrieved from http://www.cs.fsu.edu/~xyuan/cda5125/index.html
2. EECC722 – Shaaban, #1 lec # 3 Fall 2000, 9-18-2000
3. Blaise Barney, Lawrence Livermore National Laboratory,
https://computing.llnl.gov/tutorials/parallel_comp/#ModelsOverview,
last modified: 11/02/2020 16:39:01
4. J. Blazewicz et al., Handbook on Parallel and Distributed
Processing, International Handbooks on Information Systems,
Springer, 2000
5. Phillip J. Windley, Parallel Architectures, Lesson 6, CS462, Large
Scale Distributed Systems, 2020
6. A. Grama et al., Introduction to Parallel Computing, Lecture 3
END.

Weitere ähnliche Inhalte

Was ist angesagt?

Distributed computing
Distributed computingDistributed computing
Distributed computingshivli0769
 
daa-unit-3-greedy method
daa-unit-3-greedy methoddaa-unit-3-greedy method
daa-unit-3-greedy methodhodcsencet
 
Threads (operating System)
Threads (operating System)Threads (operating System)
Threads (operating System)Prakhar Maurya
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureBalaji Vignesh
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithmsDanish Javed
 
Architectural Development Tracks
Architectural Development TracksArchitectural Development Tracks
Architectural Development TracksANJALIG10
 
Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory SystemsArush Nagpal
 
multiprocessors and multicomputers
 multiprocessors and multicomputers multiprocessors and multicomputers
multiprocessors and multicomputersPankaj Kumar Jain
 
Chomsky classification of Language
Chomsky classification of LanguageChomsky classification of Language
Chomsky classification of LanguageDipankar Boruah
 
Multivector and multiprocessor
Multivector and multiprocessorMultivector and multiprocessor
Multivector and multiprocessorKishan Panara
 
Message passing ( in computer science)
Message   passing  ( in   computer  science)Message   passing  ( in   computer  science)
Message passing ( in computer science)Computer_ at_home
 
Parallel Processors (SIMD)
Parallel Processors (SIMD) Parallel Processors (SIMD)
Parallel Processors (SIMD) Ali Raza
 
Cloud Application architecture styles
Cloud Application architecture styles Cloud Application architecture styles
Cloud Application architecture styles Nilay Shrivastava
 
Process scheduling (CPU Scheduling)
Process scheduling (CPU Scheduling)Process scheduling (CPU Scheduling)
Process scheduling (CPU Scheduling)Mukesh Chinta
 

Was ist angesagt? (20)

operating system structure
operating system structureoperating system structure
operating system structure
 
Course outline of parallel and distributed computing
Course outline of parallel and distributed computingCourse outline of parallel and distributed computing
Course outline of parallel and distributed computing
 
Distributed computing
Distributed computingDistributed computing
Distributed computing
 
daa-unit-3-greedy method
daa-unit-3-greedy methoddaa-unit-3-greedy method
daa-unit-3-greedy method
 
Threads (operating System)
Threads (operating System)Threads (operating System)
Threads (operating System)
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer Architecture
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithms
 
Aca2 01 new
Aca2 01 newAca2 01 new
Aca2 01 new
 
Architectural Development Tracks
Architectural Development TracksArchitectural Development Tracks
Architectural Development Tracks
 
Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory Systems
 
multiprocessors and multicomputers
 multiprocessors and multicomputers multiprocessors and multicomputers
multiprocessors and multicomputers
 
Chomsky classification of Language
Chomsky classification of LanguageChomsky classification of Language
Chomsky classification of Language
 
Multivector and multiprocessor
Multivector and multiprocessorMultivector and multiprocessor
Multivector and multiprocessor
 
Lecture 03 - Synchronous and Asynchronous Communication - Concurrency - Fault...
Lecture 03 - Synchronous and Asynchronous Communication - Concurrency - Fault...Lecture 03 - Synchronous and Asynchronous Communication - Concurrency - Fault...
Lecture 03 - Synchronous and Asynchronous Communication - Concurrency - Fault...
 
Message passing ( in computer science)
Message   passing  ( in   computer  science)Message   passing  ( in   computer  science)
Message passing ( in computer science)
 
Parallel Processors (SIMD)
Parallel Processors (SIMD) Parallel Processors (SIMD)
Parallel Processors (SIMD)
 
Cloud Application architecture styles
Cloud Application architecture styles Cloud Application architecture styles
Cloud Application architecture styles
 
Process scheduling (CPU Scheduling)
Process scheduling (CPU Scheduling)Process scheduling (CPU Scheduling)
Process scheduling (CPU Scheduling)
 
TOC 3 | Different Operations on DFA
TOC 3 | Different Operations on DFATOC 3 | Different Operations on DFA
TOC 3 | Different Operations on DFA
 

Ähnlich wie Chap 2 classification of parralel architecture and introduction to parllel program. models

Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresCloudLightning
 
Parallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxParallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxkrnaween
 
Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1AbdullahMunir32
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptRubenGabrielHernande
 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Sudarshan Mondal
 
Computing notes
Computing notesComputing notes
Computing notesthenraju24
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptxAbcvDef
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelNikhil Sharma
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel ComputingAkhila Prabhakaran
 
introduction to cloud computing for college.pdf
introduction to cloud computing for college.pdfintroduction to cloud computing for college.pdf
introduction to cloud computing for college.pdfsnehan789
 
Week 1 lecture material cc
Week 1 lecture material ccWeek 1 lecture material cc
Week 1 lecture material ccAnkit Gupta
 
_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdfTyStrk
 
Week 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfWeek 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfJohn422973
 

Ähnlich wie Chap 2 classification of parralel architecture and introduction to parllel program. models (20)

Chap 1(one) general introduction
Chap 1(one)  general introductionChap 1(one)  general introduction
Chap 1(one) general introduction
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
CC unit 1.pptx
CC unit 1.pptxCC unit 1.pptx
CC unit 1.pptx
 
Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud Infrastructures
 
CCUnit1.pdf
CCUnit1.pdfCCUnit1.pdf
CCUnit1.pdf
 
Aca module 1
Aca module 1Aca module 1
Aca module 1
 
Parallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxParallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptx
 
Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.ppt
 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)
 
Computing notes
Computing notesComputing notes
Computing notes
 
Par com
Par comPar com
Par com
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptx
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented Model
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 
introduction to cloud computing for college.pdf
introduction to cloud computing for college.pdfintroduction to cloud computing for college.pdf
introduction to cloud computing for college.pdf
 
Week 1 lecture material cc
Week 1 lecture material ccWeek 1 lecture material cc
Week 1 lecture material cc
 
_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf
 
Week 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfWeek 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdf
 
Pdc lecture1
Pdc lecture1Pdc lecture1
Pdc lecture1
 

Kürzlich hochgeladen

Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...jabtakhaidam7
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiessarkmank1
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARKOUSTAV SARKAR
 
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...vershagrag
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Call Girls Mumbai
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdfKamal Acharya
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...HenryBriggs2
 
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptxrouholahahmadi9876
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxmaisarahman1
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksMagic Marks
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 

Kürzlich hochgeladen (20)

Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 

Chap 2 classification of parralel architecture and introduction to parllel program. models

  • 1. UBa/NAHPI-2020 DepartmentofComputer Engineering PARALLEL AND DISTRIBUTED COMPUTING By Malobe LOTTIN Cyrille .M Network and Telecoms Engineer PhD Student- ICT–U USA/CAMEROON Contact Email:malobecyrille.marcel@ictuniversity.org Phone:243004411/695654002
  • 2. CHAPTER 2 Parallel and Distributed Computer Architectures, Performance Metrics And Parallel Programming Models Previous … Chap 1: General Introduction (Parallel and Distributed Computing)
  • 3. CONTENTS • INTRODUCTION • Why parallel Architecture ? • Modern Classification of Parallel Computers • Structural Classification of Parallel Computers • Parallel Computers Memory Architectures • Hardware Classification • Performance of Parallel Computers architectures - Peak and Sustained Performance • Measuring Performance of Parallel Computers • Other Common Benchmarks • Parallel Programming Models - Shared Memory Programming Model - Thread Model - Distributed Memory - Data Parallel - SPMD/MPMD • Conclusion Exercises ( Check your Progress, Further Reading and Evaluation)
  • 4. Previously on Chap 1  Part 1- Introducing Parallel and Distributed Computing • Background Review of Parallel and Distributed Computing • INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING • Some keys terminologies • Why parallel Computing? • Parallel Computing: the Facts • Basic Design Computer Architecture: the von Neumann Architecture • Classification of Parallel Computers (SISD,SIMD,MISD,MIMD) • Assignment 1a  Part 2- Initiation to Parallel Programming Principles • High Performance Computing (HPC) • Speed: a need to solve Complexity • Some Case Studies Showing the need of Parallel Computing • Challenge of explicit Parallelism • General Structure of Parallel Programs • Introduction to the Amdahl's LAW • The GUSTAFSON’s LAW • SCALIBILITY • Fixed Size Versus Scale Size • Assignment 1b • Conclusion
  • 5. INTRODUCTION • Parallel Computer Architecture is the method that consist of Maximizing and organizing computer resources to achieve Maximum performance. - Performance at any instance of time, is achievable within the limit given by the technology. - The same system may be characterized both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel. • The use of more processors to compute tasks simultaneously contribute in providing more features to computers systems. • In the Parallel architecture, Processors during computation may have access to a shared memory to exchange information between them. • imagesSource:Wikipedia,DistributingComputing,2020
  • 6. • In a Distributed architecture, each processor during computation, make use of its own private memory (distributed memory). In this case, Information is exchanged by passing messages between the processors. • Significant characteristics of distributed systems are: concurrency of components, lack of a global clock (Clock synchronization) , and independent failure of components. • The use of distributed systems to solve computational problems is Called Distributed Computing (Divide problem into many tasks, each task is handle by one or more computers, which communicate with each other via message passing). • High-performance parallel computation operating shared-memory multiprocessor uses parallel algorithms while the coordination of a large-scale distributed system uses distributed algorithms. INTRODUCTION imagesSource:Wikipedia,DistributingComputing,2020
  • 7. • Parallelism is nowadays in all levels of computer architectures. • It is the Enhancements of Processors that justify the success in the development of Parallelism. • Today, they are superscalar (Execute several instructions in parallel each clock cycle). - besides, The advancement of the underlying Very Large-Scale Integration (VLSI )technology, which allows larger and larger numbers of components to fit on a chip and clock rates to increase. • Three main elements define structure and performance of Multiprocessor: - Processors - Memory Hierarchies (registers, cache, main memory, magnetic discs, magnetic tapes) - Interconnection Network • But, the gap of performance between the processor and the memory is still increasing …. • Parallelism is used by computer architecture to translate the raw potential of the technology into greater performance and expanded capability of the computer system • Diversity in parallel computer architecture makes the field challenging to learn and challenging to present. INTRODUCTION ( Cont…)
  • 8. Remember that: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. • The attempt to solve this large problems raises some fundamental questions which the answer can only by satisfy by understanding: - Various components of Parallel and Distributed systems( Design and operation), - How much problems a given Parallel and Distributed system can solve, - How processors corporate, communicate / transmit data between them, - The primitive abstractions that the hardware and software provide to the programmer for better control, - And, How to ensure a proper translation to performance once these elements are under control. INTRODUCTION (Cont…)
  • 9. Why Parallel Architecture ? • No matter the performance of a single processor at a given time, we can achieve in principle higher performance by utilizing many such processors so far we are ready to pay the price (Cost). Parallel Architecture is needed To:  Respond to Applications Trends • Advances in hardware capability enable new application functionality  drives parallel architecture harder, since parallel architecture focuses on the most demanding of these applications. • At the Low end level, we have the largest volume of machines and greatest number of users; at the High end, most demanding applications. • Consequence: pressure for increased performance  most demanding applications must be written as parallel programs to respond to this demand generated from the High end  Satisfy the need of High Computing in the field of computational science and engineering - A response to simulate physical phenomena impossible or very costly to observe through empirical means (modeling global climate change over long periods, the evolution of galaxies, the atomic structure of materials, etc…)
  • 10.  Respond to Technology Trends • Can’t “wait for the single processor to get fast enough ” Respond to Architectural Trends • Advances in technology determine what is possible; architecture translates the potential of the technology into performance and capability . • Four generation of Computer architectures (tubes, transistors, integrated circuits, and VLSI ) where strong distinction is function of the type of parallelism implemented ( Bit level parallelism  4-bits to 64 bits, 128 bits is the future). • There has been tremendous architectural advances over this period : Bit level parallelism, Instruction level Parallelism, Thread Level Parallelism All these forces driving the development of parallel architectures are resumed under one main quest: Achieve absolute maximum performance ( Supercomputing) Why Parallel Architecture ? (Cont …)
  • 11. Modernclassification Accordingto(Sima,Fountain,Kacsuk) Before modern classification, Recall Flynn’s taxonomy classification of Computers - based on the number of instructions that can be executed and how they operate on data. Four Main Type: • SISD: traditional sequential architecture • SIMD: processor arrays, vector processor • Parallel computing on a budget – reduced control unit cost • Many early supercomputers • MIMD: most general purpose parallel computer today • Clusters, MPP, data centers • MISD: not a general purpose architecture Note: Globally four type of parallelism are implemented: - Bit Level Parallelism: performance of processors based on word size ( bits) - Instruction Level Parallelism: give ability to processors to execute more than instruction per clock cycle - Task Parallelism: characterize Parallel programs - Superword Level Parallelism: Based on vectorization Techniques Computer Architectures SISD SIMD MIMD MISD
  • 12. • Classification here is based on how parallelism is achieved: • by operating on multiple data: data parallelism • by performing many functions in parallel: task parallelism (function parallelism) • control parallelism or task parallelism, depending on the level of the functional parallelism. Modern classification according to (Sima, Fountain, Kacsuk): parallel architectures divide into data-parallel architectures and function-parallel architectures.
Function-parallel architectures:
- Different operations are performed on the same or different data
- Asynchronous computation
- Speedup is lower, as each processor executes a different thread or process on the same or a different set of data
- The amount of parallelization is proportional to the number of independent tasks to be performed
- Load balancing depends on the availability of the hardware and on scheduling algorithms such as static and dynamic scheduling
- Applicability: pipelining
Data-parallel architectures:
- The same operations are performed on different subsets of the same data
- Synchronous computation
- Speedup is higher, as there is only one execution thread operating on all sets of data
- The amount of parallelization is proportional to the input data size
- Designed for optimum load balance on multiprocessor systems
- Applicability: arrays, matrices
  • 13. • Flynn's classification focuses on the behavioral aspect of computers. • Looking at the structure, parallel computers can be classified based on how processors communicate with memory.  When multiple processors communicate through globally shared memory modules, the organization is called a Shared memory computer or Tightly coupled system.  When every processor in a multiprocessor system has its own local memory and the processors communicate via messages transmitted between their local memories, the organization is called a Distributed memory computer or Loosely coupled system. Structural Classification of Parallel Computers
  • 14. Parallel Computer Memory Architectures. Shared Memory Parallel Computer architecture: - Processors can access all memory as a global address space - Multiple processors can operate independently but share the same memory resources - Changes in a memory location made by one processor are visible to all other processors. Based on memory access time, shared memory parallel computers fall into two classes:  Uniform Memory Access (UMA)  Non-Uniform Memory Access (NUMA)
  • 15. Parallel Computer Memory Architectures (Cont…)  Uniform Memory Access (UMA) (also known as Cache Coherent UMA, CC-UMA) • Commonly represented today by Symmetric Multiprocessor (SMP) machines • Identical processors • Equal access and access times to memory. Note: cache coherence is a hardware mechanism whereby any update of a location in shared memory by one processor is announced to all the other processors. Source: images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
  • 16. Non-Uniform Memory Access (NUMA) • The architecture often links two or more SMPs such that: - One SMP can directly access the memory of another SMP - Not all processors have equal access time to all memories - Memory access across the link is slower. Note: if cache coherence is maintained, the architecture is also called Cache Coherent NUMA (CC-NUMA). • The proximity of memory to CPUs on a shared memory parallel computer makes data sharing between tasks fast and uniform. • But there is a lack of scalability between memory and CPUs. Parallel Computer Memory Architectures (Cont…) Source: images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory; Bruce Jacob, ... David T. Wang, in Memory Systems, 2008
  • 18.  Distributed Memory Parallel Computer Architecture • Comes in as many varieties as shared memory computers. • Requires a communication network to connect inter-processor memory. - Each processor operates independently with its own local memory - Changes made by one processor do not affect the memory of other processors - Cache coherency does not apply here! • Access to data in another processor is usually the task of the programmer (who explicitly defines how and when data is communicated). • This architecture is cost effective (it can use commodity, off-the-shelf processors and networking). • But the programmer bears more responsibility for data communication between processors. Source: retrieved from https://www.futurelearn.com/courses/supercomputing/0/steps/24022 Parallel Computer Memory Architectures (Cont…)
  • 19. Source: Nikolaos Ploskas, Nikolaos Samaras, in GPU Programming in MATLAB, 2016. Parallel Computer Memory Architectures (Cont…) Overview of Parallel Memory Architecture. Note: - The largest and fastest computers in the world today employ both shared and distributed memory architectures (hybrid memory) - In a hybrid design, the shared memory component can be a shared memory machine and/or graphics processing units (GPUs) - The distributed memory component is the networking of multiple shared memory/GPU machines - This type of memory architecture is expected to continue to prevail and grow
  • 20. Hardware classification • Parallel computers can be roughly classified according to the level at which the hardware supports parallelism.  Multicore computing - The processor includes multiple processing units (called "cores") on the same chip - Issues multiple instructions per clock cycle from multiple instruction streams - Differs from a superscalar processor, which issues multiple instructions per clock cycle from one instruction stream (thread); but each core in a multi-core processor can potentially be superscalar as well - Example: IBM's Cell microprocessor in the Sony PlayStation 3.  Symmetric multiprocessing (tightly coupled multiprocessing) - A computer system with multiple identical processors that share memory and connect via a bus - Usually no more than 32 processors, in order to limit bus contention - Symmetric multiprocessors are extremely cost-effective. Retrieved from https://en.wikipedia.org/wiki/Parallel_computing#Bit-level_parallelism, 2020
  • 21. Hardware classification (Cont…)  Distributed computing (distributed memory multiprocessor) • Not to be confused with decentralized computing (the allocation of resources, hardware and software, to individual workstations) • Components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another • Components interact to achieve a common goal • Characterized by concurrency of components, lack of a global clock, and independent failure of components • Can include heterogeneous computations where some nodes perform a lot more computation, some perform very little, and a few others perform specialized functionality • Example: a multiplayer online game.  Cluster computing • Loosely coupled computers that work together closely; in some respects they can be regarded as a single computer • Multiple standalone machines constitute a cluster, connected by a network • Computer clusters have each node set to perform the same task, controlled and scheduled by software • Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers • Example: IBM's Sequoia. Sources: Dinkar Sitaram, Geetha Manjunath, in Moving To The Cloud, 2012; Cisco Systems, 2003
  • 23. Performance of parallel architectures  There are various ways to measure the performance of a parallel algorithm running on a parallel processor.  Most commonly used measurements: - Speed-up - Efficiency / isoefficiency - Elapsed time (a very important factor) - Price/performance: elapsed time for a program divided by the cost of the machine that ran the job. Note: none of these metrics should be used independently of the run time of the parallel system.  Common metrics of performance • FLOPS and MIPS are units of measure for the numerical computing performance of a computer • Distributed computing uses the Internet to link personal computers to achieve more FLOPS - MIPS: million instructions per second; MIPS = instruction count / (execution time x 10^6) - MFLOPS: million floating-point operations per second; MFLOPS = FP operations in program / (execution time x 10^6) • Which metric is better? • MFLOPS relates more directly to the run time of a numerical code, since the number of FLOP per program is determined by the problem (e.g. matrix) size. See Chapter 1
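  Worked example (hypothetical figures, for illustration only): a numerical kernel that performs 2 x 10^9 floating-point operations and 8 x 10^9 instructions, and finishes in 4 seconds, rates MFLOPS = 2 x 10^9 / (4 x 10^6) = 500 MFLOPS and MIPS = 8 x 10^9 / (4 x 10^6) = 2000 MIPS.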
  • 24. "In June 2020, Fugaku turned in a High Performance Linpack (HPL) result of 415.5 petaFLOPS, besting the now second-place Summit system by a factor of 2.8x. Fugaku is powered by Fujitsu's 48-core A64FX SoC, becoming the first number one system on the list to be powered by ARM processors. In single or further reduced precision, used in machine learning and AI applications, Fugaku's peak performance is over 1,000 petaflops (1 exaflops). The new system is installed at RIKEN Center for Computational Science (R-CCS) in Kobe, Japan" (Wikipedia, FLOPS, 2020). Performance of parallel architectures. [Chart annotations: "Here we are!", "Single CPU Performance", "The future"]
  • 25. Peak and sustained performance  Peak performance • Measured in MFLOPS • The highest possible MFLOPS when the system does nothing but numerical computation • A rough hardware measure • Gives little indication of how the system will perform in practice.  Peak theoretical performance • Node performance in GFlops = (CPU speed in GHz) x (number of CPU cores) x (floating-point operations per cycle per core) x (number of CPUs per node)
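  Worked example (hypothetical figures, for illustration only): a node with two CPUs, each running at 2.5 GHz with 16 cores and 16 floating-point operations per cycle per core, has a peak of 2.5 x 16 x 16 x 2 = 1280 GFlops.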
  • 26. Peak and sustained performance • Sustained performance • The MFLOPS rate that a program achieves over the entire run. • Measuring sustained performance • Using benchmarks • Peak MFLOPS is usually much larger than sustained MFLOPS • Efficiency rate = sustained MFLOPS / peak MFLOPS
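  Worked example (hypothetical figures, continuing the node above): if a benchmark sustains 150 GFlops on the 1280 GFlops node, its efficiency rate is 150 / 1280 ≈ 11.7%, illustrating how far sustained performance typically sits below peak.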
  • 27. Measuring the performance of parallel computers • Benchmarks: programs used to measure performance. • LINPACK benchmark: a measure of a system's floating-point computing power • Solves a dense N by N system of linear equations Ax = b • Used to rank supercomputers in the TOP500 list. No. 1 since June 2020: Fugaku, powered by Fujitsu's 48-core A64FX SoC, the first number-one system on the list to be powered by ARM processors.
  • 28. Other common benchmarks • Micro benchmark suites • Numerical computing: LAPACK, ScaLAPACK • Memory bandwidth: STREAM • Kernel benchmarks: NPB (NAS Parallel Benchmarks), PARKBENCH, SPEC, SPLASH
  • 29. PARALLEL PROGRAMMING MODELS A programming perspective on how parallelism is implemented in parallel and distributed computer architectures
  • 30. Parallel Programming Models Parallel programming models exist as an abstraction above hardware and memory architectures.  Several parallel programming models are in common use: • Shared Memory (without threads) • Threads • Distributed Memory / Message Passing • Data Parallel • Hybrid • Single Program Multiple Data (SPMD) • Multiple Program Multiple Data (MPMD).  These models are NOT specific to a particular type of machine or memory architecture (a given model can be implemented on any underlying hardware). Example: a SHARED memory model on a DISTRIBUTED memory machine (machine memory is physically distributed across networked machines, but appears at the user level as a single shared global address space, e.g. Kendall Square Research (KSR) ALLCACHE).
  • 31. Which model to use? There is no "best" model. However, there are certainly better implementations of some models than others. Parallel Programming Models
  • 32. Shared Memory Programming Model (Without Threads) • A thread is the basic unit to which the operating system allocates processor time; threads are the smallest sequences of programmed instructions that can be scheduled independently. • In the Shared Memory programming model: - Processes/tasks share a common address space, which they read and write to asynchronously. - They make use of mechanisms such as locks / semaphores to control access to the shared memory, resolve contention, and prevent race conditions and deadlocks. • This may be considered the simplest parallel programming model.
  • 33. • Note: locks, mutexes and semaphores are types of synchronization objects in a shared-resource environment. They are abstract concepts. - Lock: protects access to some kind of shared resource and grants the right to access that resource while it is owned. For example, if you have a lockable object ABC you may: acquire the lock on ABC, take the lock on ABC, lock ABC, take ownership of ABC, or relinquish ownership of ABC when it is no longer needed. - Mutex (MUTual EXclusion): a lockable object that can be owned by exactly one thread at a time. Example: in C++, std::mutex, std::timed_mutex, std::recursive_mutex. - Semaphore: a very relaxed type of lockable object, with a predefined maximum count and a current count. Shared Memory Programming Model (Cont..)
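  A minimal sketch of the lock idea, assuming POSIX threads are available (the C++ classes named above are an alternative): a mutex serializes updates to a shared counter so that concurrent threads do not race.

  #include <pthread.h>
  #include <stdio.h>

  static long counter = 0;                          /* shared resource */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg) {
      for (int i = 0; i < 100000; i++) {
          pthread_mutex_lock(&lock);                /* acquire the lock */
          counter++;                                /* critical section */
          pthread_mutex_unlock(&lock);              /* release the lock */
      }
      return NULL;
  }

  int main(void) {
      pthread_t t[4];
      for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
      printf("counter = %ld\n", counter);           /* 400000 every run, thanks to the mutex */
      return 0;
  }

  Without the mutex the increments could interleave and the final value would be unpredictable; this is exactly the race condition the slide warns about.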
  • 34. Shared Memory Programming Model (Cont..)
Advantages:
• No need to explicitly specify the communication of data between tasks, so there is no notion of data "ownership" to implement; very advantageous for the programmer.
• All processes see and have equal access to shared memory.
• Program development can often be simplified.
Disadvantages:
• It becomes more difficult to understand and manage data locality: keeping data local to the process that works on it conserves memory accesses, cache refreshes and bus traffic, but controlling data locality is hard to understand and may be beyond the control of the average user.
Implementation:
• Case: stand-alone shared memory machines - native operating systems, compilers and/or hardware provide support for shared memory programming; e.g. the POSIX standard provides an API for using shared memory.
• Case: distributed memory machines - memory is physically distributed across a network of machines, but made global through specialized hardware and software.
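  A minimal sketch of process-level shared memory using the POSIX API mentioned above, offered as an illustration only (the object name "/demo_region" and the single shared counter are hypothetical; on older Linux systems link with -lrt):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      int fd = shm_open("/demo_region", O_CREAT | O_RDWR, 0600);  /* create a shared memory object */
      ftruncate(fd, sizeof(int));                                  /* size it to hold one int */
      int *counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      *counter = 0;

      if (fork() == 0) {            /* child process: same physical memory, asynchronous access */
          *counter += 1;            /* a real program would guard this with a semaphore */
          _exit(0);
      }
      wait(NULL);                   /* parent waits, then reads the child's update */
      printf("counter = %d\n", *counter);
      munmap(counter, sizeof(int));
      shm_unlink("/demo_region");
      return 0;
  }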
  • 35. • This is a type of shared memory programming. • Here, a single "heavy weight" process can have multiple "light weight", concurrent execution paths. • To understand this model, consider the execution of a main program a.out, scheduled to run by the native operating system. Thread Model  a.out starts by loading and acquiring all of the necessary system and user resources to run; this constitutes the "heavy weight" process.  a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.  Each thread has local data, but also shares the entire resources of a.out: it is "light weight" and benefits from a global memory view because it shares the memory space of a.out.  Synchronization is needed to ensure that no two threads update the same global address at the same time.
  • 36. • For the implementation, thread implementations commonly comprise:  A library of subroutines that are called from within parallel source code  A set of compiler directives embedded in either serial or parallel source code. Note: often, the programmer is responsible for determining the parallelism. • Unrelated standardization efforts have resulted in two very different implementations of threads: - POSIX Threads: specified by the IEEE POSIX 1003.1c standard (1995); C language only; part of Unix/Linux operating systems; very explicit parallelism that requires significant programmer attention to detail. - OpenMP (used for the tutorials in this course): industry standard, compiler-directive based, portable / multi-platform (including Unix and Windows), available in C/C++ and Fortran implementations; can be very easy and simple to use and provides for "incremental parallelism" (one can begin with serial code). Others include: Microsoft threads; Java and Python threads; CUDA threads for GPUs. Thread Model (Cont…)
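  As an illustration only (not part of the slides), a minimal OpenMP sketch of the heavy-weight/light-weight picture above; it assumes a compiler with OpenMP support (e.g. compiled with -fopenmp):

  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      printf("serial region: the heavy-weight process, one thread\n");

      #pragma omp parallel              /* fork a team of light-weight threads */
      {
          int id = omp_get_thread_num();            /* local (per-thread) data */
          printf("hello from thread %d of %d\n", id, omp_get_num_threads());
      }                                 /* implicit barrier: threads join here */

      printf("back to serial execution\n");
      return 0;
  }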
  • 37. Distributed Memory / Message Passing Model • In this model, a set of tasks use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines. Tasks exchange data through communication (sending and receiving messages), and data transfer usually requires cooperation between processes: a send must have a matching receive. For the implementation: • The programmer is responsible for determining all parallelism. • Message passing implementations usually comprise a library of subroutines that are embedded in source code. • MPI, the Message Passing Interface, is the "de facto" industry standard for message passing; the specification is available at http://www.mpi-forum.org/docs/.
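  A minimal message-passing sketch using MPI, offered as an illustration (it assumes an MPI implementation is installed; compile with mpicc and run with mpirun -np 2):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int value;
      if (rank == 0) {
          value = 42;                    /* data lives in rank 0's local memory */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {            /* cooperation: the send is matched by a receive */
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }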
  • 38. Data Parallel Model Can also be referred to as the Partitioned Global Address Space (PGAS) model. Here:  The address space is treated globally.  Most of the parallel work focuses on performing operations on a data set, typically organized into a common structure such as an array or cube.  A set of tasks works collectively on the same data structure, but each task works on a different partition of it.  Tasks perform the same operation on their partition of the work, for example "add 4 to every array element".  It can be implemented on shared memory architectures (the data structure is accessed through global memory) and on distributed memory architectures (the global data structure can be logically and/or physically split across tasks).
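  A sketch of the "add 4 to every array element" example above, using OpenMP purely for illustration (any data-parallel notation would do): each thread handles a different partition of the same array.

  #include <stdio.h>

  #define N 1000000

  int main(void) {
      static int a[N];                  /* the common data structure */

      #pragma omp parallel for          /* iterations are partitioned across threads */
      for (int i = 0; i < N; i++)
          a[i] += 4;                    /* same operation, different partition of the data */

      printf("a[0]=%d  a[N-1]=%d\n", a[0], a[N - 1]);
      return 0;
  }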
  • 39. Data Parallel Model (Cont…) For the implementation, there are various popular, and sometimes still developmental, parallel programming implementations based on the Data Parallel / PGAS model: - Coarray Fortran: compiler dependent (further reading: https://en.wikipedia.org/wiki/Coarray_Fortran) - Unified Parallel C (UPC): an extension to the C programming language for SPMD parallel programming (further reading: http://upc.lbl.gov/) - Global Arrays: a shared-memory-style programming environment in the context of distributed array data structures (further reading: https://en.wikipedia.org/wiki/Global_Arrays)
  • 40. Single Program Multiple Data (SPMD) / Multiple Program Multiple Data (MPMD) Both are "high level" programming models that can be built on top of any parallel programming model.
SPMD:
• SINGLE PROGRAM: all tasks execute their copy of the same program (threads, message passing, data parallel or hybrid) simultaneously.
• MULTIPLE DATA: all tasks may use different data.
• Tasks do not necessarily have to execute the entire program; each may execute only the portions it is designed to run.
MPMD:
• MULTIPLE PROGRAM: tasks may execute different programs (threads, message passing, data parallel or hybrid) simultaneously.
• MULTIPLE DATA: all tasks may use different data.
• May be better suited to certain types of problems (functional decomposition problems).
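  A minimal SPMD sketch, illustrative only and assuming MPI (run with mpirun -np 4): every process runs the same program but branches on its rank, so no task executes the entire program.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0)                    /* only rank 0 executes this branch */
          printf("rank 0: coordinating %d processes\n", size);
      else                              /* the other ranks execute this one */
          printf("rank %d: working on my own partition of the data\n", rank);

      MPI_Finalize();
      return 0;
  }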
  • 41. Conclusion • Parallel computer architectures contribute to achieving the maximum performance possible within the limits of the technology. • The diversity of parallel computer architectures makes the field challenging to learn and challenging to present. • Classification can be based on the number of concurrent instruction and data streams: Flynn (SISD, SIMD, MISD, MIMD). • Classification can also be based on how parallelism is achieved (data-parallel architectures, function-parallel architectures). • Classification can likewise focus on how processors communicate with memory (shared memory / tightly coupled systems, distributed memory / loosely coupled systems). • There must be a way to assess the performance of a parallel architecture; FLOPS and MIPS are units of measure for the numerical computing performance of a computer. • Parallelism is made possible by implementing adequate parallel programming models. • The simplest model appears to be the Shared Memory Programming Model. • SPMD and MPMD programming require mastery of the previous programming models for proper implementation. • How do we then design a parallel program for effective parallelism? See next chapter: Designing Parallel Programs and understanding the notions of Concurrency and Decomposition.
  • 42. Challenge your understanding 1- What is the difference between a parallel computer and parallel computing? 2- What do you understand by true data dependency and resource dependency? 3- Illustrate the notions of vertical waste and horizontal waste. 4- In your opinion, which of the design architectures can provide better performance? Use performance metrics to justify your arguments. 5- What is concurrent-read, concurrent-write (CRCW) PRAM? 6- This figure shows (a) bus-based interconnects with no local caches and (b) bus-based interconnects with local memory/caches. Explain the difference, focusing on: - the design architecture - the operation - the pros and cons. 7- Discuss Handler's classification of computer architectures compared to Flynn's and other classifications.
  • 43. Class Work: Group and Presentation • Purpose: demonstrate the conditions for detecting potential parallelism. "Parallel computing requires that the segments to be executed in parallel must be independent of each other. So, before exploiting parallelism, all the conditions of parallelism between the segments must be analyzed." Use Bernstein's conditions for the detection of parallelism to show when instructions i1, i2, …, in can be said to be parallelizable.
  • 44. REFERENCES 1. Xin Yuan, CIS4930/CDA5125: Parallel and Distributed Systems, retrieved from http://www.cs.fsu.edu/~xyuan/cda5125/index.html 2. EECC722 – Shaaban, lec #3, Fall 2000, 9-18-2000 3. Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#ModelsOverview, last modified 11/02/2020 4. J. Blazewicz et al., Handbook on Parallel and Distributed Processing, International Handbooks on Information Systems, Springer, 2000 5. Phillip J. Windley, Parallel Architectures, Lesson 6, CS462: Large Scale Distributed Systems, 2020 6. A. Grama et al., Introduction to Parallel Computing, Lecture 3
  • 45. END.