When Do We Need High Performance Computing?
► Case 1
To perform time-consuming operations in less time or before a tight deadline.
►I am a bioinformatics engineer.
►I need to run computationally complex programs.
►I’d rather have the result in 5 minutes than in 5 days.
►Case 2
To perform a high number of operations per second.
►I am an engineer at Amazon.com.
►My web server gets 1,000 hits per second.
►I’d like my web server and my databases to handle 1,000 transactions per second so that
customers do not experience bad delays.
Amazon does “process” several gigabytes of data per second.
What Does High Performance Computing Study?
► It includes the following subjects:
 Hardware
- Computer Architecture
- Network Connections
 Software
- Programming paradigms
- Languages
- Middleware
3
[Timeline: the growth of peak computing performance from one OPS to PetaOPS (1, 10^3, 10^6, 10^9, 10^12, 10^15 operations per second: KiloOPS, MegaOPS, GigaOPS, TeraOPS, PetaOPS), illustrated by representative machines:]
1823 Babbage Difference Engine
1943 Harvard Mark 1
1949 Edsac
1951 Univac 1
1959 IBM 7094
1964 CDC 6600
1976 Cray 1
1982 Cray XMP
1988 Cray YMP
1991 Intel Delta
1996 T3E
1997 ASCI Red
2001 Earth Simulator
2003 Cray X1
2006 BlueGene/L
2009 Cray XT5
INTERNATIONAL COMPETITION FOR HPC LEADERSHIP
Nations and global regions, including China, the United States, Japan, and
Russia, are racing ahead and have created national programs that are
investing large sums of money to develop exascale supercomputers.
5
Supercomputer: A computing system exhibiting high-end performance
capabilities and resource capacities within practical constraints of technology, cost,
power, and reliability. Thomas Sterling, 2007
Supercomputer: a large very fast mainframe used especially for scientific
computations. Merriam-Webster Online
Supercomputer: any of a class of extremely powerful computers. The term is
commonly applied to the fastest high-performance systems available at any given time.
Such computers are used primarily for scientific and engineering work requiring
exceedingly high-speed computations. Encyclopedia Britannica Online
Research Applications of HPCs
• Finance
• Sports and Entertainment
• Weather Forecasting
• Space Research
• Health-Care-Related Applications
• to unravel the morphology of cancer cells,
• to diagnose and treat cancers and improve the safety of cancer
treatments
• medical research,
• biomedicine,
• bioinformatics,
• epidemiology,
• Personalized medicine
(these health-care applications include ‘Big Data’ aspects)
THE GLOBAL HPC MARKET
Company Share of Global HPC Revenues
THE GLOBAL HPC MARKET
Countries Share of Global HPC Revenues
9
 Introduction - today’s lecture
 System Architectures (Single Instruction - Single Data, Single Instruction - Multiple
Data, Multiple Instruction - Multiple Data, Shared Memory, Distributed Memory,
Cluster, Multiple Instruction - Single Data)
 Performance Analysis of parallel calculations (the speedup, efficiency, time execution
of algorithm…)
 Parallel numerical methods (Principles of Parallel Algorithm Design, Analytical
Modeling of Parallel Programs, Matrices Operations, Matrix-Vector Operations, Graph
Algorithms…)
 Software (Programming Using Compute Unified Device Architecture)
In the first section we will discuss the importance of parallel computing to high
performance computing. We will show the basic concepts of parallel computing.
The advantages and disadvantages of parallel computing will be discussed. We
will present an overview of current and future trends in HPC hardware.
The second section will provide an introduction to parallel GPU implementations
of numerical methods, such as Matrix Operations, Matrix-Vector Operations, and
Graph Algorithms. The second section will also present the application of parallel
computing techniques using the Graphics Processing Unit (GPU) in order to improve
the computational efficiency of numerical methods.
The third section will briefly discuss important HPC topics such as computing
using graphics processing units (GPUs) with CUDA, running relatively simple
examples on this hardware. As tradition dictates, we will show how to write
"Hello World" in CUDA (a minimal sketch is shown below). Some computational
libraries available for HPC with GPUs will be highlighted.
10
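As a preview of the third section, a minimal CUDA “Hello World” could look like the sketch below; the file name and the 2 x 4 launch configuration are illustrative choices, not anything prescribed by the course:

```cuda
// hello.cu -- minimal CUDA "Hello World" sketch (compile with: nvcc hello.cu -o hello)
#include <cstdio>

// Kernel: every GPU thread prints its block and thread index.
__global__ void hello_kernel() {
    printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Launch a grid of 2 blocks, each with 4 threads (8 GPU threads in total).
    hello_kernel<<<2, 4>>>();
    // Wait for the GPU to finish so the printed lines appear before the program exits.
    cudaDeviceSynchronize();
    return 0;
}
```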
11
 What is traditional programming view?
 Why Use Parallel Computing?
 Motivation for parallelism (Moore’s law).
 An Overview of Parallel Processing
 Parallelism in Uniprocessor Systems
 Organization of Multiprocessors
 Flynn’s Classification
 System Topologies
 MIMD System Architectures
 Concepts and terminology.
Parallelism is a method to improve computer
system performance by executing two or more
instructions simultaneously.
 The goals of parallel processing.
 One goal is to reduce the “wall-clock” time or the
amount of real time that you need to wait for a
problem to be solved.
 Another goal is to solve bigger problems that
might not fit in the limited memory of a single
CPU.
12
Consider your favorite computational application
• One processor can give me results in N hours
• Why not use N processors -- and get the
results in just one hour?
13
Parallel computing: the use
of multiple computers or
processors working together
on a common task.
 Each processor works on
its section of the problem.
 Processors are allowed to
exchange information
with other processors.
15
Grid of a Problem to be
Solved
16
Flynn’s Classification (Taxonomy)
 Was proposed by researcher Michael J. Flynn in
1966.
 It is the most commonly accepted taxonomy of
computer organization.
 In this classification, computers are classified by
whether they process a single instruction at a time or
multiple instructions simultaneously, and whether they
operate on one or multiple data sets.
17
4 categories of Flynn’s classification of multiprocessor
systems by their instruction and data streams
Simple Diagrammatic Representation
18
 SISD machines execute a single instruction
on individual data values using a single
processor.
 Based on the traditional von Neumann
uniprocessor architecture, instructions are
executed sequentially or serially, one step
after the next.
 Until recently, most computers were of the
SISD type.
19
20
 An SIMD machine executes a single
instruction on multiple data values
simultaneously using many processors.
 Since there is only one instruction, each
processor does not have to fetch and
decode each instruction. Instead, a single
control unit does the fetch and decoding for
all processors.
 SIMD architectures include array processors.
21
22
 This category (MISD) does not actually exist in
practice. It was included in the taxonomy for
the sake of completeness.
23
 MIMD machines are usually referred to as
multiprocessors or multicomputers.
 They may execute multiple instructions
simultaneously, in contrast to SIMD machines.
 Each processor must include its own control
unit, and processors can be assigned parts of
a task or separate tasks.
 There are two subclasses: shared memory and
distributed memory.
24
Shared memory (UMA): all processors have equal
access to memory and can communicate via memory.
Distributed memory: processors only have access to
their local memory and “talk” to other processors
over a network.
Hybrid: shared-memory nodes connected by a network.
Hybrid Machines
•Add special-purpose processors to normal
processors
•Not a new concept, but regaining traction
Example: our NVIDIA Tesla node, CUDA
25
26
 Shared memory and distributed memory
machines have become mainstream.
 Manycore architectures: GPUs used for
computing (GPGPU).
 High performance computing is now almost
synonymous with parallel computing.
27
 Finally, the architecture of a MIMD system,
in contrast to its topology, refers to its
connections to its system memory.
 Systems may also be classified by their
architectures. Two of these are:
 Uniform memory access (UMA)
 Nonuniform memory access (NUMA)
28
 The UMA is a type of symmetric
multiprocessor, or SMP, that has two or more
processors that perform symmetric functions.
UMA gives all CPUs equal (uniform) access to
all memory locations in shared memory.
They interact with shared memory by some
communications mechanism like a simple bus
or a complex multistage interconnection
network.
29
[Diagram: Processors 1, 2, …, n connected to shared memory through a common communications mechanism]
30
 NUMA architectures, unlike UMA architectures,
do not allow uniform access to all shared
memory locations. This architecture still
allows all processors to access all shared
memory locations, but in a nonuniform way:
each processor can access its local shared
memory more quickly than the other memory
modules not next to it.
31
[Diagram: Processors 1, 2, …, n, each with its own local memory (Memory 1, 2, …, n), connected through a communications mechanism]
32
An analogy of Flynn’s classification is the
check-in desk at an airport
 SISD: a single desk
 SIMD: many desks and a supervisor with a
megaphone giving instructions that every desk
obeys
 MIMD: many desks working at their own pace,
synchronized through a central database
34
All parallel programs contain:
• Parallel sections
• Serial sections
• Serial sections are when work is being duplicated or no
useful work is being done (waiting for others)
• Serial sections limit the parallel effectiveness
• If you have a lot of serial computation then you will not
get good speedup
• No serial work “allows” perfect speedup
• Amdahl’s Law states this formally
35
 Amdahl’s Law places a strict limit on the speedup that can be
realized by using multiple processors.
 Effect of multiple processors on run time: t_N = t_1 (f_S + f_p / N)
• Effect of multiple processors on speedup: S(N) = 1 / (f_S + f_p / N)
• Where
• f_S = serial fraction of code
• f_p = parallel fraction of code (f_S + f_p = 1)
• N = number of processors
• Perfect speedup (f_S = 0): t_N = t_1 / N, or
S(N) = N
36
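To make the law concrete, the small host-side sketch below (plain C code, also valid inside a .cu file) evaluates the two formulas for an assumed serial fraction of 5%; the processor counts are arbitrary illustrative values:

```cuda
// amdahl.cu -- evaluate Amdahl's law for an assumed serial fraction f_s = 0.05.
#include <cstdio>

int main() {
    const double f_s = 0.05;                       // assumed serial fraction of the code
    const double f_p = 1.0 - f_s;                  // parallel fraction
    const int counts[] = {1, 2, 4, 8, 16, 64, 256, 1024};

    for (int i = 0; i < 8; ++i) {
        int N = counts[i];
        double speedup    = 1.0 / (f_s + f_p / N); // S(N) = 1 / (f_S + f_p/N)
        double efficiency = speedup / N;           // E(N) = S(N) / N
        printf("N = %4d  speedup = %6.2f  efficiency = %5.2f\n", N, speedup, efficiency);
    }
    // As N grows, the speedup approaches 1/f_s (here 20x) no matter how many processors are added.
    return 0;
}
```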
37
 Amdahl’s Law provides a theoretical upper limit
on parallel speedup assuming that there are no
costs for communications.
 In reality, communications will result in a
further degradation of performance
38
Writing effective parallel applications is difficult
• Communication can limit parallel efficiency
• Serial time can dominate
• Load balance is important
Is it worth your time to rewrite your application
• Do the CPU requirements justify
parallelization?
• Will the code be used just once?
39
Superlinear speedup, S(n) > n,
may be seen on occasion, but usually this is due to using a
suboptimal sequential algorithm or some unique feature of the
architecture that favors the parallel formation.
One common reason for superlinear speedup is the extra cache
in the multiprocessor system, which can hold more of the
problem data at any instant; this leads to less relatively slow
memory traffic.
42
Efficiency = execution time using one processor divided by
(the number of processors × the execution time using that
number of processors): E(n) = t_1 / (n × t_n).
It is just the speedup divided by the number of
processors: E(n) = S(n) / n.
Used to indicate a hardware design that allows the system to be
increased in size and in doing so to obtain increased
performance - could be described as architecture or hardware
scalability.
Scalability is also used to indicate that a parallel algorithm can
accommodate increased data items with a low and bounded
increase in computational steps - could be described as
algorithmic scalability.
44
Problem size: the number of basic steps in the
best sequential algorithm for a given problem and
data set size
•Intuitively, we would think of the number of data elements
being processed in the algorithm as a measure of size.
•However, doubling the data set size would not necessarily
double the number of computational steps. It will depend upon
the problem.
•For example, adding two matrices has this effect, but
multiplying matrices quadruples operations.
Note: bad sequential algorithms tend to scale well.
45
• Latency
• How long to get between nodes in the
network.
• Bandwidth
• How much data can be moved per unit time.
• Bandwidth is limited by the number of wires, the rate
at which each wire can accept data, and by choke points.
46
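A simple first-order cost model often used for such networks combines the two quantities: the time to move a message of s bytes is roughly latency + s / bandwidth. The sketch below evaluates this model; the latency and bandwidth figures are assumed, illustrative numbers, not measurements of any particular machine:

```cuda
// message_cost.cu -- first-order model of message transfer time: T(s) = latency + s / bandwidth.
#include <cstdio>

int main() {
    const double latency   = 2.0e-6;              // assumed per-message latency: 2 microseconds
    const double bandwidth = 10.0e9;              // assumed link bandwidth: 10 GB/s
    const double sizes[] = {8, 1e3, 1e6, 1e9};    // message sizes in bytes

    for (int i = 0; i < 4; ++i) {
        double t = latency + sizes[i] / bandwidth;
        printf("size = %10.0f B   time = %.6e s\n", sizes[i], t);
    }
    // Small messages are dominated by latency, large messages by bandwidth.
    return 0;
}
```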
 For ultimate performance you may be
concerned how your nodes are connected.
 Avoid communications between distant nodes.
 For some machines it might be difficult to
control or know the placement of applications.
47
48
Topologies
 A system may also be classified by its topology.
 A topology is the pattern of connections between
processors.
 The cost-performance tradeoff determines which
topologies to use for a multiprocessor system.
49
A topology is characterized by its diameter,
total bandwidth, and bisection bandwidth
◦ Diameter – the maximum distance between two
processors in the computer system.
◦ Total bandwidth – the capacity of a
communications link multiplied by the number of
such links in the system.
◦ Bisection bandwidth – represents the maximum
data transfer that could occur at the bottleneck in
the topology.
50
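For a concrete feel for these three metrics, the sketch below works them out for a simple ring of n processors, assuming every link has the same capacity; both n and the link bandwidth are arbitrary illustrative values:

```cuda
// ring_metrics.cu -- diameter, total bandwidth and bisection bandwidth of a ring of n processors.
#include <cstdio>

int main() {
    const int    n       = 16;       // number of processors in the ring (assumed)
    const double link_bw = 1.0e9;    // capacity of one link in bytes/s (assumed)

    int    diameter  = n / 2;        // the farthest pair of processors is halfway around the ring
    double total_bw  = n * link_bw;  // the ring contains n links in total
    double bisect_bw = 2 * link_bw;  // cutting the ring into two halves severs exactly 2 links

    printf("ring of %d processors:\n", n);
    printf("  diameter            = %d hops\n", diameter);
    printf("  total bandwidth     = %.1e B/s\n", total_bw);
    printf("  bisection bandwidth = %.1e B/s\n", bisect_bw);
    return 0;
}
```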
51
Network Topologies
 Shared Bus Topology
◦ Processors communicate with each other via a single
bus that can only handle one data transmission at a time.
◦ In most shared buses, processors directly communicate
with their own local memory.
[Diagram: processors (P), each with local memory (M), attached to a single shared bus together with a global memory]
52
 Ring Topology
◦ Uses direct connections between processors
instead of a shared bus.
◦ Allows communication links to be active
simultaneously, but data may have to travel
through several processors to reach its destination.
[Diagram: six processors connected in a ring]
53
 Tree Topology
◦ Uses direct connections between processors,
each having three connections.
◦ There is only one unique path between any
pair of processors.
[Diagram: seven processors connected in a tree]
54
 Mesh Topology
◦ In the mesh topology, every processor connects
to the processors above and below it, and to its
right and left.
[Diagram: processors arranged in a two-dimensional mesh]
55
 Hypercube Topology
◦ Is a multiple mesh topology.
◦ Each processor connects to all other processors
whose binary values differ by one bit. For example,
processor 0 (0000) connects to 1 (0001) or 2 (0010).
[Diagram: sixteen processors connected as a four-dimensional hypercube]
56
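The “differ by one bit” rule makes the neighbours of each node easy to enumerate: flipping each address bit in turn (an XOR with a power of two) yields one neighbour. A small sketch for an assumed 4-dimensional (16-node) hypercube:

```cuda
// hypercube_neighbors.cu -- list the neighbours of every node in a d-dimensional hypercube.
#include <cstdio>

int main() {
    const int d = 4;                // dimension: 2^4 = 16 processors (assumed)
    const int nodes = 1 << d;

    for (int p = 0; p < nodes; ++p) {
        printf("node %2d connects to:", p);
        for (int bit = 0; bit < d; ++bit)
            printf(" %2d", p ^ (1 << bit));   // flipping one address bit gives one neighbour
        printf("\n");
    }
    // Example: node 0 (0000) connects to 1 (0001), 2 (0010), 4 (0100) and 8 (1000).
    return 0;
}
```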
 Completely Connected Topology
 Every processor has n-1 connections, one to
each of the other processors.
 There is an increase in complexity as the system
grows, but this offers maximum communication
capabilities.
[Diagram: eight processors, each directly connected to every other processor]
57
Moore's Law describes a long-term trend in the history of
computing hardware, in which the number of transistors that
can be placed inexpensively on an integrated circuit has
doubled approximately every two years.
 Mechanical Computing
◦ Babbage, Hollerith, Aiken
 Electronic Digital Calculating
◦ Atanasoff, Eckert, Mauchly
 von Neumann Architecture
◦ Turing, von Neumann, Eckert, Mauchly, Foster, Wilkes
 Semiconductor Technologies
 Birth of the Supercomputer
◦ Cray, Watanabe
 The Golden Age
◦ Batcher, Dennis, S. Chen, Hillis, Dally, Blank, B. Smith
 Common Era of Killer Micros
◦ Scott, Culler, Sterling/Becker, Goodhue, A. Chen, Tomkins
 Petaflops
◦ Messina, Sterling, Stevens, P. Smith,
58
59
Historical Machines
• Leibniz Stepped Reckoner
• Babbage Difference Engine
• Hollerith Tabulator
• Harvard Mark 1
• Un. of Pennsylvania Eniac
• Cambridge Edsac
• MIT Whirlwind
• Cray 1
• TMC CM-2
• Intel Touchstone Delta
• Beowulf
• IBM Blue Gene/L
 ENIAC: Eckert and Mauchly, 1946.
 Vacuum tubes.
 Numerical solutions
to problems in fields
such as atomic
energy and ballistic
trajectories.
60
 EDSAC: Maurice Wilkes, 1949.
 Mercury delay lines for
memory and vacuum
tubes for logic.
 Used one of the first
assemblers called
Initial Orders.
 Calculation of prime
numbers, solutions of
algebraic equations,
etc.
61
 Whirlwind (MIT): Jay Forrester, 1949.
 Fastest computer of its time.
 First computer to use
magnetic core
memory.
 Displayed real time
text and graphics on
a large oscilloscope
screen.
62
 Cray-1: Cray Research, 1976.
 Pipelined vector
arithmetic units.
 Unique C-shape to
help increase the
signal speeds from
one end to the other.
63
 CM-2: Thinking Machines Corporation, 1987.
 Hypercube
architecture with
65,536 processors.
 SIMD.
 Performance in the
range of GFLOPS.
64
 Touchstone Delta: INTEL, 1990.
 MIMD hypercube.
 LINPACK rating of
13.9 GFLOPS .
 Enough computing
power for
applications like real-
time processing of
satellite images and
molecular models for
AIDS research.
65
 Beowulf: Thomas Sterling and Donald Becker, 1994.
 Cluster formed of one
head node and
one/more compute
nodes.
 Nodes and network
dedicated to the
Beowulf.
 Compute nodes are
mass produced
commodities.
 Use open source
software including
Linux.
66
 Earth Simulator: Japan, 1997.
 Fastest
supercomputer from
2002-2004: 35.86
TFLOPS.
 640 nodes with eight
vector processors
and 16 gigabytes of
computer memory at
each node.
67
 Blue Gene/L: IBM, 2004.
 First supercomputer
ever to run over 100
TFLOPS sustained
on a real world
application, namely a
three-dimensional
molecular dynamics
code (ddcMD).
68
 1975 – 1992
 Vector
◦ Cray-1&2, NEC SX,
Fujitsu VPP
 SIMD
◦ Maspar, CM-2
 Systolic
◦ Warp
 Dataflow
◦ Manchester, Sigma,
Monsoon
 Multithreaded
◦ HEP, MTA
 Actor-based
◦ J-Machine
69
1976
Cray 1
 1992 to present
 Killer Micro and mass market
PCs
 High density DRAM
 High cost of fab lines
 CSP
◦ Message passing
 Economy of scale S-curve
 MPP
 Weak scaling
◦ Gustafson et al
 Beowulf, NOW Clusters
 MPI
 Ethernet, Myrinet
 Linux
70
 Automated calculating
◦ 17th century
 Stored program digital electronic
◦ 1948
 Vector
◦ 1975
 SIMD
◦ 1980s
 MPPs
◦ 1991
 Commodity Clusters
◦ 1993/4
 Multicore
◦ 2006
71
 Hybrid cluster solutions & services that fully leverage the performance
of accelerators
Demand for computing power is growing steadily, as scientists &
engineers seek to tackle increasingly complex problems. The emergence
of multi-core CPUs has made it possible to keep pace with their demands, but
energy consumption, space, & cooling have become major inhibitors to
computing systems expansion. Hence the success of acceleration
technologies such as GPGPUs (General-Purpose Graphics Processing
Units), which offer both breakthrough performance & outstanding space &
energy efficiency.
GPGPUs can accelerate processing by a factor of 1 to 100!
72
 Starvation
◦ Not enough work to do due to insufficient parallelism or
poor load balancing among distributed resources
 Latency
◦ Waiting for access to memory or other parts of the system
 Overhead
◦ Extra work that has to be done to manage program
concurrency and parallel resources, beyond the real work you
want to perform
 Waiting for Contention
◦ Delays due to fighting over what task gets to use a shared
resource next. Network bandwidth is a major constraint.
73
Lecture 2-3/10
74
Programming of Graphics Processors
Olesia Barkovskaya, KHTURE, 2018
Electronic Computers Department
Simply put, in the architectural sense, a CPU is composed of a few large
Arithmetic Logic Unit (ALU) cores for general-purpose processing, with lots
of cache memory and one large control module that can handle a few
software threads at a time. The CPU is optimized for serial operations since its
clock rate is very high. The GPU, on the other hand, has many small ALUs,
small control modules and small caches. The GPU is optimized for parallel
operations.
75
A simple way to understand the difference between a GPU and a CPU is to compare how
they process tasks. A CPU consists of a few cores optimized for sequential serial processing
while a GPU has a massively parallel architecture consisting of thousands of smaller, more
efficient cores designed for handling multiple tasks simultaneously.
GPUs have thousands of cores to process parallel workloads efficiently
76
Environments
 CUDA: More mature, bigger ‘ecosystem’, NVIDIA
only
 OpenCL: Vendor-independent, open industry
standard
 Interfaces to C/C++, Fortran, Python, .NET, . . .
 Important: Hardware abstraction and
‘expressiveness’ are identical
77
Fourier Transforms
 CUFFT: NVIDIA, part of the CUDA toolkit
 APPML (formerly ACML-GPU): AMD Accelerated
Parallel Processing Math Libraries
Dense linear algebra
 CUBLAS: NVIDIA’s basic linear algebra subprograms
 APPML (formerly ACML-GPU): AMD Accelerated
Parallel Processing Math Libraries
 CULA: Third-party LAPACK, matrix decompositions
and eigenvalue problems
 MAGMA and PLASMA: BLAS/LAPACK for multicore
and manycore (ICL, Tennessee)
78
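To give a flavour of how such libraries are used, the sketch below performs a single-precision SAXPY (y = alpha·x + y) with cuBLAS; error checking is omitted for brevity and the vector contents are arbitrary illustrative values:

```cuda
// saxpy_cublas.cu -- y = alpha * x + y with cuBLAS (compile with: nvcc saxpy_cublas.cu -lcublas)
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;
    const float alpha = 2.0f;
    float x[n] = {1, 2, 3, 4};
    float y[n] = {10, 20, 30, 40};

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);   // copy host vectors to the device
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);    // d_y = alpha * d_x + d_y on the GPU
    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);   // copy the result back to the host
    cublasDestroy(handle);

    for (int i = 0; i < n; ++i)
        printf("y[%d] = %g\n", i, y[i]);               // expected: 12 24 36 48
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```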
Ten years ago, when GPUs were first used to perform general-purpose computation, they were programmed using
low-level mechanisms such as the interrupt services of the BIOS, or by using graphics APIs such as OpenGL and
DirectX [16]. Later, the programs for GPUs were developed in assembly language for each card model, and they had
very limited portability. So, high-level languages were developed to fully exploit the capabilities of the GPUs. In
2007, NVIDIA introduced CUDA [25], a software architecture for managing the GPU as a parallel computing device
without requiring the data and the computation to be mapped onto a graphics API. CUDA is based on an extension of the C
language, and it is available for graphics cards of the GeForce 8 series and above, using the 32- and 64-bit versions of
the Linux and Windows (XP and successors) operating systems. Three software layers are used in CUDA to
communicate with the GPU (see Figure 1): a low-level hardware driver that performs the data communications
between the CPU and the GPU, a high-level API, and a set of libraries that includes CUBLAS for linear algebra
calculations and CUFFT for Fourier transform calculations.

For the CUDA programmer, the GPU is a computing
device which is able to execute a large number of threads in parallel. A specific procedure to be executed many times
over different data can be isolated in a GPU function using many execution threads. The function is compiled using a
specific set of instructions, and the resulting program, named a kernel, is loaded on the GPU. The GPU has its own DRAM,
and the data are copied from the DRAM of the GPU to the RAM of the host (and vice versa) using optimized calls to
the CUDA API. The CUDA architecture is built around a scalable array of multiprocessors, each of them having
eight scalar processors, one multithreading unit, and a shared memory chip. The multiprocessors are able to create,
manage, and execute parallel threads with reduced overhead. The threads are grouped in blocks (with up to 512
threads), which are executed on a single multiprocessor of the graphics card, and the blocks are grouped in grids.
Each time a CUDA program calls a grid to be executed on the GPU, each of the blocks in the grid is
numbered and distributed to an available multiprocessor.

When a multiprocessor receives one (or more) blocks to execute, it splits the threads into warps, sets of 32
consecutive threads. Each warp executes a single instruction at a time, so the best efficiency is achieved when the 32
threads in the warp execute the same instruction. Otherwise, the warp serializes the threads. Each time a
block finishes its execution, a new block is assigned to the available multiprocessor. The threads are able to access
the data using three memory spaces: the shared memory of the block, which can be used by the threads in the
block; the local memory of the thread; and the global memory of the GPU. Minimizing the accesses to the slower
memory spaces (the local memory of the thread and the global memory of the GPU) is very important to
achieve efficiency in GPU programming. On the other hand, the shared memory is placed within the GPU chip, so it
provides a faster way to store the data.
79
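The thread/block/grid organisation described above can be made concrete with a small vector-addition kernel. The sketch below uses one GPU thread per element; the block size of 256 and the use of unified (managed) memory are illustrative choices, not requirements:

```cuda
// vector_add.cu -- one GPU thread per element, organised in blocks and a grid.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
    if (i < n) c[i] = a[i] + b[i];                   // guard against the last, partially filled block
}

int main() {
    const int n = 1 << 20;                           // about one million elements
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);                    // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                                   // threads per block (an assumption)
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vec_add<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);     // launch the grid of blocks
    cudaDeviceSynchronize();                                     // wait for the kernel to finish

    printf("c[0] = %g, c[n-1] = %g\n", c[0], c[n - 1]);          // both should be 3
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```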