This deck introduces developments in multi-core computers and their architectures, explains high-performance computing, where these processors are used, and ends with an introduction to OpenMP, including many ready-to-run parallel programs.
2. Index
What is Computer Architecture
What Are Computer Components
Classification of Computers
The Dawn of Multi-Core (Stream) processors.
Symmetric Multi-Core (SMP)
Heterogeneous Multi-Core Processors
GPU (Graphic Processing Unit)
Intel Family of Multi-Core Processors
AMD Family of Multi-Core Processors
Others
Multi-Threading, Hyper-Threading, Multi-Core
OpenMP Programming Paradigm
3. Concepts
What is Computer Architecture
[Layered diagram: Application, Operating System, Compiler, and Firmware sit above the Instruction Set Architecture; below it lie the Instruction Set Processor and I/O system, Datapath & Control, Digital Design, Circuit Design, and Layout.]
4. Evolution of Computer Architecture
[Diagram: evolution from scalar processors (pre-execution sequencing) through I/E overlap, functional parallelism, multiple functional units, and pipelining, to pseudo-vector and obvious-vector machines (memory-memory and register-register), multiprocessing, multi-computers, array processors, SIMD, MIMD, and finally multi-core.]
10. Amdahl’s Law
states that the performance improvement to be gained from
using some faster mode of execution is limited by the fraction
of the time the faster mode can be used.
11. Amdahl’s Law
With serial fraction S and parallel fraction P = 1 – S:
Speedup = 1 / (S + (1 – S)/N)
Where N = number of parallel processors
Example: S = 0.6, N = 10, Speedup = 1.56
S = 0.6, N = ∞, Speedup = 1.67
Gene Amdahl, “Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities,” AFIPS Conference Proceedings,
(30), pp. 483-485, 1967.
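To make the arithmetic above concrete, here is a small sketch in C (the function name amdahl_speedup is mine, not from the slides):

#include <stdio.h>

/* Speedup = 1 / (S + (1 - S)/N), with serial fraction S
   and N parallel processors. */
double amdahl_speedup(double S, double N) {
    return 1.0 / (S + (1.0 - S) / N);
}

int main(void) {
    printf("S = 0.6, N = 10:   %.2f\n", amdahl_speedup(0.6, 10.0));  /* 1.56 */
    printf("S = 0.6, N -> inf: %.2f\n", amdahl_speedup(0.6, 1e12)); /* 1.67 */
    return 0;
}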
12. CPU
The Central Processing Unit (CPU), also known as the microprocessor (µP)
The brain of the computer
Its task is to execute instructions
It keeps on executing instructions from the moment it is
powered-on to the moment it is powered-off.
Execution of various instructions causes the computer to
perform various tasks.
There is a wide range of CPUs
The dominant CPUs on the market today are from
Intel: Pentium (Desktop), Xeon (Server), Pentium-M (Mobile), XScale (Embedded)
AMD: Athlon (Desktop), Opteron (Server), Turion (Mobile), Geode (Embedded)
13. Key Characteristics of a CPU
Several metrics are used to describe the characteristics of a CPU
Native “Word” size
32-bit or 64-bit
Clock speeds
The clock speed of a CPU defines the rate at which the internal clock of the CPU operates
The clock sequences operations and determines the speed at which the CPU executes instructions
Modern CPUs operate in gigahertz (1 GHz = 10^9 Hz)
However, clock speeds alone do not determine the overall ability of a CPU
Instruction set
More versatile CPUs support a richer and more efficient instruction set
Modern CPUs have new instruction sets such as SSE, SSE2, and 3DNow that can be used to boost performance of many common applications
These instructions perform the operation of several instructions but take less time
FLOPS: Floating Point Operations per Second
FLOPS are a better metric for measuring a CPU’s computational capabilities
Modern CPUs deliver 2-3 GFLOPS (1 GFLOPS = 10^9 FLOPS)
FLOPS is also used as a metric for describing the computational capabilities of complete computer systems
Modern supercomputers deliver 1 teraflop (TFLOPS), 1 TFLOPS = 10^12 FLOPS.
14. Key Characteristics of a CPU (Contd.)
Several metrics are used to describe the characteristics of a CPU
Power consumption
Power is an important factor in today’s computing
Power is the product of voltage applied to the CPU (V) and the amount of current
drawn by the CPU (I)
The unit for voltage is volts
The unit for current is ampere (aka amps)
Power = V x I
Power is typically represented in Watts
Modern desktop processors consume anywhere from 35 to 100 watts
Lower power is better as power is proportional to heat
More power implies the processor generates more heat and heating is a big problem
Number of computational units (cores) per CPU
Some CPUs have multiple cores
Each core is an independent computational unit and can execute instructions in
parallel
The more cores a CPU has, the better it is for multi-threaded applications
Single-threaded applications typically experience a slowdown on multi-core processors due to reduced clock speeds
15. Key Characteristics of a CPU (Contd.)
Several metrics are used to describe the characteristics of a CPU
Cache size and configuration
Cache is a small, but high speed memory that is fabricated along with the CPU
Size of cache is inversely proportional to its speed
Cost of the CPU increases as size of cache increases
It is much faster than RAM
It is used to minimize the overall latency of accessing RAM
Microprocessors have a hierarchy of caches
L1 cache: Fastest and closest to the core components
L3 cache: Relatively slower and further away from CPU
Example cache configurations
Quad core AMD Opteron (Shanghai)
32 KB (Data) + 32 KB (Instr.) L1 cache per core
Unified 512 KB L2 cache per core
Unified 6 MB shared L3 cache (for 4 cores)
Quad core Intel Xeon (Nehalem)
32 KB (Data) + 32 KB (Instr.) L1 Cache per core
Unified 256 KB L2 cache per core
Unified 8 MB shared L3 cache (for 4 cores)
16. Trends in computing
Until recently, hardware performance
improvements have been primarily achieved due
to advancement in microprocessor fabrication
technologies:
Steady improvement in processor clock speeds
Faster clocks (within the same family of processors)
provide higher FLOPS
Increase in number of transistors on-chip
More complex and sophisticated hardware to improve
performance
Larger caches to provide rapid access to instruction and
data
17. Moore’s Law
The steady advancement in microprocessor technology was
predicted by Gordon Moore (in 1965), co-founder of Intel®
Moore’s law states that the number of transistors on
microprocessors will double approximately every two years.
Many advancements in digital technologies can be linked to Moore’s
law. This includes:
Processing speed
Memory Capacity
Speed and bandwidth of data communication networks and
Resolution of monitors and digital cameras
Thus far, Moore’s law has steadily held true for about 40 years
(from 1965 to about 2005)
Breakthroughs in the miniaturization of transistors have been the key enabling technology
18. Moore’s Law vs. Intel’s roadmap
Here is a graph illustrating the progress of Moore’s law based on Intel’s technology roadmap (obtained from Wikipedia)
19. Stagnation of Moore’s Law
In the past few years we have reached the fundamental limits of IC fabrication technology (particularly lithography and interconnect)
It is no longer feasible to further miniaturize the transistors on the IC
Features are already only several atoms across, and at that scale the laws of physics change, making further shrinkage extremely challenging
Heat dissipation has reached its breakdown threshold
Heat is generated as a part of regular transistor operation
Higher heat dissipation will cause the transistors to fail
With the current state of the art, a single processor cannot yield much more than 4 to 5 GFLOPS
How do we move beyond this barrier?
21. ISSCC, Feb. 2001, Keynote
“Ten years from now,
microprocessors will run at
10GHz to 30GHz and be capable
of processing 1 trillion operations
per second -- about the same
number of calculations that the
world's fastest supercomputer
can perform now.
“Unfortunately, if nothing
changes these chips will produce
as much heat, for their
proportional size, as a nuclear
reactor. . . .”
Patrick P. Gelsinger
Senior Vice President
General Manager
Digital Enterprise Group
INTEL CORP.
22. Intel View
Soon: more than a billion transistors integrated
Clock frequency can still increase
Future applications will demand TIPS (trillions of instructions per second)
Power? Heat?
24. Multi-core and Multi-processors
The solution to increasing effective compute power is the use of multi-core, multi-processor computer systems along with suitable software
This is a paradigm shift in hardware and software
technologies
Multiple cores and multiple processors are
interconnected using a variety of high speed data
communication networks
Software plays a central role in harnessing the power of
multi-core/multi-processor systems
Most industry leaders believe this is the near future of
computing!
25. Everything is to be GREEN NOW!
Green Buildings
Green-IT
26. Multi-core Trends
Multi-core processors are most definitely the future of computing
Both Intel and AMD are pushing for larger number of cores per CPU
package
The Cell Broadband Engine (aka Cell) has 8 synergistic processing
elements (SPE)
The Sun Microsystems Niagara has 8 cores, with each core capable of
running 8 threads.
27. Sun Niagara
32-way parallelism typically delivered a 23x performance improvement
Sun Niagara
[Chart: GFlop/s across benchmarks (Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Har, QCD, FEM-Ship, Econom, Epidem, FEM-Accel, Circuit, Webbase, LP, Median), y-axis 0.00 to 1.20, comparing 8 Cores x 4 Threads (naïve) against 1 Core x 1 Thread (naïve).]
29. SINGLE CORE VS MULTI-CORE (8): A COMPARISON FROM GEORGIA PACKING INSTITUTE
34. Hierarchy of Modular Building Blocks
[Diagram: hierarchy of modular building blocks; a chip contains cores with cache, memory controller, and I/O attach on an interconnect fabric; chips form an SMP board with memory via the SMP interconnect; boards form racks joined by high-speed networks.]
Core: each core will support multiple HW threads sharing a single cache, exhibiting SMP characteristics
Chip: homogeneous SMP on chip (2-32 HW contexts on chip, various forms of resource sharing); or a heterogeneous collection of processors on chip (heterogeneity at the data and control flow level); or main processor(s) with accelerator(s) (master-slave relationship between entities)
Board: homogeneous SMP on board (2-128 HW contexts on board); hierarchical SMP servers with NUMA (non-uniform memory access) characteristics
Rack: grid/cluster
Systems will increasingly need to implement a hybrid execution model
New programming systems need to reduce the need for programmer awareness of the topology on which their program executes
The next-gen programming system must support programming simplicity while leveraging the performance of the underlying HW topology.
36. Architecture trends
Several processor cores on a chip and specialized computing
engines
XML processing, cryptography, graphics
Questions:
how to interconnect large number of processor cores
how to provide sufficient memory bandwidth
how to structure the multilevel caching subsystem
how to balance the general purpose computing resources with
specialized processing engines and all the supporting memory,
caching and interconnect structure, given a constant power
budget
Software development processes
how to program for multicore architectures
how to test and evaluate the performance of multithreaded
applications
38. Key Features: Dual Core
Two physical cores in a package
Each with its own execution
resources
Each with its own L1 cache
32K instruction and 32K data
8-way set associative; 64-byte line
Both cores share the L2 cache
2MB 8-way set associative; 64-byte line size
10 clock cycles latency; Write Back update policy
Two Actual Processor Cores
[Diagram: each core has its own EXE core, FP unit, and L1 cache; both share the L2 cache and the system bus (667 MHz, 5333 MB/s). Truly parallel multitasking and multi-threaded execution.]
39. Key Features: Smart Cache
Shared between the two cores
Advanced Transfer Cache architecture
Reduced bus traffic
Both cores have full access to the entire cache
Dynamic Cache sizing
[Diagram: Core1 and Core2 share the 2 MB L2 cache, which connects to the bus. Enables greater system responsiveness.]
52. Symmetric Multi-Core Processors
Phenom X4
53. Symmetric Multi-Core Processors
UltraSPARC
54. Future of multi-cores
Moore’s Law predicts that the number of cores
will double every 18 - 24 months
2007 - 8 cores on a chip
(IBM Cell has 9 and Sun has 8)
2009 - 16 cores
2013 - 64 cores
2015 - 128 cores
2021 - 1k cores
Exploiting TLP has the potential to close Moore’s gap
Real performance has the potential to scale with number of
cores
55. IBM Cell Example: Will this be the trend for large multi-core designs?
Cell Processor Components
• Power Processor Element (PPE)
• Element Interconnect Bus (EIB)
• 8 Synergistic Processing Elements (SPE)
• Rambus XDR memory controller
• Flex I/O
[Diagram: the PPE (PowerPC core with 512K L2 cache) and eight SPEs connected by the 4 x 128-bit Element Interconnect Bus, with Rambus XDR memory and Flex I/O interfaces.]
56. Obvious silicon real-estate advantage for software-memory-managed processors
[Die-photo comparison: IBM, AMD, and Intel processors alongside the Cell BE.]
58. Instruction-Level Parallelism (ILP)
Extract parallelism in a single program
Superscalar processors have multiple execution
units working in parallel
Challenge to find enough instructions that can be
executed concurrently
Out-of-order execution => instructions are sent
to execution units based on instruction
dependencies rather than program order
59. Performance Beyond ILP
Much higher natural parallelism in some applications
Database, web servers, or scientific codes
Explicit Thread-Level Parallelism
Thread: has own instructions and data
May be part of a parallel program or independent programs
Each thread has all state (instructions, data, PC, register state,
and so on) needed to execute
Multithreading: Thread-Level Parallelism within a processor
60. Thread-Level Parallelism (TLP)
ILP exploits implicit parallel operations within loop or
straight-line code segment
TLP is explicitly represented by multiple threads of
execution that are inherently parallel
Goal: Use multiple instruction streams to improve
Throughput of computers that run many programs
Execution time of multi-threaded programs
TLP could be more cost-effective to exploit than ILP
61. Fine-Grained Multithreading
Switches between threads on each instruction, interleaving
execution of multiple threads
Usually done round-robin, skipping stalled threads
CPU must be able to switch threads on every clock cycle
Pro: Hide latency of both short and long stalls
Instructions from other threads always available to execute
Easy to insert on short stalls
Con: Slow down execution of individual threads
Thread ready to execute without stalls will be delayed by instructions
from other threads
Used on Sun’s Niagara
62. Coarse-Grained Multithreading
Switches threads only on costly stalls
e.g., L2 cache misses
Pro: No switching each clock cycle
Relieves need to have very fast thread switching
No slow down for ready-to-go threads
Other threads only issue instructions when the main one would stall (for
long time) anyway
Con: Limitation in hiding shorter stalls
Pipeline must be emptied or frozen on stall, since CPU issues
instructions from only one thread
New thread must fill pipe before instructions can complete
Thus, better for reducing penalty of high-cost stalls where pipeline
refill << stall time
Used in IBM AS/400
65. Simultaneous Multithreading (SMT)
Insight that dynamically scheduled processor already has many
HW mechanisms to support multithreading
Large set of virtual registers that can be used to hold register sets for
independent threads
Register renaming provides unique register identifiers
Instructions from multiple threads can be mixed in data path
Without confusing sources and destinations across threads!
Out-of-order completion allows the threads to execute out of order,
and get better utilization of the HW
Just add per-thread renaming table and keep separate PCs
Independent commitment can be supported via separate reorder
buffer for each thread
66. SMT Pipeline
[Pipeline diagram: Fetch -> Decode/Map -> Queue -> Reg Read -> Execute -> Dcache/Store Buffer -> Reg Write -> Retire, fed by the Icache and PCs, with a register map and register files, and a Dcache on the memory side. Data from Compaq.]
67. Power 4
Single-threaded predecessor to Power 5. Eight execution units in the out-of-order engine; each can issue an instruction each cycle.
68. Power 4 vs. Power 5
[Diagram: Power 5 adds 2 fetch units (PCs), 2 initial decodes, and 2 commits (architected register sets).]
70. Questions To Be Asked While Migrating to Multi-Core
71. Application performance
1. Does the concurrency platform allow me to measure the parallelism I’ve exposed in my application?
2. Does the concurrency platform address response-time bottlenecks, or just offer more throughput?
3. Does application performance scale up linearly as cores are added, or does it quickly reach diminishing returns?
4. Is my multicore-enabled code just as fast as my original serial code when run on a single processor?
5. Does the concurrency platform's scheduler load-balance irregular applications efficiently to achieve full utilization?
6. Will my application “play nicely” with other jobs on the system, or do multiple jobs cause thrashing of resources?
7. What tools are available for detecting multicore performance bottlenecks?
72. Software reliability
8. How much harder is it to debug my multicore-enabled application than to debug my original application?
9. Can I use my standard, familiar debugging tools?
10. Are there effective debugging tools to identify and localize parallel-programming errors, such as data-race bugs?
11. Must I use a parallel debugger even if I make an ordinary serial programming error?
12. What changes must I make to my release-engineering processes to ensure that my delivered software is reliable?
13. Can I use my existing unit tests and regression tests?
73. Development time
14. To multicore-enable my application, how much logical restructuring of my application must I do?
15. Can I easily train programmers to use the concurrency platform?
16. Can I maintain just one code base, or must I maintain separate serial and parallel versions?
17. Can I avoid rewriting my application every time a new processor generation increases the core count?
18. Can I easily multicore-enable ill-structured and irregular code, or is the concurrency platform limited to data-parallel applications?
19. Does the concurrency platform properly support modern programming paradigms, such as objects, templates, and exceptions?
20. What does it take to handle global variables in my application?
75. What Is OpenMP*?
Compiler directives for multithreaded
programming
Easy to create threaded Fortran and C/C++
codes
Supports data parallelism model
Incremental parallelism
Combines serial and parallel code in single source
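As a minimal sketch of this incremental style (array contents and size are illustrative; compile with an OpenMP flag such as gcc's -fopenmp), the serial loop below becomes parallel with a single pragma:

#include <stdio.h>
#define N 1000

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    #pragma omp parallel for   /* the only change versus the serial version */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %g\n", N - 1, c[N - 1]);  /* 999 + 1998 = 2997 */
    return 0;
}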
76. OpenMP* Architecture
Fork-join model
Work-sharing constructs
Data environment constructs
Synchronization constructs
Extensive Application Program Interface (API) for
finer control
77. Programming Model
Fork-join parallelism:
• Master thread spawns a team of threads as needed
• Parallelism is added incrementally: the sequential
program evolves into a parallel program
[Diagram: a master thread forks teams of threads at each parallel region and joins them afterward.]
78. OpenMP* Pragma Syntax
Most constructs in OpenMP* are compiler
directives or pragmas.
For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]…]
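For example, a hypothetical loop instantiating this form with two clauses:

#include <stdio.h>

int main(void) {
    float x, sum = 0.0f;
    /* construct = "parallel for"; clauses = private(x) and reduction(+:sum) */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (int i = 0; i < 100; i++) {
        x = 0.5f * i;
        sum += x;
    }
    printf("sum = %g\n", sum);  /* 0.5 * (0 + 1 + ... + 99) = 2475 */
    return 0;
}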
79. Parallel Regions
Defines a parallel region over a structured block of code
Threads are created as the ‘parallel’ pragma is crossed
Threads block at the end of the region
Data is shared among threads unless specified otherwise
[Diagram: #pragma omp parallel forking Thread 1, Thread 2, and Thread 3.]
C/C++:
#pragma omp parallel
{
    block
}
80. How Many Threads?
Set environment variable for number of threads
set OMP_NUM_THREADS=4
There is no standard default for this variable
Many systems:
# of threads = # of processors
Intel® compilers use this default
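A small sketch of querying and overriding the team size at runtime (omp_set_num_threads overrides the environment variable):

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* reflects OMP_NUM_THREADS if set, else the implementation default */
    printf("max threads: %d\n", omp_get_max_threads());

    omp_set_num_threads(4);       /* runtime override of the environment */
    #pragma omp parallel
    {
        #pragma omp single
        printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}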
81. Work-sharing Construct
Splits loop iterations into threads
Must be in the parallel region
Must precede the loop
#pragma omp parallel
#pragma omp for
for (i = 0; i < N; i++) {
    Do_Work(i);
}
82. Work-sharing Construct
Threads are assigned an
independent set of
iterations
Threads must wait at the
end of work-sharing
construct
[Diagram: iterations i = 1 through i = 12 distributed among threads under #pragma omp parallel / #pragma omp for, with an implicit barrier at the end.]
#pragma omp parallel
#pragma omp for
for (i = 1; i < 13; i++)
    c[i] = a[i] + b[i];
83. Combining pragmas
These two code segments are equivalent
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < MAX; i++) {
        res[i] = huge();
    }
}

#pragma omp parallel for
for (i = 0; i < MAX; i++) {
    res[i] = huge();
}
84. Data Environment
OpenMP uses a shared-memory programming
model
Most variables are shared by default.
Global variables are shared among threads
C/C++: File scope variables, static
85. Data Environment
But, not everything is shared...
Stack variables in functions called from parallel regions
are PRIVATE
Automatic variables within a statement block are
PRIVATE
Loop index variables are private (with exceptions)
C/C++: the first loop index variable in nested loops following a #pragma omp for
86. Data Scope Attributes
The default status can be modified with
default (shared | none)
Scoping attribute clauses
shared(varname,…)
private(varname,…)
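A minimal sketch combining these clauses with default(none), which forces every variable in the region to be scoped explicitly (variable names are illustrative):

#include <stdio.h>
#define N 100

int main(void) {
    float a[N];
    int i;
    /* with default(none), any unlisted variable is a compile error */
    #pragma omp parallel for default(none) shared(a) private(i)
    for (i = 0; i < N; i++)
        a[i] = 2.0f * i;
    printf("a[10] = %g\n", a[10]);
    return 0;
}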
87. The Private Clause
Reproduces the variable for each thread
Variables are un-initialized; C++ object is default
constructed
Any value external to the parallel region is undefined
void work(float* a, float* b, float* c, int N) {
    float x, y; int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++) {
        x = a[i]; y = b[i];
        c[i] = x + y;
    }
}
88. Protect Shared Data
Must protect access to shared, modifiable data
float dot_prod(float* a, float* b, int N)
{
float sum = 0.0;
#pragma omp parallel for shared(sum)
for(int i=0; i<N; i++) {
#pragma omp critical
sum += a[i] * b[i];
}
return sum;
}
89. OpenMP* Critical Construct
#pragma omp critical [(lock_name)]
Defines a critical region on a structured block
float RES;
#pragma omp parallel
{
    float B;
    #pragma omp for
    for (int i = 0; i < niters; i++) {
        B = big_job(i);
        #pragma omp critical (RES_lock)
        consum(B, RES);
    }
}
Threads wait their turn; only one at a time calls consum(), thereby protecting RES from race conditions
Naming the critical construct (RES_lock) is optional
90. OpenMP* Reduction Clause
reduction (op : list)
The variables in “list” must be shared in the
enclosing parallel region
Inside parallel or work-sharing construct:
A PRIVATE copy of each list variable is created and
initialized depending on the “op”
These copies are updated locally by threads
At end of construct, local copies are combined through
“op” into a single value and combined with the value in
the original SHARED variable
91. Reduction Example
Local copy of sum for each thread
All local copies of sum added together and stored
in “global” variable
#pragma omp parallel for reduction(+:sum)
for(i=0; i<N; i++) {
sum += a[i] * b[i];
}
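Putting it together: the dot product from slide 88 rewritten with reduction instead of a critical section, as a runnable sketch (array contents are illustrative):

#include <stdio.h>
#define N 1000

float dot_prod(const float *a, const float *b, int n) {
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* each thread accumulates a private copy */
    return sum;
}

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    printf("dot = %g\n", dot_prod(a, b, N));  /* 1000 * 1 * 2 = 2000 */
    return 0;
}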
92. C/C++ Reduction Operations
A range of associative operands can be used with reduction
Initial values are the ones that make sense mathematically

Operand   Initial Value
+         0
*         1
-         0
^         0
&         ~0
|         0
&&        1
||        0
93. Which Schedule to Use

Schedule Clause   When To Use
STATIC            Predictable and similar work per iteration
DYNAMIC           Unpredictable, highly variable work per iteration
GUIDED            Special case of dynamic to reduce scheduling overhead
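As an illustrative sketch (work() is a hypothetical function with deliberately uneven cost), schedule(dynamic, 4) hands out chunks of four iterations as threads become free:

#include <stdio.h>

/* hypothetical work whose cost varies strongly with i */
long work(int i) {
    long s = 0;
    for (long k = 0; k < (i % 7) * 100000L; k++)
        s += k;
    return s;
}

int main(void) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < 64; i++)
        total += work(i);
    printf("total = %ld\n", total);
    return 0;
}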
95. Single Construct
Denotes block of code to be executed by only one thread
First thread to arrive is chosen
Implicit barrier at end
#pragma omp parallel
{
    DoManyThings();
    #pragma omp single
    {
        ExchangeBoundaries();
    } // threads wait here for single
    DoManyMoreThings();
}
96. Master Construct
Denotes block of code to be executed only by the master thread
No implicit barrier at end
#pragma omp parallel
{
    DoManyThings();
    #pragma omp master
    { // if not master skip to next stmt
        ExchangeBoundaries();
    }
    DoManyMoreThings();
}
97. Implicit Barriers
Several OpenMP* constructs have implicit
barriers
parallel
for
single
Unnecessary barriers hurt performance
Waiting threads accomplish no work!
Suppress implicit barriers, when safe, with the
nowait clause
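For instance, a sketch of nowait on two independent loops (arrays are illustrative); the first loop's barrier is suppressed, so threads proceed straight to the second loop:

#include <stdio.h>
#define N 100

int main(void) {
    float a[N], b[N];
    #pragma omp parallel
    {
        #pragma omp for nowait     /* safe: the second loop does not read a[] */
        for (int i = 0; i < N; i++)
            a[i] = 1.0f * i;

        #pragma omp for            /* keeps its implicit barrier */
        for (int i = 0; i < N; i++)
            b[i] = 2.0f * i;
    }
    printf("a[1] = %g, b[1] = %g\n", a[1], b[1]);
    return 0;
}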
98. Barrier Construct
Explicit barrier synchronization
Each thread waits until all threads arrive
#pragma omp parallel shared (A, B, C)
{
    DoSomeWork(A, B);
    printf("Processed A into B\n");
    #pragma omp barrier
    DoSomeWork(B, C);
    printf("Processed B into C\n");
}
99. Atomic Construct
Special case of a critical section
Applies only to simple update of memory location
#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
#pragma omp atomic
x[index[i]] += work1(i);
y[i] += work2(i);
}
100. OpenMP* API
Get the thread number within a team
int omp_get_thread_num(void);
Get the number of threads in a team
int omp_get_num_threads(void);
Usually not needed for OpenMP codes
Can lead to code not being serially consistent
Does have specific uses (debugging)
Must include a header file
#include <omp.h>
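A small usage sketch of these calls (mainly useful for debugging, as noted above):

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();     /* 0 .. team size - 1 */
        int nt = omp_get_num_threads();    /* team size */
        printf("thread %d of %d\n", id, nt);
    }
    return 0;
}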
102. Firstprivate Clause
Variables initialized from shared variable
C++ objects are copy-constructed
incr = 0;
#pragma omp parallel for firstprivate(incr)
for (i = 0; i <= MAX; i++) {
    if ((i % 2) == 0) incr++;
    A[i] = incr;
}
103. Lastprivate Clause
Variables update shared variable using value from last iteration
C++ objects are updated as if by assignment
void sq2(int n, double *lastterm)
{
    double x; int i;
    #pragma omp parallel
    #pragma omp for lastprivate(x)
    for (i = 0; i < n; i++) {
        x = a[i]*a[i] + b[i]*b[i];
        b[i] = sqrt(x);
    }
    *lastterm = x;
}
104. Threadprivate Clause
Preserves global scope for per-thread storage
Legal for namespace-scope and file-scope variables
Use copyin to initialize from master thread
struct Astruct A;
#pragma omp threadprivate(A)
…
#pragma omp parallel copyin(A)
do_something_to(&A);
…
#pragma omp parallel
do_something_else_to(&A);
Private copies of “A” persist between regions
105. Performance Issues
Idle threads do no useful work
Divide work among threads as evenly as possible
Threads should finish parallel tasks at same time
Synchronization may be necessary
Minimize time waiting for protected resources
106. Load Imbalance
Unequal work loads lead to idle threads and wasted time.
#pragma omp parallel
{
    #pragma omp for
    for ( ; ; ) {
    }
}
[Diagram: per-thread timeline showing busy vs. idle time.]
107. Synchronization
Lost time waiting for locks
#pragma omp parallel
{
    #pragma omp critical
    {
        ...
    }
    ...
}
[Diagram: per-thread timeline showing busy, idle, and in-critical time.]
108. Performance Tuning
Profilers use sampling to provide performance
data.
Traditional profilers are of limited use for tuning
OpenMP*:
Measure CPU time, not wall clock time
Do not report contention for synchronization objects
Cannot report load imbalance
Are unaware of OpenMP constructs
Programmers need profilers specifically designed for OpenMP.
109. Static Scheduling: Doing It By Hand
Must know:
Number of threads (Nthrds)
Each thread ID number (id)
Compute start and end iterations:
#pragma omp parallel
{
    int i, istart, iend;
    int id = omp_get_thread_num();
    int Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) {
        c[i] = a[i] + b[i];
    }
}
SMP: symmetric multi-processor
NUMA: Non Uniform Memory Access
Today we have examples of 8-core devices and have seen roadmaps indicating that by 2008 or '09, 16-core devices will exist. Intel last year demonstrated technology that they say will lead to an 80-core device in five years or so. It is probably a good bet that by 2020 we will have devices with 1K cores.
So what is the problem with moving to TLP architectures? Obviously the future of processing performance growth looks bright using this approach. Highly sophisticated runtime instruction-level-parallelism extraction is no longer required, so there is no longer a need to scrape the bottom of the barrel for a few percent better performance.
However, research in key areas is still critical:
Methods to reduce leakage currents as feature sizes continue to shrink
Power management techniques
Maybe asynchronous design
Off-chip memory and I/O bandwidth have become even more of a bottleneck, and improvements in this area are still needed
Silicon lasers for free-air optical interfaces, for example
But these aren't the big problems ahead for us.
The IBM Cell processor used in the PlayStation 3 has been shown to deliver an order-of-magnitude leap over many comparable technologies. This is due to some radical changes in micro-architecture:
Application-managed memory in the SPEs (no hardware cache); explicit data movement via DMA is required
SPEs are simple SIMD machines: no dedicated scalar pipeline, no sophisticated branch-prediction logic (mainly done with compile-time branch hints), very high-bandwidth interconnect, and very high chip I/O bandwidth (OK on memory bandwidth)
Heterogeneous processor core architecture (8 SPEs and 1 PPE, a PowerPC machine)
OpenMP* Pragma Syntax
For Fortran, you can also use other sentinels (!$OMP), but these must line up exactly in columns 1-5. Column 6 must be blank or contain a + indicating that this line is a continuation of the previous line.
C$OMP
*$OMP
!$OMP
Parallel Regions
Other Fortran Methods.
Fortran will make all nested loop indices private.
Data Scope Attributes
All data clauses apply to parallel regions and worksharing constructs except “shared,” which only applies to parallel regions.
Private Clause
For-loop iteration variable is PRIVATE by default.
Sections are distributed among the threads in the parallel team. Each section is executed only once and each thread may execute zero or more sections. It’s not possible to determine whether or not a section will be executed before another. Therefore, the output of one section should not serve as the input to another. Instead, the section that generates output should be moved before the sections construct.
Atomic Construct
Since index[i] can be the same for different i values, the update to x must be protected. Use of a critical section would serialize updates to x. Atomic protects individual elements of the x array, so that when concurrent instances of index[i] differ, updates can still be done in parallel.
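A runnable sketch of this point (the index contents are illustrative): atomic serializes only the colliding updates, so updates to distinct elements of x proceed in parallel.

#include <stdio.h>
#define N 8

int main(void) {
    float x[4] = {0};
    int index[N] = {0, 1, 2, 3, 0, 1, 2, 3};  /* repeats make protection necessary */

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp atomic                 /* protects x[index[i]] element-wise */
        x[index[i]] += 1.0f;
    }
    printf("x[0] = %g\n", x[0]);  /* each element incremented twice: 2 */
    return 0;
}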
Load Imbalance
Relieve load imbalance
Static even scheduling
Equal size iteration chunks
Based on runtime loop limits
Totally parallel scheduling
OpenMP* default
Dynamic and guided scheduling
The threads do some work then get the next chunk.
There is some cost in scheduling algorithm.
Synchronization
Methods for tuning synchronization:
Reduce lock contention
Different named critical sections
Introduce domain-specific locks
Merge parallel loops and remove barriers
Merge small critical sections
Move critical sections outside loops
Performance Tuning
Mention time inside synch constructs, and so on.