Introduction to Multi-Core Architectures: A Paradigm Shift in IT
Dr NB Venkateswarlu
RITCH CENTER,
Visakhapatnam
Prof in CSE, AITAM, TEKKALI
Index
 What is Computer Architecture
 What are the Computer Components
 Classification of Computers
 The Dawn of Multi-Core (Stream) processors.
 Symmetric Multi-Core (SMP)
 Heterogeneous Multi-Core Processors
 GPU (Graphic Processing Unit)
 Intel Family of Multi-Core Processors
 AMD Family of Multi-Core Processors
 Others
 Multi-Threading, Hyper-Threading, Multi-Core
 OpenMP Programming Paradigm
Concepts
What is Computer Architecture
[Figure: levels of a computer system, from Application, Compiler, and Operating System down through the Instruction Set Architecture, Firmware, Datapath & Control (the Instruction Set Processor and I/O system), Digital Design, Circuit Design, and Layout.]
Evolution of computer architecture
[Figure: evolution of computer architecture, from the scalar sequential processor through I/E overlap, functional parallelism, multiple functional units and pipelining, to pseudo-vector and explicit vector machines (memory-memory and register-register), and on to SIMD (array) and MIMD (multiprocessing, multi-computer) systems, culminating in today's multi-core processors.]
Categories of computer system
 Flynn's Classification
 SISD
 SIMD
 MISD
 MIMD
 SISD
[Figure: SISD organization, a control unit (CU) sending an instruction stream (IS) to a single processing unit (PU), which exchanges a data stream (DS) with the memory unit (MU) and I/O.]
 SIMD
[Figure: SIMD organization, a single control unit (CU) broadcasting the instruction stream (IS) to processing units PU1 … PUn, each exchanging its own data stream (DS) with memory modules MM1 … MMn.]
MIMD
[Figure: MIMD organization, processors Proc.0 … Proc.n-1 with their own control units and memories Mem.0 … Mem.n-1, connected through an interconnection network.]
MISD
[Figure: MISD organization, control units CU1 … CUn each issuing its own instruction stream (IS) to processing elements PE1 … PEn, which operate on a single data stream (DS) passed through shared memory modules MM1 … MMn and I/O.]
Amdahl’s Law
 states that the performance improvement to be gained from
using some faster mode of execution is limited by the fraction
of the time the faster mode can be used.
Amdahl’s Law
[Figure: execution time on a unit scale, split into a serial fraction S and a parallelizable fraction P = 1 – S.]

Speedup = 1 / (S + (1 – S)/N)

where N = number of parallel processors.
Example: S = 0.6, N = 10 gives Speedup = 1.56; S = 0.6, N = ∞ gives Speedup = 1.67.
Gene Amdahl, “Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities,” AFIPS Conference Proceedings,
(30), pp. 483-485, 1967.
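As a quick sanity check of these numbers, here is a minimal sketch in plain C (the values are the ones from the slide; a very large N stands in for N = ∞):

#include <stdio.h>

/* Amdahl's Law: speedup for serial fraction S on N processors. */
static double amdahl_speedup(double S, double N) {
    return 1.0 / (S + (1.0 - S) / N);
}

int main(void) {
    printf("S = 0.6, N = 10  -> speedup %.2f\n", amdahl_speedup(0.6, 10.0));  /* ~1.56 */
    printf("S = 0.6, N = 1e9 -> speedup %.2f\n", amdahl_speedup(0.6, 1e9));   /* -> 1.67 */
    return 0;
}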
CPU
The Central Processing Unit (CPU) / Microprocessor (µp)
 The brain of the computer
 Its task is to execute instructions
 It keeps on executing instructions from the moment it is
powered-on to the moment it is powered-off.
 Execution of various instructions causes the computer to
perform various tasks.
 There is a wide range of CPUs
 The dominant CPUs on the market today are from
 Intel: Pentium (Desktop), Xeon (Server), Pentium-M (Mobile), X-
scale (Embedded)
 AMD: Athlon (Desktop), Opteron (Server), Turion (Mobile), Geode
(Embedded)
Key Characteristics of a CPU
Several metrics are used to describe the characteristics of a CPU
 Native “Word” size
 32-bit or 64-bit
 Clock speeds
 The clock speed of a CPU defines the rate at which the internal clock of the CPU
operates
 The clock sequences and determines the speed at which instructions are executed by the CPU
 Modern CPUs operate in Gigahertz (1 GHz = 10^9 Hz)
 However, clock speeds alone do not determine the overall ability of a CPU
 Instruction set
 More versatile CPUs support richer and more efficient instruction sets
 Modern CPUs have new instruction sets such as SSE, SSE2, 3DNow that can be
used to boost performance of many common applications
 These instructions perform the operation of several instructions but take less time
 FLOPS: Floating Point Operations per Second
 FLOPS are a better metric for measuring a CPU’s computational capabilities
 Modern CPUs deliver 2-3 GigaFLOPS (1 GFLOPS = 10^9 FLOPS)
 FLOPS is also used as a metric for describing computational capabilities of
computer systems
 Modern supercomputers deliver 1 TeraFLOP (TFLOP), 1 TFLOP = 10^12 FLOPS.
Key Characteristics of a CPU (Contd.)
Several metrics are used to describe the characteristics of a CPU
 Power consumption
 Power is an important factor in today’s computing
 Power is the product of voltage applied to the CPU (V) and the amount of current
drawn by the CPU (I)
 The unit for voltage is volts
 The unit for current is ampere (aka amps)
 Power = V x I
 Power is typically represented in Watts
 Modern desktop processors consume anywhere from 35 to 100 watts
 Lower power is better as power is proportional to heat
 More power implies the processor generates more heat and heating is a big problem
 Number of computational units (cores) per CPU
 Some CPUs have multiple cores
 Each core is an independent computational unit and can execute instructions in parallel
 The more cores a CPU has, the better it is for multi-threaded applications
 Single-threaded applications typically experience a slowdown on multi-core processors due to reduced clock speeds
Key Characteristics of a CPU (Contd.)
Several metrics are used to describe the characteristics of a CPU
 Cache size and configuration
 Cache is a small, but high speed memory that is fabricated along with the CPU
 Size of cache is inversely proportional to its speed
 Cost of the CPU increases as size of cache increases
 It is much faster than RAM
 It is used to minimize the overall latency of accessing RAM
 Microprocessors have a hierarchy of caches
 L1 cache: Fastest and closest to the core components
 L3 cache: Relatively slower and further away from CPU
 Example cache configurations
 Quad core AMD Opteron (Shanghai)
 32 KB (Data) + 32 KB (Instr.) L1 cache per core
 Unified 512 KB L2 cache per core
 Unified 6 MB shared L3 cache (for 4 cores)
 Quad core Intel Xeon (Nehalem)
 32 KB (Data) + 32 KB (Instr.) L1 Cache per core
 Unified 256 KB L2 cache per core
 Unified 8 MB shared L3 cache (for 4 cores)
Trends in computing
 Until recently, hardware performance
improvements have been primarily achieved due
to advancement in microprocessor fabrication
technologies:
 Steady improvement in processor clock speeds
 Faster clocks (within the same family of processors)
provide higher FLOPS
 Increase in number of transistors on-chip
 More complex and sophisticated hardware to improve
performance
 Larger caches to provide rapid access to instruction and
data
Moore’s Law
 The steady advancement in microprocessor technology was
predicted by Gordon Moore (in 1965), co-founder of Intel®
 Moore’s law states that the number of transistors on
microprocessors will double approximately every two years.
 Many advancements in digital technologies can be linked to Moore’s
law. This includes:
 Processing speed
 Memory Capacity
 Speed and bandwidth of data communication networks and
 Resolution of monitors and digital cameras
 Thus far, Moore’s law has steadily held true for about 40 years
(from 1965 to about 2005)
 Breakthroughs in the miniaturization of transistors have been the key technology
Moore’s Law vs. Intel’s roadmap
 Here is a graph illustrating the progress of Moore’s law based on Intel®
Inc. technological roadmap (obtained from Wikipedia)
Stagnation of Moore’s Law
 In the past few years we have reached the fundamental limits of IC fabrication technology (particularly lithography and interconnect)
 It is no longer feasible to further miniaturize the transistors on the IC
 They are already just a few atoms across, and at this point the laws of physics change, making further scaling an extremely challenging task
 Heat dissipation has reached breakdown threshold
 Heat is generated as a part of regular transistor operations
 Higher heat dissipations will cause the transistors to fail
 With the current state of the art a single processor
cannot yield any more than 4 to 5 GFLOPS
 How do we move beyond this barrier?
SIA Roadmap for Processors (1999)

Year                     1999   2002   2005   2008   2011   2014
Feature size (nm)         180    130    100     70     50     35
Logic transistors/cm^2   6.2M    18M    39M    84M   180M   390M
Clock (GHz)              1.25    2.1    3.5    6.0   10.0   16.9
Chip size (mm^2)          340    430    520    620    750    900
Power supply (V)          1.8    1.5    1.2    0.9    0.6    0.5
High-perf. power (W)       90    130    160    170    175    183

Source: http://www.semichips.org
ISSCC, Feb. 2001, Keynote
“Ten years from now,
microprocessors will run at
10GHz to 30GHz and be capable
of processing 1 trillion operations
per second -- about the same
number of calculations that the
world's fastest supercomputer
can perform now.
“Unfortunately, if nothing
changes these chips will produce
as much heat, for their
proportional size, as a nuclear
reactor. . . .”
Patrick P. Gelsinger
Senior Vice President
General Manager
Digital Enterprise Group
INTEL CORP.
Intel View
 Soon >billion transistors integrated
 Clock frequency can still increase
 Future applications will demand TIPS
 Power? Heat?
TEMPERATURE IMAGE OF CELL PROCESSOR
[Figure: thermal image of the Cell processor die.]
Multi-core and Multi-processors
 The solution to increasing the effective compute power is
via the use of multi-core, multi-processor computer
system along with suitable software
 This is a paradigm shift in hardware and software
technologies
 Multiple cores and multiple processors are
interconnected using a variety of high speed data
communication networks
 Software plays a central role in harnessing the power of
multi-core/multi-processor systems
 Most industry leaders believe this is the near future of
computing!
Everything is to be GREEN NOW!
 Green Buildings
 Green-IT
Multi-core Trends
 Multi-core processors are most definitely the future of computing
 Both Intel and AMD are pushing for larger number of cores per CPU
package
 The Cell Broadband Engine (aka Cell) has 8 synergistic processing
elements (SPE)
 The Sun Microsystems Niagara has 8 cores, with each core capable of
running 8-threads.
Sun Niagara
 32-way parallelism typically delivered a 23x performance improvement
[Figure: Sun Niagara, GFlop/s across sparse benchmarks (Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Har, QCD, FEM-Ship, Econom, Epidem, FEM-Accel, Circuit, Webbase, LP, and the Median), comparing naïve code on 8 cores x 4 threads against 1 core x 1 thread.]
A GENERIC MODERN PC
SINGLE CORE VS MULTI-CORE (8): A COMPARISON FROM GEORGIA PACKING INSTITUTE
INTEL CORE-DUO
AMD DUAL CORE PROCESSOR
Multicore Architecture
Hierarchy of Modular Building Blocks
[Figure: hierarchy of modular building blocks, from core to chip, board, rack, and grid/cluster, linked by SMP interconnect and high-speed networks.]
• Core: will support multiple HW threads sharing a single cache, exhibiting SMP characteristics
• Chip: homogeneous SMP on chip (2-32 HW contexts, various forms of resource sharing); heterogeneous collections of processors on chip (heterogeneity at the data and control flow level); main processor(s) with accelerator(s) in a master-slave relationship
• Board: homogeneous SMP on board (2-128 HW contexts), with memory, memory controller, I/O attach, cache, and interconnect fabric
• Rack: hierarchical SMP servers with non-uniform memory access (NUMA) characteristics
• Grid/Cluster: connected over the high-speed network
• Systems will increasingly need to implement a hybrid execution model
• New programming systems need to reduce the need for programmer awareness of the topology on which their program executes
The next-gen programming system must support programming simplicity while leveraging the performance of the underlying HW topology.
Looming “Multicore Crisis”
Slide Source: Berkeley View of Landscape
Architecture trends
 Several processor cores on a chip and specialized computing
engines
 XML processing, cryptography, graphics
 Questions:
 how to interconnect large number of processor cores
 how to provide sufficient memory bandwidth
 how to structure the multilevel caching subsystem
 how to balance the general purpose computing resources with
specialized processing engines and all the supporting memory,
caching and interconnect structure, given a constant power
budget
 Software development processes
 how to program for multicore architectures
 how to test and evaluate the performance of multithreaded
applications
Intel Multi-Core
Processors
Key Features: Dual Core
 Two physical cores in a package
 Each with its own execution resources
 Each with its own L1 cache
 32K instruction and 32K data
 8-way set associative; 64-byte line
 Both cores share the L2 cache
 2MB 8-way set associative; 64-byte line size
 10 clock cycles latency; Write Back update policy
Two Actual Processor Cores
[Figure: two execution cores, each with its own EXE core, FP unit, and L1 cache, sharing an L2 cache and the system bus (667 MHz, 5333 MB/s).]
Truly Parallel Multitasking & Multi-threaded Execution
Key Features: Smart Cache
 Shared between the two cores
 Advanced Transfer Cache architecture
 Reduced bus traffic
 Both cores have full access to the entire cache
 Dynamic Cache sizing
[Figure: Advanced Smart Cache, Core 1 and Core 2 sharing a 2 MB L2 cache connected to the bus.]
Enables Greater System Responsiveness
Key Features: Enhanced SpeedStep Technology
 Multi-point demand-based switching
 Supports transitions to deeper sleep modes
 Clock partitioning and recovery
 Dynamic Bus Parking
Breakthrough Performance and Improved Battery Life
Wide Dynamic Execution
Macro Fusion
Smart Memory Access
Key Features: Digital Media Boost
 Streaming SIMD Extensions (SSE) decoder throughput improvement
 SSE/SSE2 instruction optimization
 Floating-point performance enhancement
 New Enhanced Streaming SIMD Extensions 3 (SSE3)
Benefiting applications:
 High Performance Computing
 Digital Photography
 Digital Music
 Video Editing
 Internet Content Creation
 3D & 2D Modeling
 CAD Tools
Providing True SIMD Integer & Floating Point Performance!
Key Features: Digital Media Boost – SSE3 Instructions
Use case                    SSE3 instructions
SIMD FP using AOS format*   HADDPD, HSUBPD, HADDPS, HSUBPS
Thread synchronization      MONITOR, MWAIT
Video encoding              LDDQU
Complex arithmetic          ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
FP to integer conversions   FISTTP
* Also benefits Complex Arithmetic and Vectorization
Intel Xeon (Clovertown)
 pseudo quad-core / socket
 4-issue, out of order,
super-scalar
 Fully pumped SSE
 2.33GHz
 21GB/s, 74.6 GFlop/s
[Figure: two Clovertown sockets, each containing two pairs of Xeon cores that share a 4 MB L2 cache, connected by front-side buses (FSB) to the Blackford chipset and fully buffered DRAM; 10.6 GB/s per FSB, 21.3 GB/s read and 10.6 GB/s write to memory.]
AMD Opteron
 Dual core / socket
 3-issue, out of order,
super-scalar
 Half pumped SSE
 2.2GHz
 21GB/s, 17.6 GFlop/s
 Strong NUMA issues
[Figure: two dual-core Opteron sockets, each core with a 1 MB victim cache, each socket with its own memory controller / HyperTransport link to DDR2 DRAM (10.6 GB/s per socket) and an 8 GB/s link between sockets.]
IBM Cell Blade
 Eight SPEs / socket
 Dual-issue, in-order, VLIW like
 SIMD only ISA
 Disjoint Local Store address
space + DMA
 Weak DP FPU
 51.2GB/s, 29.2GFlop/s,
 Strong NUMA issues
[Figure: two Cell sockets, each with a PPE (512 KB L2) and eight SPEs (256 KB local store and MFC each) on the EIB ring network (<20 GB/s in each direction), XDR DRAM at 25.6 GB/s per socket, and a BIF link between the sockets.]
Sun Niagara
 Eight core
 Each core is 4 way multithreaded
 Single-issue in-order
 Shared, very slow FPU
 1.0GHz
 25.6GB/s, 0.1GFlop/s, (8GIPS)
[Figure: eight multithreaded UltraSPARC cores, each with an 8 KB D$, sharing a single FPU and a 3 MB 12-way shared L2 over a crossbar switch (64 GB/s fill, 32 GB/s write-through), with 25.6 GB/s to DDR2 DRAM.]
Cell’s Nine-Processor Chip
[Figure: die photo of the Cell processor (© IEEE Spectrum, January 2006), eight identical processors; f = 5.6 GHz (max), 44.8 GFlops.]
Symmetric Multi-Core Processors
Phenom X4
Symmetric Multi-Core Processors
UltraSparc
Future of multi-cores
 Moore’s Law predicts that the number of cores
will double every 18 - 24 months
 2007 - 8 cores on a chip (IBM Cell has 9 and Sun has 8)
 2009 - 16 cores
 2013 - 64 cores
 2015 - 128 cores
 2021 - 1k cores
Exploiting TLP has the potential to close Moore’s gap
Real performance has the potential to scale with number of
cores
IBM Cell Example: Will this be the trend for large multi-core designs?
Cell Processor Components
• Power Processor Element (PPE)
• Element Interconnect Bus (EIB)
• 8 Synergistic Processing Elements (SPE)
• Rambus XDR memory controller
• Flex I/O
[Figure: Cell block diagram, the PPE (PowerPC core with 512 KB L2 cache) and eight SPEs connected by the 4 x 128-bit Element Interconnect Bus, with the Rambus XDR memory interface and Flex I/O.]
Obvious silicon real-estate advantage for software memory managed processors
[Figure: die comparison of IBM, AMD, and Intel processors with the Cell BE.]
Multithreading and Multi-Core Processors
Instruction-Level Parallelism (ILP)
 Extract parallelism in a single program
 Superscalar processors have multiple execution
units working in parallel
 Challenge to find enough instructions that can be
executed concurrently
 Out-of-order execution => instructions are sent
to execution units based on instruction
dependencies rather than program order
Performance Beyond ILP
 Much higher natural parallelism in some applications
 Database, web servers, or scientific codes
 Explicit Thread-Level Parallelism
 Thread: has own instructions and data
 May be part of a parallel program or independent programs
 Each thread has all state (instructions, data, PC, register state,
and so on) needed to execute
 Multithreading: Thread-Level Parallelism within a processor
Thread-Level Parallelism (TLP)
 ILP exploits implicit parallel operations within loop or
straight-line code segment
 TLP is explicitly represented by multiple threads of
execution that are inherently parallel
 Goal: Use multiple instruction streams to improve
 Throughput of computers that run many programs
 Execution time of multi-threaded programs
 TLP could be more cost-effective to exploit than ILP
Fine-Grained Multithreading
 Switches between threads on each instruction, interleaving
execution of multiple threads
 Usually done round-robin, skipping stalled threads
 CPU must be able to switch threads on every clock cycle
 Pro: Hide latency of both short and long stalls
 Instructions from other threads always available to execute
 Easy to insert on short stalls
 Con: Slow down execution of individual threads
 Thread ready to execute without stalls will be delayed by instructions
from other threads
 Used on Sun’s Niagara
Coarse-Grained Multithreading
 Switches threads only on costly stalls
 e.g., L2 cache misses
 Pro: No switching each clock cycle
 Relieves need to have very fast thread switching
 No slow down for ready-to-go threads
 Other threads only issue instructions when the main one would stall (for
long time) anyway
 Con: Limitation in hiding shorter stalls
 Pipeline must be emptied or frozen on stall, since CPU issues
instructions from only one thread
 New thread must fill pipe before instructions can complete
 Thus, better for reducing penalty of high-cost stalls where pipeline
refill << stall time
 Used in IBM AS/400
Multithreading Paradigms
[Figure: execution time vs. functional units FU1-FU4 for five threads, comparing a conventional single-threaded superscalar, coarse-grained multithreading (block interleaving), fine-grained multithreading (cycle-by-cycle interleaving), simultaneous multithreading (SMT, Intel’s HT), and a chip multiprocessor (CMP), today called a multi-core processor; unused slots are idle issue slots.]
Simultaneous Multithreading (SMT)
 Exploits TLP at the same time it exploits ILP
 Intel’s HyperThreading (2-way SMT)
 Others: IBM Power5 and Intel’s future multicore (8-core, 2-thread 45nm Nehalem)
 Basic ideas: Conventional MT + Simultaneous issue + Sharing common resources
[Figure: SMT datapath, two fetch units with per-thread PCs feeding a shared I-cache and decoder, per-thread register renaming, RS & ROB plus physical register files, and shared execution units (ALU1, ALU2, FAdd at 2 cycles, FMult at 4 cycles, unpipelined FDiv at 16 cycles, variable-latency Load/Store) with a shared D-cache.]
Simultaneous Multithreading (SMT)
 Insight that dynamically scheduled processor already has many
HW mechanisms to support multithreading
 Large set of virtual registers that can be used to hold register sets for
independent threads
 Register renaming provides unique register identifiers
 Instructions from multiple threads can be mixed in data path
 Without confusing sources and destinations across threads!
 Out-of-order completion allows the threads to execute out of order,
and get better utilization of the HW
 Just add per-thread renaming table and keep separate PCs
 Independent commitment can be supported via separate reorder
buffer for each thread
SMT Pipeline
[Figure: SMT pipeline, Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire, with per-thread PCs and register maps and shared Icache, Dcache, and register files. Data from Compaq.]
Power 4
Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each can issue an instruction each cycle.
[Figure: Power 4 vs. Power 5 pipelines, Power 5 adds 2 fetch units (PCs), 2 initial decodes, and 2 commits (architected register sets) to support two threads.]
Pentium 4 Hyperthreading: Performance Improvements
Questions To Be Asked While Migrating to Multi-Core
Application performance
1. Does the concurrency platform allow me to measure the parallelism I’ve exposed in my application?
2. Does the concurrency platform address response-time bottlenecks, or just offer more throughput?
3. Does application performance scale up linearly as cores are added, or does it quickly reach diminishing returns?
4. Is my multicore-enabled code just as fast as my original serial code when run on a single processor?
5. Does the concurrency platform's scheduler load-balance irregular applications efficiently to achieve full utilization?
6. Will my application “play nicely” with other jobs on the system, or do multiple jobs cause thrashing of resources?
7. What tools are available for detecting multicore performance bottlenecks?
Software reliability
8. How much harder is it to debug my multicore-enabled application than to debug my original application?
9. Can I use my standard, familiar debugging tools?
10. Are there effective debugging tools to identify and localize parallel-programming errors, such as data-race bugs?
11. Must I use a parallel debugger even if I make an ordinary serial programming error?
12. What changes must I make to my release-engineering processes to ensure that my delivered software is reliable?
13. Can I use my existing unit tests and regression tests?
Development time
14. To multicore-enable my application, how much logical restructuring of my application must I do?
15. Can I easily train programmers to use the concurrency platform?
16. Can I maintain just one code base, or must I maintain separate serial and parallel versions?
17. Can I avoid rewriting my application every time a new processor generation increases the core count?
18. Can I easily multicore-enable ill-structured and irregular code, or is the concurrency platform limited to data-parallel applications?
19. Does the concurrency platform properly support modern programming paradigms, such as objects, templates, and exceptions?
20. What does it take to handle global variables in my application?
OpenMP: A Portable
Solution for
Threading
What Is OpenMP*?
 Compiler directives for multithreaded
programming
 Easy to create threaded Fortran and C/C++
codes
 Supports data parallelism model
 Incremental parallelism
 Combines serial and parallel code in single source
OpenMP* Architecture
 Fork-join model
 Work-sharing constructs
 Data environment constructs
 Synchronization constructs
 Extensive Application Program Interface (API) for
finer control
Programming Model
Fork-join parallelism:
• Master thread spawns a team of threads as needed
• Parallelism is added incrementally: the sequential program evolves into a parallel program
[Figure: the master thread forking a team of threads at each parallel region and joining after it.]
OpenMP* Pragma Syntax
 Most constructs in OpenMP* are compiler
directives or pragmas.
 For C and C++, the pragmas take the form:
 #pragma omp construct [clause [clause]…]
Parallel Regions
 Defines a parallel region over a structured block of code
 Threads are created as the ‘parallel’ pragma is crossed
 Threads block at the end of the region
 Data is shared among threads unless specified otherwise
[Figure: #pragma omp parallel forking threads 1, 2, and 3 over the structured block.]
C/C++:
#pragma omp parallel
{
  block
}
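To make this concrete, here is a minimal, self-contained sketch of a parallel region (compile with the compiler's OpenMP switch, e.g. -fopenmp or /Qopenmp):

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Every thread in the team executes the structured block once. */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   /* implicit barrier and join at the end of the region */
    return 0;
}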
How Many Threads?
 Set environment variable for number of threads
set OMP_NUM_THREADS=4
 There is no standard default for this variable
 Many systems:
 # of threads = # of processors
 Intel® compilers use this default
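The thread count can also be requested programmatically; a small sketch using standard OpenMP runtime calls:

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);                    /* request 4 threads for later parallel regions */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)         /* one thread reports the team size */
            printf("Team size: %d\n", omp_get_num_threads());
    }
    return 0;
}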
Work-sharing Construct
 Splits loop iterations into threads
 Must be in the parallel region
 Must precede the loop
#pragma omp parallel
#pragma omp for
for (I=0;I<N;I++){
Do_Work(I);
}
Work-sharing Construct
 Threads are assigned an independent set of iterations
 Threads must wait at the end of the work-sharing construct
[Figure: iterations i = 1 … 12 divided among the threads under #pragma omp parallel / #pragma omp for, with an implicit barrier at the end.]
#pragma omp parallel
#pragma omp for
for (i = 1; i < 13; i++)
  c[i] = a[i] + b[i];
Combining pragmas
 These two code segments are equivalent
#pragma omp parallel
{
  #pragma omp for
  for (i=0; i< MAX; i++) {
    res[i] = huge();
  }
}

#pragma omp parallel for
for (i=0; i< MAX; i++) {
  res[i] = huge();
}
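Put together as a complete (illustrative) program, the combined form might look like the following; the array size and contents are made up for the example:

#include <omp.h>
#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Loop iterations are divided among the threads of the team. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N-1]);
    return 0;
}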
Data Environment
 OpenMP uses a shared-memory programming
model
 Most variables are shared by default.
 Global variables are shared among threads
 C/C++: File scope variables, static
Data Environment
 But, not everything is shared...
 Stack variables in functions called from parallel regions
are PRIVATE
 Automatic variables within a statement block are
PRIVATE
 Loop index variables are private (with exceptions)
 C/C++: The first loop index variable in nested loops following a #pragma omp for
Data Scope Attributes
 The default status can be modified with
 default (shared | none)
 Scoping attribute clauses
 shared(varname,…)
 private(varname,…)
The Private Clause
 Reproduces the variable for each thread
 Variables are un-initialized; C++ object is default
constructed
 Any value external to the parallel region is undefined
void* work(float* c, int N) {
float x, y; int i;
#pragma omp parallel for private(x,y)
for(i=0; i<N; i++) {
x = a[i]; y = b[i];
c[i] = x + y;
}
}
Protect Shared Data
 Must protect access to shared, modifiable data
float dot_prod(float* a, float* b, int N)
{
float sum = 0.0;
#pragma omp parallel for shared(sum)
for(int i=0; i<N; i++) {
#pragma omp critical
sum += a[i] * b[i];
}
return sum;
}
 #pragma omp critical [(lock_name)]
 Defines a critical region on a structured block
OpenMP* Critical Construct
float RES;
#pragma omp parallel
{ float B;
#pragma omp for
for(int i=0; i<niters; i++){
B = big_job(i);
#pragma omp critical (RES_lock)
consum (B, RES);
}
}
Threads wait their turn; at any time, only one thread calls consum(), thereby protecting RES from race conditions.
Naming the critical construct RES_lock is optional.
OpenMP* Reduction Clause
 reduction (op : list)
 The variables in “list” must be shared in the
enclosing parallel region
 Inside parallel or work-sharing construct:
 A PRIVATE copy of each list variable is created and
initialized depending on the “op”
 These copies are updated locally by threads
 At end of construct, local copies are combined through
“op” into a single value and combined with the value in
the original SHARED variable
Reduction Example
 Local copy of sum for each thread
 All local copies of sum added together and stored
in “global” variable
#pragma omp parallel for reduction(+:sum)
for(i=0; i<N; i++) {
sum += a[i] * b[i];
}
 A range of associative operands can be used with
reduction
 Initial values are the ones that make sense
mathematically
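For comparison with the critical-section version of dot_prod shown earlier, a reduction-based sketch (same assumed signature) could be written as:

#include <omp.h>

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0f;
    /* Each thread gets a private copy of sum initialized to 0;
       the copies are combined with + at the end of the construct. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}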
C/C++ Reduction Operations
Operand   Initial Value
+         0
*         1
-         0
^         0
&         ~0
|         0
&&        1
||        0
Which Schedule to Use

Schedule Clause   When To Use
STATIC            Predictable and similar work per iteration
DYNAMIC           Unpredictable, highly variable work per iteration
GUIDED            Special case of dynamic to reduce scheduling overhead
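A minimal sketch of how a schedule clause is attached to a work-sharing loop; the chunk size of 4 and the function name are arbitrary placeholders:

/* Iterations are handed out in chunks of 4 as threads become free,
   which helps when the work per iteration varies widely. */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++) {
    unpredictable_work(i);   /* hypothetical, variable-cost function */
}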
Parallel Sections
 Independent sections of code can execute concurrently
[Figure: phase1, phase2, and phase3 executed serially vs. in parallel.]
#pragma omp parallel sections
{
  #pragma omp section
  phase1();
  #pragma omp section
  phase2();
  #pragma omp section
  phase3();
}
 Denotes block of code to be executed by only
one thread
 First thread to arrive is chosen
 Implicit barrier at end
Single Construct
#pragma omp parallel
{
DoManyThings();
#pragma omp single
{
ExchangeBoundaries();
} // threads wait here for single
DoManyMoreThings();
}
 Denotes block of code to be executed only by
the master thread
 No implicit barrier at end
Master Construct
#pragma omp parallel
{
DoManyThings();
#pragma omp master
{ // if not master skip to next stmt
ExchangeBoundaries();
}
DoManyMoreThings();
}
Implicit Barriers
 Several OpenMP* constructs have implicit
barriers
 parallel
 for
 single
 Unnecessary barriers hurt performance
 Waiting threads accomplish no work!
 Suppress implicit barriers, when safe, with the
nowait clause
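A hedged sketch of suppressing the barrier at the end of a work-sharing loop (the function names are placeholders, and the later work must not depend on the loop's results):

#pragma omp parallel
{
    /* No barrier at the end of this loop: threads that finish early
       continue straight on to more_independent_work(). */
    #pragma omp for nowait
    for (int i = 0; i < N; i++)
        independent_work(i);      /* hypothetical */

    more_independent_work();      /* hypothetical */
}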
Barrier Construct
 Explicit barrier synchronization
 Each thread waits until all threads arrive
#pragma omp parallel shared (A, B, C)
{
  DoSomeWork(A,B);
  printf("Processed A into B\n");
  #pragma omp barrier
  DoSomeWork(B,C);
  printf("Processed B into C\n");
}
Atomic Construct
 Special case of a critical section
 Applies only to simple update of memory location
#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
#pragma omp atomic
x[index[i]] += work1(i);
y[i] += work2(i);
}
OpenMP* API
 Get the thread number within a team
 int omp_get_thread_num(void);
 Get the number of threads in a team
 int omp_get_num_threads(void);
 Usually not needed for OpenMP codes
 Can lead to code not being serially consistent
 Does have specific uses (debugging)
 Must include a header file
#include <omp.h>
More OpenMP*
 Data environment constructs
 FIRSTPRIVATE
 LASTPRIVATE
 THREADPRIVATE
 Variables initialized from shared variable
 C++ objects are copy-constructed
Firstprivate Clause
incr = 0;
#pragma omp parallel for firstprivate(incr)
for (I=0; I<=MAX; I++) {
  if ((I%2)==0) incr++;
  A[I] = incr;
}
 Variables update shared variable using value from
last iteration
 C++ objects are updated as if by assignment
Lastprivate Clause
void sq2(int n, double *lastterm)
{
  double x; int i;
  #pragma omp parallel
  #pragma omp for lastprivate(x)
  for (i = 0; i < n; i++) {
    x = a[i]*a[i] + b[i]*b[i];
    b[i] = sqrt(x);
  }
  *lastterm = x;
}
 Preserves global scope for per-thread storage
 Legal for name-space-scope and file-scope
 Use copyin to initialize from master thread
Threadprivate Clause
struct Astruct A;
#pragma omp threadprivate(A)
…
#pragma omp parallel copyin(A)
do_something_to(&A);
…
#pragma omp parallel
do_something_else_to(&A);
Private copies of “A”
persist between
regions
Performance Issues
 Idle threads do no useful work
 Divide work among threads as evenly as possible
 Threads should finish parallel tasks at same time
 Synchronization may be necessary
 Minimize time waiting for protected resources
Load Imbalance
 Unequal work loads lead to idle threads and wasted time.
[Figure: per-thread timeline (busy vs. idle) for a work-sharing loop (#pragma omp parallel / #pragma omp for) whose iterations are unevenly distributed.]
#pragma omp parallel
{
  #pragma omp critical
  {
    ...
  }
  ...
}
Synchronization
 Lost time waiting for locks
[Figure: per-thread timeline showing time spent busy, idle, and in the critical section.]
Performance Tuning
 Profilers use sampling to provide performance
data.
 Traditional profilers are of limited use for tuning
OpenMP*:
 Measure CPU time, not wall clock time
 Do not report contention for synchronization objects
 Cannot report load imbalance
 Are unaware of OpenMP constructs
Programmers need profilers specifically designed for OpenMP.
Static Scheduling: Doing It By Hand
 Must know:
 Number of threads (Nthrds)
 Each thread ID number (id)
 Compute start and end iterations:
#pragma omp parallel
{
  int i, istart, iend;
  int id = omp_get_thread_num();
  int Nthrds = omp_get_num_threads();
  istart = id * N / Nthrds;
  iend = (id+1) * N / Nthrds;
  for (i = istart; i < iend; i++) {
    c[i] = a[i] + b[i];
  }
}
Summary
 Discussed Multi-Core Processors
 Discussed OpenMP programming.
Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

Ivy bridge vs Sandy bridge Micro-architecture.
Ivy bridge vs Sandy bridge Micro-architecture.Ivy bridge vs Sandy bridge Micro-architecture.
Ivy bridge vs Sandy bridge Micro-architecture.Sumit Khanka
 
AMD Processor
AMD ProcessorAMD Processor
AMD ProcessorAli Fahad
 
Multi core processors
Multi core processorsMulti core processors
Multi core processorsNipun Sharma
 
Fifty Year Of Microprocessor
Fifty Year Of MicroprocessorFifty Year Of Microprocessor
Fifty Year Of MicroprocessorAli Usman
 
COMPARATIVE ANALYSIS OF SINGLE-CORE AND MULTI-CORE SYSTEMS
COMPARATIVE ANALYSIS OF SINGLE-CORE AND MULTI-CORE SYSTEMSCOMPARATIVE ANALYSIS OF SINGLE-CORE AND MULTI-CORE SYSTEMS
COMPARATIVE ANALYSIS OF SINGLE-CORE AND MULTI-CORE SYSTEMSijcsit
 
Multicore processing
Multicore processingMulticore processing
Multicore processingguestc0be34a
 
AMD processors
AMD processorsAMD processors
AMD processorssanthu652
 
Multi_Core_Processor_2015_(Download it!)
Multi_Core_Processor_2015_(Download it!)Multi_Core_Processor_2015_(Download it!)
Multi_Core_Processor_2015_(Download it!)Sudip Roy
 
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...inside-BigData.com
 
Multicore processors and its advantages
Multicore processors and its advantagesMulticore processors and its advantages
Multicore processors and its advantagesNitesh Tudu
 
29092013042656 multicore-processor-technology
29092013042656 multicore-processor-technology29092013042656 multicore-processor-technology
29092013042656 multicore-processor-technologySindhu Nathan
 
Example : parallelize a simple problem
Example : parallelize a simple problemExample : parallelize a simple problem
Example : parallelize a simple problemMrMaKKaWi
 

Was ist angesagt? (20)

Processor types
Processor typesProcessor types
Processor types
 
Ivy bridge vs Sandy bridge Micro-architecture.
Ivy bridge vs Sandy bridge Micro-architecture.Ivy bridge vs Sandy bridge Micro-architecture.
Ivy bridge vs Sandy bridge Micro-architecture.
 
Dual-core processor
Dual-core processorDual-core processor
Dual-core processor
 
AMD Processor
AMD ProcessorAMD Processor
AMD Processor
 
Multi core processors
Multi core processorsMulti core processors
Multi core processors
 
Multicore Processor Technology
Multicore Processor TechnologyMulticore Processor Technology
Multicore Processor Technology
 
Fifty Year Of Microprocessor
Fifty Year Of MicroprocessorFifty Year Of Microprocessor
Fifty Year Of Microprocessor
 
COMPARATIVE ANALYSIS OF SINGLE-CORE AND MULTI-CORE SYSTEMS
COMPARATIVE ANALYSIS OF SINGLE-CORE AND MULTI-CORE SYSTEMSCOMPARATIVE ANALYSIS OF SINGLE-CORE AND MULTI-CORE SYSTEMS
COMPARATIVE ANALYSIS OF SINGLE-CORE AND MULTI-CORE SYSTEMS
 
Amd processor
Amd processorAmd processor
Amd processor
 
Multicore processing
Multicore processingMulticore processing
Multicore processing
 
Amd vs intel
Amd vs intelAmd vs intel
Amd vs intel
 
Multicore computers
Multicore computersMulticore computers
Multicore computers
 
Supercomputers
SupercomputersSupercomputers
Supercomputers
 
generations of computer
generations of computergenerations of computer
generations of computer
 
AMD processors
AMD processorsAMD processors
AMD processors
 
Multi_Core_Processor_2015_(Download it!)
Multi_Core_Processor_2015_(Download it!)Multi_Core_Processor_2015_(Download it!)
Multi_Core_Processor_2015_(Download it!)
 
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
 
Multicore processors and its advantages
Multicore processors and its advantagesMulticore processors and its advantages
Multicore processors and its advantages
 
29092013042656 multicore-processor-technology
29092013042656 multicore-processor-technology29092013042656 multicore-processor-technology
29092013042656 multicore-processor-technology
 
Example : parallelize a simple problem
Example : parallelize a simple problemExample : parallelize a simple problem
Example : parallelize a simple problem
 

Andere mochten auch

Advanced trends in microcontrollers by suhel
Advanced trends in microcontrollers by suhelAdvanced trends in microcontrollers by suhel
Advanced trends in microcontrollers by suhelSuhel Mulla
 
OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersDhanashree Prasad
 
Final draft intel core i5 processors architecture
Final draft intel core i5 processors architectureFinal draft intel core i5 processors architecture
Final draft intel core i5 processors architectureJawid Ahmad Baktash
 
Multicore Processsors
Multicore ProcesssorsMulticore Processsors
Multicore ProcesssorsAveen Meena
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecturePiyush Mittal
 
Under ground cables presention
Under ground cables presentionUnder ground cables presention
Under ground cables presentionRazu Khan
 
Intel I3,I5,I7 Processor
Intel I3,I5,I7 ProcessorIntel I3,I5,I7 Processor
Intel I3,I5,I7 Processorsagar solanky
 
Multi core processors
Multi core processorsMulti core processors
Multi core processorsAdithya Bhat
 

Andere mochten auch (11)

Wire Manufacturer In Pune - Elcab Conductors
 Wire Manufacturer In Pune - Elcab Conductors Wire Manufacturer In Pune - Elcab Conductors
Wire Manufacturer In Pune - Elcab Conductors
 
Advanced trends in microcontrollers by suhel
Advanced trends in microcontrollers by suhelAdvanced trends in microcontrollers by suhel
Advanced trends in microcontrollers by suhel
 
OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for Beginners
 
Multi core processor
Multi core processorMulti core processor
Multi core processor
 
Final draft intel core i5 processors architecture
Final draft intel core i5 processors architectureFinal draft intel core i5 processors architecture
Final draft intel core i5 processors architecture
 
Multicore Processsors
Multicore ProcesssorsMulticore Processsors
Multicore Processsors
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
 
Cable and laying
Cable and layingCable and laying
Cable and laying
 
Under ground cables presention
Under ground cables presentionUnder ground cables presention
Under ground cables presention
 
Intel I3,I5,I7 Processor
Intel I3,I5,I7 ProcessorIntel I3,I5,I7 Processor
Intel I3,I5,I7 Processor
 
Multi core processors
Multi core processorsMulti core processors
Multi core processors
 

Ähnlich wie Webinaron muticoreprocessors

Slot29-CH18-MultiCoreComputers-18-slides (1).pptx
Slot29-CH18-MultiCoreComputers-18-slides (1).pptxSlot29-CH18-MultiCoreComputers-18-slides (1).pptx
Slot29-CH18-MultiCoreComputers-18-slides (1).pptxvun24122002
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Slide_N
 
Chapter 4 Microprocessor CPU
Chapter 4 Microprocessor CPUChapter 4 Microprocessor CPU
Chapter 4 Microprocessor CPUaskme
 
End of a trend
End of a trendEnd of a trend
End of a trendmml2000
 
Bharath technical seminar.pptx
Bharath technical seminar.pptxBharath technical seminar.pptx
Bharath technical seminar.pptxMadhav Reddy
 
Trends in computer architecture
Trends in computer architectureTrends in computer architecture
Trends in computer architecturemuhammedsalihabbas
 
CA UNIT I PPT.ppt
CA UNIT I PPT.pptCA UNIT I PPT.ppt
CA UNIT I PPT.pptRAJESH S
 
Lecture1 - Computer Architecture
Lecture1 - Computer ArchitectureLecture1 - Computer Architecture
Lecture1 - Computer ArchitectureVolodymyr Ushenko
 
Intel 8th generation and 7th gen microprocessor full details especially for t...
Intel 8th generation and 7th gen microprocessor full details especially for t...Intel 8th generation and 7th gen microprocessor full details especially for t...
Intel 8th generation and 7th gen microprocessor full details especially for t...Chessin Chacko
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor DesignSri Prasanna
 
Computer Hardware & Software Lab Manual 3
Computer Hardware & Software Lab Manual 3Computer Hardware & Software Lab Manual 3
Computer Hardware & Software Lab Manual 3senayteklay
 
Modern processor art
Modern processor artModern processor art
Modern processor artwaqasjadoon11
 

Ähnlich wie Webinaron muticoreprocessors (20)

Module 1 unit 3
Module 1  unit 3Module 1  unit 3
Module 1 unit 3
 
Slot29-CH18-MultiCoreComputers-18-slides (1).pptx
Slot29-CH18-MultiCoreComputers-18-slides (1).pptxSlot29-CH18-MultiCoreComputers-18-slides (1).pptx
Slot29-CH18-MultiCoreComputers-18-slides (1).pptx
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
 
Chapter 4 Microprocessor CPU
Chapter 4 Microprocessor CPUChapter 4 Microprocessor CPU
Chapter 4 Microprocessor CPU
 
End of a trend
End of a trendEnd of a trend
End of a trend
 
Bharath technical seminar.pptx
Bharath technical seminar.pptxBharath technical seminar.pptx
Bharath technical seminar.pptx
 
594
594594
594
 
594
594594
594
 
594
594594
594
 
594
594594
594
 
Trends in computer architecture
Trends in computer architectureTrends in computer architecture
Trends in computer architecture
 
Nehalem
NehalemNehalem
Nehalem
 
CA UNIT I PPT.ppt
CA UNIT I PPT.pptCA UNIT I PPT.ppt
CA UNIT I PPT.ppt
 
Lecture1 - Computer Architecture
Lecture1 - Computer ArchitectureLecture1 - Computer Architecture
Lecture1 - Computer Architecture
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Intel 8th generation and 7th gen microprocessor full details especially for t...
Intel 8th generation and 7th gen microprocessor full details especially for t...Intel 8th generation and 7th gen microprocessor full details especially for t...
Intel 8th generation and 7th gen microprocessor full details especially for t...
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
 
Computer Hardware & Software Lab Manual 3
Computer Hardware & Software Lab Manual 3Computer Hardware & Software Lab Manual 3
Computer Hardware & Software Lab Manual 3
 
Modern processor art
Modern processor artModern processor art
Modern processor art
 
processor struct
processor structprocessor struct
processor struct
 

Mehr von Nagasuri Bala Venkateswarlu

Swift: A parallel scripting for applications at the petascale and beyond.
Swift: A parallel scripting for applications at the petascale and beyond.Swift: A parallel scripting for applications at the petascale and beyond.
Swift: A parallel scripting for applications at the petascale and beyond.Nagasuri Bala Venkateswarlu
 
Let us explore How To solve Technical Education in India.
Let us explore How To solve Technical Education in India.Let us explore How To solve Technical Education in India.
Let us explore How To solve Technical Education in India.Nagasuri Bala Venkateswarlu
 
Do We need to rejuvenate our self in Statistics to herald the 21st Century re...
Do We need to rejuvenate our self in Statistics to herald the 21st Century re...Do We need to rejuvenate our self in Statistics to herald the 21st Century re...
Do We need to rejuvenate our self in Statistics to herald the 21st Century re...Nagasuri Bala Venkateswarlu
 

Mehr von Nagasuri Bala Venkateswarlu (20)

Building mathematicalcraving
Building mathematicalcravingBuilding mathematicalcraving
Building mathematicalcraving
 
Nbvtalkonmoocs
NbvtalkonmoocsNbvtalkonmoocs
Nbvtalkonmoocs
 
Nbvtalkatbzaonencryptionpuzzles
NbvtalkatbzaonencryptionpuzzlesNbvtalkatbzaonencryptionpuzzles
Nbvtalkatbzaonencryptionpuzzles
 
Nbvtalkon what is engineering(Revised)
Nbvtalkon what is engineering(Revised)Nbvtalkon what is engineering(Revised)
Nbvtalkon what is engineering(Revised)
 
Swift: A parallel scripting for applications at the petascale and beyond.
Swift: A parallel scripting for applications at the petascale and beyond.Swift: A parallel scripting for applications at the petascale and beyond.
Swift: A parallel scripting for applications at the petascale and beyond.
 
Nbvtalkon what is engineering
Nbvtalkon what is engineeringNbvtalkon what is engineering
Nbvtalkon what is engineering
 
Fourth paradigm
Fourth paradigmFourth paradigm
Fourth paradigm
 
Let us explore How To solve Technical Education in India.
Let us explore How To solve Technical Education in India.Let us explore How To solve Technical Education in India.
Let us explore How To solve Technical Education in India.
 
Nbvtalkstaffmotivationataitam
NbvtalkstaffmotivationataitamNbvtalkstaffmotivationataitam
Nbvtalkstaffmotivationataitam
 
top 10 Data Mining Algorithms
top 10 Data Mining Algorithmstop 10 Data Mining Algorithms
top 10 Data Mining Algorithms
 
Dip
DipDip
Dip
 
Anits dip
Anits dipAnits dip
Anits dip
 
Bglrsession4
Bglrsession4Bglrsession4
Bglrsession4
 
Gmrit2
Gmrit2Gmrit2
Gmrit2
 
Introduction to socket programming nbv
Introduction to socket programming nbvIntroduction to socket programming nbv
Introduction to socket programming nbv
 
Clusteryanam
ClusteryanamClusteryanam
Clusteryanam
 
Nbvtalkataitamimageprocessingconf
NbvtalkataitamimageprocessingconfNbvtalkataitamimageprocessingconf
Nbvtalkataitamimageprocessingconf
 
Nbvtalkatjntuvizianagaram
NbvtalkatjntuvizianagaramNbvtalkatjntuvizianagaram
Nbvtalkatjntuvizianagaram
 
Nbvtalkonfeatureselection
NbvtalkonfeatureselectionNbvtalkonfeatureselection
Nbvtalkonfeatureselection
 
Do We need to rejuvenate our self in Statistics to herald the 21st Century re...
Do We need to rejuvenate our self in Statistics to herald the 21st Century re...Do We need to rejuvenate our self in Statistics to herald the 21st Century re...
Do We need to rejuvenate our self in Statistics to herald the 21st Century re...
 

Kürzlich hochgeladen

Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectssuserb6619e
 
Crystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxCrystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxachiever3003
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectErbil Polytechnic University
 
Correctly Loading Incremental Data at Scale

Webinar on multicore processors

  • 1. Introduction toIntroduction to Multi-CoreMulti-Core Architectures: AArchitectures: A Paradigm Shift in ITParadigm Shift in IT Dr NB Venkateswarlu RITCH CENTER, Visakhapatnam Prof in CSE, AITAM, TEKKALI
  • 2. IndexIndex  What is Computer Architecture  Which are Computer Components  Classification of Computers  The Dawn of Multi-Core (Stream) processors.  Symmetric Multi-Core (SMP)  Heterogeneous Multi-Core Processors  GPU (Graphic Processing Unit)  Intel Family of Multi-Core Processors  AMD Family of Multi-Core Processors  Others  Multi-Threading, Hyper-Threading, Multi-Core  OpenMP Programming Paradigm
  • 3. ConceptsConcepts What is Computer Architecture I/O systemInstr. Set Proc. Compiler Operating System Application Digital Design Circuit Design Instruction Set Architecture Firmware Datapath & Control Layout
  • 4. Evolvement of computer architecture scalar Pre- exe seque nce Multi- processing array Multi- computer multiprocesin g Ledgend processor I/E overlap Fun parallel Multi fun Pipe line Pseudo-vector Obvious vector m-m r-r SIMD MIMD Multi-Core
  • 5. Categories of computer system  Flynn’S Classification  SISD  SIMD  MISD  MIMD
  • 10. Amdahl’s LawAmdahl’s Law  states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
  • 11. Amdahl’s Law  With serial fraction S and parallel fraction P = 1 – S of the execution time, Speedup = 1 / (S + (1 – S)/N), where N = number of parallel processors. Example: S = 0.6, N = 10, Speedup = 1.56; S = 0.6, N = ∞, Speedup = 1.67. (A small worked sketch appears after the slide list.) Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
  • 12. CPUCPU The Central Processing Unit (CPU) Microprocessor (µp)  The brain of the computer  Its task is to execute instructions  It keeps on executing instructions from the moment it is powered-on to the moment it is powered-off.  Execution of various instructions causes the computer to perform various tasks.  There are a wide range of CPUs  The dominant CPUs on the market today are from  Intel: Pentium (Desktop), Xeon (Server), Pentium-M (Mobile), X- scale (Embedded)  AMD: Athlon (Desktop), Opteron (Server), Turion (Mobile), Geode (Embedded)
  • 13. Key Characteristics of aKey Characteristics of a CPUCPU Several metrics are used to describe the characteristics of a CPU  Native “Word” size  32-bit or 64-bit  Clock speeds  The clock speed of a CPU defines the rate at which the internal clock of the CPU operates  The clock sequences and determines the speed at which instructions are executed by the CPU  Modern CPUs operate in Gigahertz (1 GHz = 109 Hz)  However, clock speeds alone do not determine the overall ability of a CPU  Instruction set  More versatile CPUs support a richer and more efficient instruction sets  Modern CPUs have new instruction sets such as SSE, SSE2, 3DNow that can be used to boost performance of many common applications  These instructions perform the operation of several instructions but take less time  FLOPS: Floating Point Operations per Second  FLOPS are a better metric for measuring a CPU’s computational capabilities  Modern CPUs deliver 2-3 Giga FLOPS (1 GFLOPS = 109 FLOPS)  FLOPS is also used as a metric for describing computational capabilities of computer systems  Modern supercomputers deliver 1 Terra Flop (TFLOP), 1 TFLOP = 1012 FLOPS.
  • 14. Key Characteristics of a CPU (Contd.)Key Characteristics of a CPU (Contd.) Several metrics are used to describe the characteristics of a CPU  Power consumption  Power is an important factor in today’s computing  Power is the product of voltage applied to the CPU (V) and the amount of current drawn by the CPU (I)  The unit for voltage is volts  The unit for current is ampere (aka amps)  Power = V x I  Power is typically represented in Watts  Modern desktop processors consume anywhere from 35 to 100 watts  Lower power is better as power is proportional to heat  More power implies the processor generates more heat and heating is a big problem  Number of computational units of core per CPU  Some CPUs have multiple cores  Each core is an independent computational unit and can execute instructions in parallel  More the number of cores the better the CPU is for multi-threaded applications  Single threaded applications typically experience a slow down on multi-core processors due to reduced clock speeds
  • 15. Key Characteristics of a CPU (Contd.)Key Characteristics of a CPU (Contd.) Several metrics are used to describe the characteristics of a CPU  Cache size and configuration  Cache is a small, but high speed memory that is fabricated along with the CPU  Size of cache is inversely proportional to its speed  Cost of the CPU increases as size of cache increases  It is much faster than RAM  It is used to minimize the overall latency of accessing RAM  Microprocessors have a hierarchy of caches  L1 cache: Fastest and closest to the core components  L3 cache: Relatively slower and further away from CPU  Example cache configurations  Quad core AMD Opteron (Shanghai)  32 KB (Data) + 32 KB (Instr.) L1 cache per core  Unified 512 KB L2 cache per core  Unified 6 MB shared L3 cache (for 4 cores)  Quad core Intel Xeon (Nehalem)  32 KB (Data) + 32 KB (Instr.) L1 Cache per core  Unified 256 KB L2 cache per core  Unified 8 MB shared L3 cache (for 4 cores)
  • 16. Trends in computingTrends in computing  Until recently, hardware performance improvements have been primarily achieved due to advancement in microprocessor fabrication technologies:  Steady improvement in processor clock speeds  Faster clocks (with in the same family of processors) provide higher FLOPS  Increase in number of transistors on-chip  More complex and sophisticated hardware to improve performance  Larger caches to provide rapid access to instruction and data
  • 17. Moore’s Law  The steady advancement in microprocessor technology was predicted by Gordon Moore (in 1965), co-founder of Intel®  Moore’s law states that the number of transistors on microprocessors will double approximately every two years.  Many advancements in digital technologies can be linked to Moore’s law. These include:  Processing speed  Memory capacity  Speed and bandwidth of data communication networks and  Resolution of monitors and digital cameras  Thus far, Moore’s law has steadily held true for about 40 years (from 1965 to about 2005)  Breakthroughs in the miniaturization of transistors have been the key enabling technology
  • 18. Moore’s Law vs. Intel’sMoore’s Law vs. Intel’s roadmaproadmap  Here is a graph illustrating the progress of Moore’s law based on Intel® Inc. technological roadmap (obtained from Wikipedia)
  • 19. Stagnation of Moore’s Law  In the past few years we have reached the fundamental limits of IC fabrication technology (particularly lithography and interconnect)  It is no longer feasible to further miniaturize the transistors on the IC  They are already only a few atoms across, and at that scale the laws of physics change, making further shrinking extremely challenging  Heat dissipation has reached its breakdown threshold  Heat is generated as part of regular transistor operation  Higher heat dissipation will cause the transistors to fail  With the current state of the art a single processor cannot yield any more than 4 to 5 GFLOPS  How do we move beyond this barrier?
  • 20. SIA Roadmap for ProcessorsSIA Roadmap for Processors (1999)(1999) Year 1999 2002 2005 2008 2011 2014 Feature size (nm) 180 130 100 70 50 35 Logic transistors/cm2 6.2M 18M 39M 84M 180M 390M Clock (GHz) 1.25 2.1 3.5 6.0 10.0 16.9 Chip size (mm2 ) 340 430 520 620 750 900 Power supply (V) 1.8 1.5 1.2 0.9 0.6 0.5 High-perf. Power (W) 90 130 160 170 175 183 Source: http://www.semichips.org
  • 21. ISSCC, Feb. 2001, KeynoteISSCC, Feb. 2001, Keynote “Ten years from now, microprocessors will run at 10GHz to 30GHz and be capable of processing 1 trillion operations per second -- about the same number of calculations that the world's fastest supercomputer can perform now. “Unfortunately, if nothing changes these chips will produce as much heat, for their proportional size, as a nuclear reactor. . . .” Patrick P. Gelsinger Senior Vice President General Manager Digital Enterprise Group INTEL CORP.
  • 22. Intel View  Soon >billion transistors integrated  Clock frequency can still increase  Future applications will demand TIPS  Power? Heat?
  • 23. TEMPERATURE IMAGE OF THE CELL PROCESSOR
  • 24. Multi-core and Multi-Multi-core and Multi- processorsprocessors  The solution to increasing the effective compute power is via the use of multi-core, multi-processor computer system along with suitable software  This is a paradigm shift in hardware and software technologies  Multiple cores and multiple processors are interconnected using a variety of high speed data communication networks  Software plays a central role in harnessing the power of multi-core/multi-processor systems  Most industry leaders believe this is the near future of computing!
  • 25. Everything is to be GREENEverything is to be GREEN NOW!NOW!  Green Buildings  Green-IT
  • 26. Multi-core TrendsMulti-core Trends  Multi-core processors are most definitely the future of computing  Both Intel and AMD are pushing for larger number of cores per CPU package  The Cell Broadband Engine (aka Cell) has 8 synergistic processing elements (SPE)  The Sun Microsystems Niagara has 8 cores, with each core capable of running 8-threads.
  • 27. Sun NiagaraSun Niagara  32 way parallelism typically delivered 23x performance improvement Sun Niagara 0.00 0.20 0.40 0.60 0.80 1.00 1.20 Dense Protein FEM-Sphr FEM-Cant Tunnel FEM-Har QCD FEM-Ship Econom Epidem FEM-Accel Circuit W ebbase LP Median GFlop/s 8 Cores x 4 Threads[Naïve] 1 Core x 1 Thread[Naïve]
  • 28. A GENERIC MODERN PCA GENERIC MODERN PC
  • 29. SINGLE CORE VS MULTI-CORE (8): A COMPARISON FROM GEORGIA PACKING INSTITUTE
  • 31. AMD DUAL CORE PROCESSOR
  • 34. Hierarchy of Modular Building BlocksHierarchy of Modular Building Blocks High Speed Network Core Core Mem Ctrl I/O AttachCache Interconnect Fabric High Speed Network SMP Interconnect Memory Memory Chip Board Rack Homogenous SMP on Board • 2 – 128 HW contexts on board Hierarchical SMP servers with NUMA characteristics Grid/Cluster Core Homogenous SMP on chip • 2-32 HW contexts on chip • Various forms of resource sharing Core Core will support multiple HW threads sharing a single cache exhibiting SMP characteristics. Main Processor(s) with Accelerator(s) • Master-Slave relationship between entities Hierarchical SMP servers with non- uniform memory access characteristics Heterogenous collection of processors on chip • Heterogenity at data and control flow level • Systems will increasingly need to implement a hybrid execution model • New programming systems need to reduce the need for programmer awareness of the topology on which their program executes The next gen programming system must support programming simplicity while leveraging the performance of the underlying HW topology.
  • 35. Looming “Multicore Crisis”Looming “Multicore Crisis” Slide Source: Berkeley View of Landscape
  • 36. Architecture trendsArchitecture trends  Several processor cores on a chip and specialized computing engines  XML processing, cryptography, graphics  Questions:  how to interconnect large number of processor cores  how to provide sufficient memory bandwidth  how to structure the multilevel caching subsystem  how to balance the general purpose computing resources with specialized processing engines and all the supporting memory, caching and interconnect structure, given a constant power budget  Software development processes  how to program for multicore architectures  how to test and evaluate the performance of multithreaded applications
  • 38. Key Features: Dual CoreKey Features: Dual Core  Two physical cores in a package  Each with its own execution  resources  Each with its own L1 cache  32K instruction and 32K data  8-way set associative; 64-byte line  Both cores share the L2 cache  2MB 8-way set associative; 64-byte line size  10 clock cycles latency; Write Back update policy Two Actual Processor CoresTwo Actual Processor Cores EXE CoreEXE Core FP UnitFP Unit EXE CoreEXE Core FP UnitFP Unit L2 CacheL2 Cache L1 CacheL1 Cache L1 CacheL1 Cache System Bus (667MHz, 5333MB/s) System Bus (667MHz, 5333MB/s) Truly Parallel Multitasking & Multi-threadedTruly Parallel Multitasking & Multi-threaded ExecutionExecution Truly Parallel Multitasking & Multi-threadedTruly Parallel Multitasking & Multi-threaded ExecutionExecution
  • 39. Key Features: Smart CacheKey Features: Smart Cache  Shared between the two cores  Advanced Transfer Cache architecture  Reduced bus traffic  Both cores have full access to the entire cache  Dynamic Cache sizing BusBus 2 MB L2 Cache2 MB L2 Cache Core1Core1 Core2Core2 Enables Greater SystemEnables Greater System ResponsivenessResponsiveness
  • 41. Key Features:Key Features: Enhanced SpeedStepEnhanced SpeedStep TechnologyTechnology  Multi-point demand-based switching  Supports transitions to deeper sleep modes  Clock partitioning and recovery  Dynamic Bus Parking Breakthrough Performance and Improved Battery LifeBreakthrough Performance and Improved Battery Life
  • 42. Wide Dynamic Execution
  • 44. Smart Memory Access
  • 45. Key Features: Digital MediaKey Features: Digital Media BoostBoost Streaming SIMD Extensions (SSE) Decoder Throughput Improvement  High Performance ComputingHigh Performance Computing  Digital PhotographyDigital Photography  Digital MusicDigital Music  Video EditingVideo Editing  Internet Content CreationInternet Content Creation  3D & 2D Modeling3D & 2D Modeling  CAD ToolsCAD Tools SSE/SSE2 Instruction Optimization Floating Point Performance Enhancement New Enhanced Streaming SIMD Extensions 3 (SSE3) Providing True SIMD Integer & Floating PointProviding True SIMD Integer & Floating Point Performance!Performance!
  • 46. Key Features: Digital MediaKey Features: Digital Media Boost – SSE3 InstructionsBoost – SSE3 Instructions  SIMD FP using AOS format*  Thread Synchronization  Video encoding  Complex arithmetic  FP to integer conversions  HADDPD, HSUBPD  HADDPS, HSUBPS  MONITOR, MWAIT  LDDQU  ADDSUBPD, ADDSUBPS,  MOVDDUP, MOVSHDUP,  MOVSLDUP  FISTTP * Also benefits Complex Arithmetic and Vectorization
  • 47. Intel Xeon(Clovertown)Intel Xeon(Clovertown)  pseudo quad-core / socket  4-issue, out of order, super-scalar  Fully pumped SSE  2.33GHz  21GB/s, 74.6 GFlop/s 4MB Shared L2 Xeon FSB Fully Buffered DRAM 10.6GB/s Xeon Blackford 10.6GB/s 10.6 GB/s(write) 4MB Shared L2 Xeon Xeon 4MB Shared L2 Xeon FSB Xeon 4MB Shared L2 Xeon Xeon 21.3 GB/s(read)
  • 48. AMD OpteronAMD Opteron  Dual core / socket  3-issue, out of order, super-scalar  Half pumped SSE  2.2GHz  21GB/s, 17.6 GFlop/s  Strong NUMA issues 1MB victim Opteron 1MB victim Opteron Memory Controller / HT 1MB victim Opteron 1MB victim Opteron Memory Controller / HT DDR2 DRAM DDR2 DRAM 10.6GB/s 10.6GB/s 8GB/s
  • 49. IBM Cell BladeIBM Cell Blade  Eight SPEs / socket  Dual-issue, in-order, VLIW like  SIMD only ISA  Disjoint Local Store address space + DMA  Weak DP FPU  51.2GB/s, 29.2GFlop/s,  Strong NUMA issues XDR DRAM 25.6GB/s EIB(RingNetwork) <<20GB/s each direction SPE256K PPE512K L2 MFC BIF XDR SPE256KMFC SPE256KMFC SPE256KMFC SPE256KMFC SPE256KMFC SPE256KMFC SPE256KMFC XDR DRAM 25.6GB/s EIB(RingNetwork) SPE 256K PPE 512K L2 MFC BIF XDR SPE 256K MFC SPE 256K MFC SPE 256K MFC SPE 256K MFC SPE 256K MFC SPE 256K MFC SPE 256K MFC
  • 50. Sun NiagaraSun Niagara  Eight core  Each core is 4 way multithreaded  Single-issue in-order  Shared, very slow FPU  1.0GHz  25.6GB/s, 0.1GFlop/s, (8GIPS) CrossbarSwitch 25.6GB/s DDR2DRAM 3MBSharedL2(12way) 8K D$MT UltraSparc FPU 8K D$MT UltraSparc 8K D$MT UltraSparc 8K D$MT UltraSparc 8K D$MT UltraSparc 8K D$MT UltraSparc 8K D$MT UltraSparc 8K D$MT UltraSparc 64GB/s (fill) 32GB/s (writethru)
  • 51. Cell’s Nine-Processor ChipCell’s Nine-Processor Chip © IEEE Spectrum, January 2006 Eight Identical Processors f = 5.6GHz (max) 44.8 Gflops
  • 52. Symmetric Multi-Core Processors: Phenom X4
  • 53. Symmetric Multi-Core Processors: UltraSparc
  • 54. Future of multi-cores  Moore’s Law predicts that the number of cores will double every 18 - 24 months  2007 - 8 cores on a chip (IBM Cell has 9 and Sun’s Niagara has 8)  2009 - 16 cores  2013 - 64 cores  2015 - 128 cores  2021 - 1k cores Exploiting TLP has the potential to close Moore’s gap Real performance has the potential to scale with number of cores
  • 55. IBM Cell Example: Will this be theIBM Cell Example: Will this be the trend for large multi-core designs?trend for large multi-core designs? Cell Processor Components • Power Processor Element (PPE) • Element Interconnect Bus (EIB) • 8 – Synergistic Processing Element (SPE) • Rambus XDR Memory controller • Flex I/O FLEX I/O SPE SPESPE SPE SPE SPESPE SPE Rambus XDR PPE PowerPC 512K L2 Cache 4 x 128 bit Element Interconnect Bus FLEX I/O SPE SPESPE SPE SPE SPESPE SPE Rambus XDR PPE PowerPC 512K L2 Cache 4 x 128 bit Element Interconnect Bus
  • 56. Obvious silicon real-estate advantage for software memory managed processors: IBM, AMD, Intel, Cell BE
  • 58. Instruction-Level ParallelismInstruction-Level Parallelism (ILP)(ILP)  Extract parallelism in a single program  Superscalar processors have multiple execution units working in parallel  Challenge to find enough instructions that can be executed concurrently  Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order
  • 59. Performance Beyond ILPPerformance Beyond ILP  Much higher natural parallelism in some applications  Database, web servers, or scientific codes  Explicit Thread-Level Parallelism  Thread: has own instructions and data  May be part of a parallel program or independent programs  Each thread has all state (instructions, data, PC, register state, and so on) needed to execute  Multithreading: Thread-Level Parallelism within a processor
  • 60. Thread-Level ParallelismThread-Level Parallelism (TLP)(TLP)  ILP exploits implicit parallel operations within loop or straight-line code segment  TLP is explicitly represented by multiple threads of execution that are inherently parallel  Goal: Use multiple instruction streams to improve  Throughput of computers that run many programs  Execution time of multi-threaded programs  TLP could be more cost-effective to exploit than ILP
  • 61. Fine-Grained MultithreadingFine-Grained Multithreading  Switches between threads on each instruction, interleaving execution of multiple threads  Usually done round-robin, skipping stalled threads  CPU must be able to switch threads on every clock cycle  Pro: Hide latency of both short and long stalls  Instructions from other threads always available to execute  Easy to insert on short stalls  Con: Slow down execution of individual threads  Thread ready to execute without stalls will be delayed by instructions from other threads  Used on Sun’s Niagara
  • 62. Coarse-Grained Multithreading  Switches threads only on costly stalls  e.g., L2 cache misses  Pro: No switching each clock cycle  Relieves need to have very fast thread switching  No slow down for ready-to-go threads  Other threads only issue instructions when the main one would stall (for long time) anyway  Con: Limitation in hiding shorter stalls  Pipeline must be emptied or frozen on stall, since CPU issues instructions from only one thread  New thread must fill pipe before instructions can complete  Thus, better for reducing penalty of high-cost stalls where pipeline refill << stall time  Used in IBM AS/400
  • 63. MultithreadingMultithreading ParadigmsParadigms Thread 1 Unused ExecutionTime FU1 FU2 FU3 FU4 Conventional Superscalar Single Threaded Simultaneous Multithreading (SMT) or Intel’s HT Fine-grained Multithreading (cycle-by-cycle Interleaving) Thread 2 Thread 3 Thread 4 Thread 5 Coarse-grained Multithreading (Block Interleaving) Chip Multiprocessor (CMP) or called Multi-Core Processors today
  • 64. Simultaneous MultithreadingSimultaneous Multithreading (SMT)(SMT) Exploits TLP at the same time it exploits ILP  Intel’s HyperThreading (2-way SMT)  Others: IBM Power5 and Intel future multicore (8-core, 2-thread 45nm Nehalem)  Basic ideas: Conventional MT + Simultaneous issue + Sharing common resources RegReg FileFile FMultFMult (4 cyc(4 cyclesles)) FAddFAdd (2 cyc)(2 cyc) ALU1ALU1ALU2ALU2 Load/StoreLoad/Store (variable)(variable) Fdiv, unpipeFdiv, unpipe (16 cyc(16 cyclesles)) RS & ROBRS & ROB plusplus PhysicalPhysical RegisterRegister FileFile RS & ROBRS & ROB plusplus PhysicalPhysical RegisterRegister FileFile FetchFetch UnitUnit FetchFetch UnitUnit PCPCPCPCPCPCPCPCPCPCPCPCPCPCPCPC I-CacheI-CacheI-CacheI-Cache DecodeDecodeDecodeDecode RegisterRegister RRenameenamerr RegisterRegister RRenameenamerr RegReg FileFile RegReg FileFile RegReg FileFile RegReg FileFile RegReg FileFile RegReg FileFile Reg File DD-Cache-CacheDD-Cache-Cache
  • 65. Simultaneous MultithreadingSimultaneous Multithreading (SMT)(SMT)  Insight that dynamically scheduled processor already has many HW mechanisms to support multithreading  Large set of virtual registers that can be used to hold register sets for independent threads  Register renaming provides unique register identifiers  Instructions from multiple threads can be mixed in data path  Without confusing sources and destinations across threads!  Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW  Just add per-thread renaming table and keep separate PCs  Independent commitment can be supported via separate reorder buffer for each thread
  • 66. SMT PipelineSMT Pipeline Fetch Decode/ Map Queue Reg Read Execute Dcache/ Store Buffer Reg Write Retire Icache Dcache PC Register Map Regs Regs Data from Compaq
  • 67. Power 4Power 4 Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine; each can issue instruction each cycle.
  • 68. Power 4 Power 5 2 fetch (PC), 2 initial decodes 2 commits (architected register sets)
  • 69. Pentium 4 Hyperthreading:Pentium 4 Hyperthreading: Performance ImprovementsPerformance Improvements
  • 70. Questions To be Asked While Migrating to Multi-Core
  • 71. Application performanceApplication performance 1. Does the concurrency platform allow me to measure the parallelism I’ve exposed in my application? 2. Does the concurrency platform address response time bottlenecks,‐ or just offer more throughput? 3. Does application performance scale up linearly as cores are added, or does it quickly reach diminishing returns? 4. Is my multicore enabled code just as fast as my original serial‐ code when run on a single processor? 5. Does the concurrency platform's scheduler load balance irregular‐ applications efficiently to achieve full utilization? 6. Will my application “play nicely” with other jobs on the system, or do multiple jobs cause thrashing of resources? 7. What tools are available for detecting multicore performance bottlenecks?
  • 72. Software reliabilitySoftware reliability 8. How much harder is it to debug my multicore enabled‐ application than to debug my original application? 9. Can I use my standard, familiar debugging tools? 10. Are there effective debugging tools to identify and localize parallel programming‐ errors, such as data race bugs?‐ 11. Must I use a parallel debugger even if I make an ordinary serial programming error? 12. What changes must I make to my release engineering‐ processes to ensure that my delivered software is reliable? 13. Can I use my existing unit tests and regression tests?
  • 73. Development timeDevelopment time 14. To multicore enable my application, how much logical‐ restructuring of my application must I do? 15. Can I easily train programmers to use the concurrency platform? 16. Can I maintain just one code base, or must I maintain a serial and parallel versions? 17. Can I avoid rewriting my application every time a new processor generation increases the core count? 18. Can I easily multicore enable ill structured and irregular code, or‐ ‐ is the concurrency platform limited to data parallel applications?‐ 19. Does the concurrency platform properly support modern programming paradigms, such as objects, templates, and exceptions? 20. What does it take to handle global variables in my application?
  • 75. What Is OpenMP*?What Is OpenMP*?  Compiler directives for multithreaded programming  Easy to create threaded Fortran and C/C++ codes  Supports data parallelism model  Incremental parallelism  Combines serial and parallel code in single source
  • 76. OpenMP* ArchitectureOpenMP* Architecture  Fork-join model  Work-sharing constructs  Data environment constructs  Synchronization constructs  Extensive Application Program Interface (API) for finer control
  • 77. Programming ModelProgramming Model Fork-join parallelism: • Master thread spawns a team of threads as needed • Parallelism is added incrementally: the sequential program evolves into a parallel program Parallel Regions Master Thread
  • 78. OpenMP* Pragma SyntaxOpenMP* Pragma Syntax  Most constructs in OpenMP* are compiler directives or pragmas.  For C and C++, the pragmas take the form:  #pragma omp construct [clause [clause]…]
  • 79. Parallel Regions  Defines parallel region over structured block of code  Threads are created as ‘parallel’ pragma is crossed  Threads block at end of region  Data is shared among threads unless specified otherwise  C/C++ : #pragma omp parallel { block } spawns a team (e.g. Thread 1, Thread 2, Thread 3) that executes the block (a complete runnable sketch appears after the slide list)
  • 80. How Many Threads?How Many Threads?  Set environment variable for number of threads set OMP_NUM_THREADS=4  There is no standard default for this variable  Many systems:  # of threads = # of processors  Intel® compilers use this default
  • 81. Work-sharing ConstructWork-sharing Construct  Splits loop iterations into threads  Must be in the parallel region  Must precede the loop #pragma omp parallel #pragma omp for for (I=0;I<N;I++){ Do_Work(I); }
  • 82. Work-sharing ConstructWork-sharing Construct  Threads are assigned an independent set of iterations  Threads must wait at the end of work-sharing construct #pragma omp parallel #pragma omp for Implicit barrier i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 i = 7 i = 8 i = 9 i = 10 i = 11 i = 12 #pragma omp parallel #pragma omp for for(i = 1, i < 13, i++) c[i] = a[i] + b[i]
  • 83. Combining pragmasCombining pragmas  These two code segments are equivalent #pragma omp parallel { #pragma omp for for (i=0;i< MAX; i++) { res[i] = huge(); } } #pragma omp parallel for for (i=0;i< MAX; i++) { res[i] = huge(); }
  • 84. Data EnvironmentData Environment  OpenMP uses a shared-memory programming model  Most variables are shared by default.  Global variables are shared among threads  C/C++: File scope variables, static
  • 85. Data EnvironmentData Environment  But, not everything is shared...  Stack variables in functions called from parallel regions are PRIVATE  Automatic variables within a statement block are PRIVATE  Loop index variables are private (with exceptions)  C/C+: The first loop index variable in nested loops following a #pragma omp for
  • 86. Data Scope AttributesData Scope Attributes  The default status can be modified with  default (shared | none)  Scoping attribute clauses  shared(varname,…)  private(varname,…)
  • 87. The Private ClauseThe Private Clause  Reproduces the variable for each thread  Variables are un-initialized; C++ object is default constructed  Any value external to the parallel region is undefined void* work(float* c, int N) { float x, y; int i; #pragma omp parallel for private(x,y) for(i=0; i<N; i++) { x = a[i]; y = b[i]; c[i] = x + y; } }
  • 88. Protect Shared DataProtect Shared Data  Must protect access to shared, modifiable data float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for shared(sum) for(int i=0; i<N; i++) { #pragma omp critical sum += a[i] * b[i]; } return sum; }
  • 89.  #pragma omp critical [(lock_name)]  Defines a critical region on a structured block OpenMP* Critical ConstructOpenMP* Critical Construct float RES; #pragma omp parallel { float B; #pragma omp for for(int i=0; i<niters; i++){ B = big_job(i); #pragma omp critical (RES_lock) consum (B, RES); } } Threads wait their turn –at a time, only one calls consum() thereby protecting RES from race conditions Naming the critical construct RES_lock is optional
  • 90. OpenMP* Reduction ClauseOpenMP* Reduction Clause  reduction (op : list)  The variables in “list” must be shared in the enclosing parallel region  Inside parallel or work-sharing construct:  A PRIVATE copy of each list variable is created and initialized depending on the “op”  These copies are updated locally by threads  At end of construct, local copies are combined through “op” into a single value and combined with the value in the original SHARED variable
  • 91. Reduction Example  Local copy of sum for each thread  All local copies of sum added together and stored in “global” variable #pragma omp parallel for reduction(+:sum) for(i=0; i<N; i++) { sum += a[i] * b[i]; } (a self-contained dot-product sketch appears after the slide list)
  • 92.  A range of associative operands can be used with reduction  Initial values are the ones that make sense mathematically C/C++ Reduction OperationsC/C++ Reduction Operations Operand Initial Value + 0 * 1 - 0 ^ 0 Operand Initial Value & ~0 | 0 && 1 || 0
  • 93. Which Schedule to Use  Schedule clause / when to use: STATIC: predictable and similar work per iteration; DYNAMIC: unpredictable, highly variable work per iteration; GUIDED: special case of dynamic to reduce scheduling overhead (a usage sketch appears after the slide list)
  • 94. Parallel SectionsParallel Sections  Independent sections of code can execute concurrently Serial Parallel #pragma omp parallel sections { #pragma omp section phase1(); #pragma omp section phase2(); #pragma omp section phase3(); }
  • 95.  Denotes block of code to be executed by only one thread  First thread to arrive is chosen  Implicit barrier at end Single ConstructSingle Construct #pragma omp parallel { DoManyThings(); #pragma omp single { ExchangeBoundaries(); } // threads wait here for single DoManyMoreThings(); }
  • 96.  Denotes block of code to be executed only by the master thread  No implicit barrier at end Master ConstructMaster Construct #pragma omp parallel { DoManyThings(); #pragma omp master { // if not master skip to next stmt ExchangeBoundaries(); } DoManyMoreThings(); }
  • 97. Implicit Barriers  Several OpenMP* constructs have implicit barriers  parallel  for  single  Unnecessary barriers hurt performance  Waiting threads accomplish no work!  Suppress implicit barriers, when safe, with the nowait clause (a nowait sketch appears after the slide list)
  • 98. Barrier ConstructBarrier Construct  Explicit barrier synchronization  Each thread waits until all threads arrive #pragma omp parallel shared (A, B, C) { DoSomeWork(A,B); printf(“Processed A into Bn”); #pragma omp barrier DoSomeWork(B,C); printf(“Processed B into Cn”); }
  • 99. Atomic ConstructAtomic Construct  Special case of a critical section  Applies only to simple update of memory location #pragma omp parallel for shared(x, y, index, n) for (i = 0; i < n; i++) { #pragma omp atomic x[index[i]] += work1(i); y[i] += work2(i); }
  • 100. OpenMP* API  Get the thread number within a team  int omp_get_thread_num(void);  Get the number of threads in a team  int omp_get_num_threads(void);  Usually not needed for OpenMP codes  Can lead to code not being serially consistent  Does have specific uses (debugging)  Must include a header file #include <omp.h> (a usage sketch appears after the slide list)
  • 101. More OpenMP*More OpenMP*  Data environment constructs  FIRSTPRIVATE  LASTPRIVATE  THREADPRIVATE
  • 102. Firstprivate Clause  Variables initialized from shared variable  C++ objects are copy-constructed incr=0; #pragma omp parallel for firstprivate(incr) for (I=0;I<=MAX;I++) { if ((I%2)==0) incr++; A[I]=incr; }
  • 103. Lastprivate Clause  Variables update shared variable using value from last iteration  C++ objects are updated as if by assignment void sq2(int n, double *lastterm) { double x; int i; #pragma omp parallel #pragma omp for lastprivate(x) for (i = 0; i < n; i++){ x = a[i]*a[i] + b[i]*b[i]; b[i] = sqrt(x); } *lastterm = x; }
  • 104.  Preserves global scope for per-thread storage  Legal for name-space-scope and file-scope  Use copyin to initialize from master thread Threadprivate ClauseThreadprivate Clause struct Astruct A; #pragma omp threadprivate(A) … #pragma omp parallel copyin(A) do_something_to(&A); … #pragma omp parallel do_something_else_to(&A); Private copies of “A” persist between regions
  • 105. Performance IssuesPerformance Issues  Idle threads do no useful work  Divide work among threads as evenly as possible  Threads should finish parallel tasks at same time  Synchronization may be necessary  Minimize time waiting for protected resources
  • 106. Load ImbalanceLoad Imbalance  Unequal work loads lead to idle threads and wasted time. time Busy Idle #pragma omp parallel { #pragma omp for for( ; ; ){ } } time
  • 107. #pragma omp parallel { #pragma omp critical { ... } ... } SynchronizationSynchronization  Lost time waiting for locks time Busy Idle In Critical
  • 108. Performance TuningPerformance Tuning  Profilers use sampling to provide performance data.  Traditional profilers are of limited use for tuning OpenMP*:  Measure CPU time, not wall clock time  Do not report contention for synchronization objects  Cannot report load imbalance  Are unaware of OpenMP constructs Programmers need profilers specifically designed for OpenMP.
  • 109. Static Scheduling: Doing It By Hand  Must know:  Number of threads (Nthrds)  Each thread ID number (id)  Compute start and end iterations: #pragma omp parallel { int i, istart, iend; int id = omp_get_thread_num(); int Nthrds = omp_get_num_threads(); istart = id * N / Nthrds; iend = (id+1) * N / Nthrds; for(i=istart;i<iend;i++){ c[i] = a[i] + b[i];} }
  • 110. Summary  Discussed multi-core processors  Discussed OpenMP programming
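The Amdahl’s-law slide (slide 11) already works out two cases; the short C sketch below, which is not part of the original deck, simply reproduces those numbers so the formula can be experimented with.

    #include <stdio.h>

    /* Speedup = 1 / (S + (1 - S) / N), where S is the serial fraction
       and N is the number of parallel processors. */
    static double amdahl_speedup(double s, double n)
    {
        return 1.0 / (s + (1.0 - s) / n);
    }

    int main(void)
    {
        printf("S=0.6, N=10   -> %.2f\n", amdahl_speedup(0.6, 10.0));  /* 1.56 */
        printf("S=0.6, N->inf -> %.2f\n", amdahl_speedup(0.6, 1e9));   /* ~1.67 */
        return 0;
    }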
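To make the fork-join model of the parallel-regions slides (77 and 79) concrete, here is a minimal, self-contained sketch; the team size of 4 is an arbitrary choice, and the program would typically be built with an OpenMP-enabled compiler (for example, gcc -fopenmp).

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("serial part: master thread only\n");

        #pragma omp parallel num_threads(4)      /* fork: team of 4 threads */
        {
            /* every thread in the team executes this block */
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                        /* join: implicit barrier */

        printf("serial part again: back to one thread\n");
        return 0;
    }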
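The dot-product slide (88) protects the shared sum with a critical section inside the loop, which serializes every update; the reduction clause of slides 90-91 is the more scalable idiom. Below is a self-contained sketch combining the two slides, with an array size and contents chosen only so the result is easy to check.

    #include <stdio.h>
    #include <omp.h>
    #define N 1000

    int main(void)
    {
        float a[N], b[N];
        float sum = 0.0f;
        int i;

        for (i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        /* each thread gets a private copy of sum initialized to 0; the
           private copies are combined with '+' when the loop finishes */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %f\n", sum);       /* expected: 2000.0 */
        return 0;
    }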
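The schedule table (slide 93) names the policies but not the syntax. The sketch below shows how the clause is typically attached to a work-sharing loop; irregular_work() is a made-up stand-in for an iteration whose cost varies strongly, and the chunk size of 4 is arbitrary.

    #include <math.h>

    /* hypothetical workload whose cost grows with i */
    static double irregular_work(int i)
    {
        double x = 0.0;
        for (int k = 0; k < i; k++) x += sin((double)k);
        return x;
    }

    void run_loops(double *out, const double *a, const double *b,
                   double *c, int n)
    {
        int i;

        /* highly variable work per iteration: dynamic scheduling hands
           chunks of 4 iterations to whichever thread becomes idle first */
        #pragma omp parallel for schedule(dynamic, 4)
        for (i = 0; i < n; i++)
            out[i] = irregular_work(i);

        /* uniform work per iteration: static scheduling splits the range
           into equal contiguous blocks with no run-time overhead */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }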
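Slide 97 suggests suppressing unnecessary implicit barriers with nowait but shows no code. A small sketch, under the assumption that the two loops touch independent arrays so that skipping the barrier between them is safe:

    void independent_loops(const float *a, float *b,
                           const float *c, float *d, int n)
    {
        int i;
        #pragma omp parallel
        {
            /* nowait removes the implicit barrier at the end of this loop,
               so threads that finish early can start the next loop at once */
            #pragma omp for nowait
            for (i = 0; i < n; i++)
                b[i] = 2.0f * a[i];

            #pragma omp for
            for (i = 0; i < n; i++)
                d[i] = c[i] + 1.0f;
        }   /* the parallel region itself still ends with a barrier */
    }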
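Finally, a sketch tying together the runtime-API routines of slide 100 with the thread-count discussion of slide 80; omp_set_num_threads() and the two query routines are standard OpenMP calls, while the count of 4 is only an example.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(4);    /* overrides OMP_NUM_THREADS for this run */

        /* outside a parallel region the team has exactly one thread */
        printf("serial region: %d thread(s)\n", omp_get_num_threads());

        #pragma omp parallel
        {
            #pragma omp single
            printf("parallel region: team of %d threads\n",
                   omp_get_num_threads());

            printf("  thread %d checking in\n", omp_get_thread_num());
        }
        return 0;
    }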

Editor's Notes

  1. SMP: symmetric multi-processor NUMA: Non Uniform Memory Access
  2. Today we have examples of 8-core devices and have seen roadmaps indicating that by 2008 or '09 16-core devices will exist. Intel last year demoed technology that it says will lead to an 80-core device in 5 years or so. It is probably a good bet that by 2020 we will have devices with 1K cores. So what is the problem with moving to TLP architectures? Obviously the future of processing performance growth looks bright using this approach. Highly sophisticated runtime instruction-level parallelism is no longer required, so there is no longer any need to scrape the bottom of the barrel to get a few % better performance. However, research in key areas is still needed: methods to reduce leakage currents as feature sizes continue to shrink, power management techniques, maybe asynchronous design. Off-chip memory and I/O bandwidth have become even more of a bottleneck and improvements in this area are still needed; silicon lasers for free-air optical interfaces are one example. But these aren’t the big problems ahead for us.
  3. The IBM Cell processor used in the PlayStation 3 has been shown to offer an order-of-magnitude leap over many comparable technologies. This is due to some radical changes in micro-architecture: application-managed memory in the SPEs (no hardware cache); explicit data movement via DMA; SPEs are simple SIMD machines with no dedicated scalar pipeline and no sophisticated branch prediction logic (mainly done with compile-time branch hints); very high bandwidth interconnect; and very high chip I/O bandwidth (OK on memory bandwidth). Heterogeneous processor core architecture (8 SPEs and 1 PPE, a PowerPC machine).
  4. OpenMP* Pragma Syntax For Fortran, you can also use other sentinels (!$OMP), but these must line up exactly on columns 1-5. Column 6 must be blank or contain a + indicating that this line is a continuation from the previous line. C$OMP *$OMP !$OMP
  5. Parallel Regions Other Fortran Methods.
  6. Fortran will make all nested loop indices private.
  7. Data Scope Attributes All data clauses apply to parallel regions and worksharing constructs except “shared,” which only applies to parallel regions.
  8. Private Clause For-loop iteration variable is PRIVATE by default.
  9. Sections are distributed among the threads in the parallel team. Each section is executed only once and each thread may execute zero or more sections. It’s not possible to determine whether or not a section will be executed before another. Therefore, the output of one section should not serve as the input to another. Instead, the section that generates output should be moved before the sections construct.
  10. Atomic Construct Since index[i] can be the same for different I values, the update to x must be protected. Use of a critical section would serialize updates to x. Atomic protects individual elements of x array, so that if multiple, concurrent instances of index[i] are different, updates can still be done in parallel.
  11. Load Imbalance Relieve load imbalance Static even scheduling Equal size iteration chunks Based on runtime loop limits Totally parallel scheduling OpenMP* default Dynamic and guided scheduling The threads do some work then get the next chunk. There is some cost in scheduling algorithm.
  12. Synchronization Methods for tuning synchronization: Reduce lock contention Different named critical sections Introduce domain-specific locks Merge parallel loops and remove barriers Merge small critical sections Move critical sections outside loops
  13. Performance Tuning Mention time inside synch constructs, and so on.