This deck introduces developments in multi-core computers and their architectures, explains high-performance computing, where these processors are used, and ends with an introduction to OpenMP, including many ready-to-run parallel programs.
2. Index
What is Computer Architecture
What Are Computer Components
Classification of Computers
The Dawn of Multi-Core (Stream) processors.
Symmetric Multi-Core (SMP)
Heterogeneous Multi-Core Processors
GPU (Graphic Processing Unit)
Intel Family of Multi-Core Processors
AMD Family of Multi-Core Processors
Others
Multi-Threading, Hyper-Threading, Multi-Core
OpenMP Programming Paradigm
3. Concepts
What is Computer Architecture
[Layered diagram: Application, Operating System, Compiler, and Firmware sit above the Instruction Set Architecture; below it lie the Instruction Set Processor and I/O system, Datapath & Control, Digital Design, Circuit Design, and Layout.]
4. Evolution of Computer Architecture
[Diagram: evolution from scalar processors (pre-execution sequencing) through I/E overlap, functional parallelism, multiple functional units, and pipelining, to pseudo-vector and obvious-vector machines (memory-memory and register-register), multiprocessing, multi-computers, array processors, SIMD, MIMD, and finally multi-core.]
10. Amdahl’s Law
states that the performance improvement to be gained from
using some faster mode of execution is limited by the fraction
of the time the faster mode can be used.
11. Amdahl’s Law
With serial fraction S and parallel fraction P = 1 – S:
Speedup = 1 / (S + (1 – S)/N)
Where N = number of parallel processors
Example: S = 0.6, N = 10, Speedup = 1.56
S = 0.6, N = ∞, Speedup = 1.67
Gene Amdahl, “Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities,” AFIPS Conference Proceedings,
(30), pp. 483-485, 1967.
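To make the arithmetic above concrete, here is a small sketch in C (the function name amdahl_speedup is mine, not from the slides):

#include <stdio.h>

/* Speedup = 1 / (S + (1 - S)/N), with serial fraction S
   and N parallel processors. */
double amdahl_speedup(double S, double N) {
    return 1.0 / (S + (1.0 - S) / N);
}

int main(void) {
    printf("S = 0.6, N = 10:   %.2f\n", amdahl_speedup(0.6, 10.0));  /* 1.56 */
    printf("S = 0.6, N -> inf: %.2f\n", amdahl_speedup(0.6, 1e12)); /* 1.67 */
    return 0;
}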
12. CPU
The Central Processing Unit (CPU), also known as the microprocessor (µP)
The brain of the computer
Its task is to execute instructions
It keeps on executing instructions from the moment it is
powered-on to the moment it is powered-off.
Execution of various instructions causes the computer to
perform various tasks.
There is a wide range of CPUs
The dominant CPUs on the market today are from
Intel: Pentium (Desktop), Xeon (Server), Pentium-M (Mobile), XScale (Embedded)
AMD: Athlon (Desktop), Opteron (Server), Turion (Mobile), Geode (Embedded)
13. Key Characteristics of a CPU
Several metrics are used to describe the characteristics of a CPU
Native “Word” size
32-bit or 64-bit
Clock speeds
The clock speed of a CPU defines the rate at which the internal clock of the CPU operates
The clock sequences operations and determines the speed at which the CPU executes instructions
Modern CPUs operate in gigahertz (1 GHz = 10^9 Hz)
However, clock speeds alone do not determine the overall ability of a CPU
Instruction set
More versatile CPUs support a richer and more efficient instruction set
Modern CPUs have new instruction sets such as SSE, SSE2, and 3DNow that can be used to boost performance of many common applications
These instructions perform the operation of several instructions but take less time
FLOPS: Floating Point Operations per Second
FLOPS are a better metric for measuring a CPU’s computational capabilities
Modern CPUs deliver 2-3 GFLOPS (1 GFLOPS = 10^9 FLOPS)
FLOPS is also used as a metric for describing the computational capabilities of complete computer systems
Modern supercomputers deliver 1 teraflop (TFLOPS), 1 TFLOPS = 10^12 FLOPS.
14. Key Characteristics of a CPU (Contd.)
Several metrics are used to describe the characteristics of a CPU
Power consumption
Power is an important factor in today’s computing
Power is the product of voltage applied to the CPU (V) and the amount of current
drawn by the CPU (I)
The unit for voltage is volts
The unit for current is ampere (aka amps)
Power = V x I
Power is typically represented in Watts
Modern desktop processors consume anywhere from 35 to 100 watts
Lower power is better as power is proportional to heat
More power implies the processor generates more heat and heating is a big problem
Number of computational units (cores) per CPU
Some CPUs have multiple cores
Each core is an independent computational unit and can execute instructions in
parallel
The more cores a CPU has, the better it is for multi-threaded applications
Single-threaded applications typically experience a slowdown on multi-core processors due to reduced clock speeds
15. Key Characteristics of a CPU (Contd.)
Several metrics are used to describe the characteristics of a CPU
Cache size and configuration
Cache is a small, but high speed memory that is fabricated along with the CPU
Size of cache is inversely proportional to its speed
Cost of the CPU increases as size of cache increases
It is much faster than RAM
It is used to minimize the overall latency of accessing RAM
Microprocessors have a hierarchy of caches
L1 cache: Fastest and closest to the core components
L3 cache: Relatively slower and further away from CPU
Example cache configurations
Quad core AMD Opteron (Shanghai)
32 KB (Data) + 32 KB (Instr.) L1 cache per core
Unified 512 KB L2 cache per core
Unified 6 MB shared L3 cache (for 4 cores)
Quad core Intel Xeon (Nehalem)
32 KB (Data) + 32 KB (Instr.) L1 Cache per core
Unified 256 KB L2 cache per core
Unified 8 MB shared L3 cache (for 4 cores)
16. Trends in computing
Until recently, hardware performance
improvements have been primarily achieved due
to advancement in microprocessor fabrication
technologies:
Steady improvement in processor clock speeds
Faster clocks (within the same family of processors)
provide higher FLOPS
Increase in number of transistors on-chip
More complex and sophisticated hardware to improve
performance
Larger caches to provide rapid access to instruction and
data
17. Moore’s Law
The steady advancement in microprocessor technology was
predicted by Gordon Moore (in 1965), co-founder of Intel®
Moore’s law states that the number of transistors on
microprocessors will double approximately every two years.
Many advancements in digital technologies can be linked to Moore’s
law. This includes:
Processing speed
Memory Capacity
Speed and bandwidth of data communication networks and
Resolution of monitors and digital cameras
Thus far, Moore’s law has steadily held true for about 40 years
(from 1965 to about 2005)
Breakthroughs in the miniaturization of transistors have been the key enabling technology
18. Moore’s Law vs. Intel’s roadmap
Here is a graph illustrating the progress of Moore’s law based on Intel’s technology roadmap (obtained from Wikipedia)
19. Stagnation of Moore’s Law
In the past few years we have reached the fundamental limits of IC fabrication technology (particularly lithography and interconnect)
It is no longer feasible to further miniaturize the transistors on the IC
Features are already only several atoms across, and at that scale the laws of physics change, making further shrinkage extremely challenging
Heat dissipation has reached its breakdown threshold
Heat is generated as a part of regular transistor operation
Higher heat dissipation will cause the transistors to fail
With the current state of the art, a single processor cannot yield much more than 4 to 5 GFLOPS
How do we move beyond this barrier?
21. ISSCC, Feb. 2001, Keynote
“Ten years from now,
microprocessors will run at
10GHz to 30GHz and be capable
of processing 1 trillion operations
per second -- about the same
number of calculations that the
world's fastest supercomputer
can perform now.
“Unfortunately, if nothing
changes these chips will produce
as much heat, for their
proportional size, as a nuclear
reactor. . . .”
Patrick P. Gelsinger
Senior Vice President
General Manager
Digital Enterprise Group
INTEL CORP.
22. Intel View
Soon: more than a billion transistors integrated
Clock frequency can still increase
Future applications will demand TIPS (trillions of instructions per second)
Power? Heat?
24. Multi-core and Multi-processors
The solution to increasing effective compute power is the use of multi-core, multi-processor computer systems along with suitable software
This is a paradigm shift in hardware and software
technologies
Multiple cores and multiple processors are
interconnected using a variety of high speed data
communication networks
Software plays a central role in harnessing the power of
multi-core/multi-processor systems
Most industry leaders believe this is the near future of
computing!
25. Everything is to be GREEN NOW!
Green Buildings
Green-IT
26. Multi-core Trends
Multi-core processors are most definitely the future of computing
Both Intel and AMD are pushing for larger number of cores per CPU
package
The Cell Broadband Engine (aka Cell) has 8 synergistic processing
elements (SPE)
The Sun Microsystems Niagara has 8 cores, with each core capable of
running 8 threads.
27. Sun Niagara
32-way parallelism typically delivered a 23x performance improvement
Sun Niagara
[Chart: GFlop/s across benchmarks (Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Har, QCD, FEM-Ship, Econom, Epidem, FEM-Accel, Circuit, Webbase, LP, Median), y-axis 0.00 to 1.20, comparing 8 Cores x 4 Threads (naïve) against 1 Core x 1 Thread (naïve).]
29. SINGLE CORE VS MULTI-CORE (8): A COMPARISON FROM GEORGIA PACKING INSTITUTE
34. Hierarchy of Modular Building Blocks
[Diagram: hierarchy of modular building blocks; a chip contains cores with cache, memory controller, and I/O attach on an interconnect fabric; chips form an SMP board with memory via the SMP interconnect; boards form racks joined by high-speed networks.]
Core: each core will support multiple HW threads sharing a single cache, exhibiting SMP characteristics
Chip: homogeneous SMP on chip (2-32 HW contexts on chip, various forms of resource sharing); or a heterogeneous collection of processors on chip (heterogeneity at the data and control flow level); or main processor(s) with accelerator(s) (master-slave relationship between entities)
Board: homogeneous SMP on board (2-128 HW contexts on board); hierarchical SMP servers with NUMA (non-uniform memory access) characteristics
Rack: grid/cluster
Systems will increasingly need to implement a hybrid execution model
New programming systems need to reduce the need for programmer awareness of the topology on which their program executes
The next-gen programming system must support programming simplicity while leveraging the performance of the underlying HW topology.
36. Architecture trends
Several processor cores on a chip and specialized computing
engines
XML processing, cryptography, graphics
Questions:
how to interconnect large number of processor cores
how to provide sufficient memory bandwidth
how to structure the multilevel caching subsystem
how to balance the general purpose computing resources with
specialized processing engines and all the supporting memory,
caching and interconnect structure, given a constant power
budget
Software development processes
how to program for multicore architectures
how to test and evaluate the performance of multithreaded
applications
38. Key Features: Dual Core
Two physical cores in a package
Each with its own execution
resources
Each with its own L1 cache
32K instruction and 32K data
8-way set associative; 64-byte line
Both cores share the L2 cache
2MB 8-way set associative; 64-byte line size
10 clock cycles latency; Write Back update policy
Two Actual Processor Cores
[Diagram: each core has its own EXE core, FP unit, and L1 cache; both share the L2 cache and the system bus (667 MHz, 5333 MB/s). Truly parallel multitasking and multi-threaded execution.]
39. Key Features: Smart Cache
Shared between the two cores
Advanced Transfer Cache architecture
Reduced bus traffic
Both cores have full access to the entire cache
Dynamic Cache sizing
[Diagram: Core1 and Core2 share the 2 MB L2 cache, which connects to the bus. Enables greater system responsiveness.]
52. Symmetric Multi-Core Processors
Phenom X4
53. Symmetric Multi-Core Processors
UltraSPARC
54. Future of multi-cores
Moore’s Law predicts that the number of cores
will double every 18 - 24 months
2007 - 8 cores on a chip
(IBM Cell has 9 and Sun has 8)
2009 - 16 cores
2013 - 64 cores
2015 - 128 cores
2021 - 1k cores
Exploiting TLP has the potential to close Moore’s gap
Real performance has the potential to scale with number of
cores
55. IBM Cell Example: Will this be the trend for large multi-core designs?
Cell Processor Components
• Power Processor Element (PPE)
• Element Interconnect Bus (EIB)
• 8 Synergistic Processing Elements (SPE)
• Rambus XDR memory controller
• Flex I/O
[Diagram: the PPE (PowerPC core with 512K L2 cache) and eight SPEs connected by the 4 x 128-bit Element Interconnect Bus, with Rambus XDR memory and Flex I/O interfaces.]
56. Obvious silicon real-estate advantage for software-memory-managed processors
[Die-photo comparison: IBM, AMD, and Intel processors alongside the Cell BE.]
58. Instruction-Level Parallelism (ILP)
Extract parallelism in a single program
Superscalar processors have multiple execution
units working in parallel
Challenge to find enough instructions that can be
executed concurrently
Out-of-order execution => instructions are sent
to execution units based on instruction
dependencies rather than program order
59. Performance Beyond ILP
Much higher natural parallelism in some applications
Database, web servers, or scientific codes
Explicit Thread-Level Parallelism
Thread: has own instructions and data
May be part of a parallel program or independent programs
Each thread has all state (instructions, data, PC, register state,
and so on) needed to execute
Multithreading: Thread-Level Parallelism within a processor
60. Thread-Level Parallelism (TLP)
ILP exploits implicit parallel operations within loop or
straight-line code segment
TLP is explicitly represented by multiple threads of
execution that are inherently parallel
Goal: Use multiple instruction streams to improve
Throughput of computers that run many programs
Execution time of multi-threaded programs
TLP could be more cost-effective to exploit than ILP
61. Fine-Grained Multithreading
Switches between threads on each instruction, interleaving
execution of multiple threads
Usually done round-robin, skipping stalled threads
CPU must be able to switch threads on every clock cycle
Pro: Hide latency of both short and long stalls
Instructions from other threads always available to execute
Easy to insert on short stalls
Con: Slow down execution of individual threads
Thread ready to execute without stalls will be delayed by instructions
from other threads
Used on Sun’s Niagara
62. Coarse-Grained Multithreading
Switches threads only on costly stalls
e.g., L2 cache misses
Pro: No switching each clock cycle
Relieves need to have very fast thread switching
No slow down for ready-to-go threads
Other threads only issue instructions when the main one would stall (for
long time) anyway
Con: Limitation in hiding shorter stalls
Pipeline must be emptied or frozen on stall, since CPU issues
instructions from only one thread
New thread must fill pipe before instructions can complete
Thus, better for reducing penalty of high-cost stalls where pipeline
refill << stall time
Used in IBM AS/400
65. Simultaneous Multithreading (SMT)
Insight that dynamically scheduled processor already has many
HW mechanisms to support multithreading
Large set of virtual registers that can be used to hold register sets for
independent threads
Register renaming provides unique register identifiers
Instructions from multiple threads can be mixed in data path
Without confusing sources and destinations across threads!
Out-of-order completion allows the threads to execute out of order,
and get better utilization of the HW
Just add per-thread renaming table and keep separate PCs
Independent commitment can be supported via separate reorder
buffer for each thread
66. SMT Pipeline
[Pipeline diagram: Fetch -> Decode/Map -> Queue -> Reg Read -> Execute -> Dcache/Store Buffer -> Reg Write -> Retire, fed by the Icache and PCs, with a register map and register files, and a Dcache on the memory side. Data from Compaq.]
67. Power 4
Single-threaded predecessor to Power 5. Eight execution units in the out-of-order engine; each can issue an instruction each cycle.
68. Power 4 vs. Power 5
[Diagram: Power 5 adds 2 fetch units (PCs), 2 initial decodes, and 2 commits (architected register sets).]
70. Questions To Be Asked While Migrating to Multi-Core
71. Application performance
1. Does the concurrency platform allow me to measure the parallelism I’ve exposed in my application?
2. Does the concurrency platform address response-time bottlenecks, or just offer more throughput?
3. Does application performance scale up linearly as cores are added, or does it quickly reach diminishing returns?
4. Is my multicore-enabled code just as fast as my original serial code when run on a single processor?
5. Does the concurrency platform's scheduler load-balance irregular applications efficiently to achieve full utilization?
6. Will my application “play nicely” with other jobs on the system, or do multiple jobs cause thrashing of resources?
7. What tools are available for detecting multicore performance bottlenecks?
72. Software reliability
8. How much harder is it to debug my multicore-enabled application than to debug my original application?
9. Can I use my standard, familiar debugging tools?
10. Are there effective debugging tools to identify and localize parallel-programming errors, such as data-race bugs?
11. Must I use a parallel debugger even if I make an ordinary serial programming error?
12. What changes must I make to my release-engineering processes to ensure that my delivered software is reliable?
13. Can I use my existing unit tests and regression tests?
73. Development time
14. To multicore-enable my application, how much logical restructuring of my application must I do?
15. Can I easily train programmers to use the concurrency platform?
16. Can I maintain just one code base, or must I maintain separate serial and parallel versions?
17. Can I avoid rewriting my application every time a new processor generation increases the core count?
18. Can I easily multicore-enable ill-structured and irregular code, or is the concurrency platform limited to data-parallel applications?
19. Does the concurrency platform properly support modern programming paradigms, such as objects, templates, and exceptions?
20. What does it take to handle global variables in my application?
75. What Is OpenMP*?
Compiler directives for multithreaded
programming
Easy to create threaded Fortran and C/C++
codes
Supports data parallelism model
Incremental parallelism
Combines serial and parallel code in single source
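As a minimal sketch of this incremental style (array contents and size are illustrative; compile with an OpenMP flag such as gcc's -fopenmp), the serial loop below becomes parallel with a single pragma:

#include <stdio.h>
#define N 1000

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    #pragma omp parallel for   /* the only change versus the serial version */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %g\n", N - 1, c[N - 1]);  /* 999 + 1998 = 2997 */
    return 0;
}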
76. OpenMP* Architecture
Fork-join model
Work-sharing constructs
Data environment constructs
Synchronization constructs
Extensive Application Program Interface (API) for
finer control
77. Programming Model
Fork-join parallelism:
• Master thread spawns a team of threads as needed
• Parallelism is added incrementally: the sequential
program evolves into a parallel program
[Diagram: a master thread forks teams of threads at each parallel region and joins them afterward.]
78. OpenMP* Pragma Syntax
Most constructs in OpenMP* are compiler
directives or pragmas.
For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]…]
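For example, a hypothetical loop instantiating this form with two clauses:

#include <stdio.h>

int main(void) {
    float x, sum = 0.0f;
    /* construct = "parallel for"; clauses = private(x) and reduction(+:sum) */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (int i = 0; i < 100; i++) {
        x = 0.5f * i;
        sum += x;
    }
    printf("sum = %g\n", sum);  /* 0.5 * (0 + 1 + ... + 99) = 2475 */
    return 0;
}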
79. Parallel Regions
Defines a parallel region over a structured block of code
Threads are created as the ‘parallel’ pragma is crossed
Threads block at the end of the region
Data is shared among threads unless specified otherwise
[Diagram: #pragma omp parallel forking Thread 1, Thread 2, and Thread 3.]
C/C++:
#pragma omp parallel
{
    block
}
80. How Many Threads?
Set environment variable for number of threads
set OMP_NUM_THREADS=4
There is no standard default for this variable
Many systems:
# of threads = # of processors
Intel® compilers use this default
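A small sketch of querying and overriding the team size at runtime (omp_set_num_threads overrides the environment variable):

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* reflects OMP_NUM_THREADS if set, else the implementation default */
    printf("max threads: %d\n", omp_get_max_threads());

    omp_set_num_threads(4);       /* runtime override of the environment */
    #pragma omp parallel
    {
        #pragma omp single
        printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}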
81. Work-sharing Construct
Splits loop iterations into threads
Must be in the parallel region
Must precede the loop
#pragma omp parallel
#pragma omp for
for (i = 0; i < N; i++) {
    Do_Work(i);
}
82. Work-sharing Construct
Threads are assigned an
independent set of
iterations
Threads must wait at the
end of work-sharing
construct
[Diagram: iterations i = 1 through i = 12 distributed among threads under #pragma omp parallel / #pragma omp for, with an implicit barrier at the end.]
#pragma omp parallel
#pragma omp for
for (i = 1; i < 13; i++)
    c[i] = a[i] + b[i];
83. Combining pragmas
These two code segments are equivalent
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < MAX; i++) {
        res[i] = huge();
    }
}

#pragma omp parallel for
for (i = 0; i < MAX; i++) {
    res[i] = huge();
}
84. Data Environment
OpenMP uses a shared-memory programming
model
Most variables are shared by default.
Global variables are shared among threads
C/C++: File scope variables, static
85. Data Environment
But, not everything is shared...
Stack variables in functions called from parallel regions
are PRIVATE
Automatic variables within a statement block are
PRIVATE
Loop index variables are private (with exceptions)
C/C++: the first loop index variable in nested loops following a #pragma omp for
86. Data Scope Attributes
The default status can be modified with
default (shared | none)
Scoping attribute clauses
shared(varname,…)
private(varname,…)
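A minimal sketch combining these clauses with default(none), which forces every variable in the region to be scoped explicitly (variable names are illustrative):

#include <stdio.h>
#define N 100

int main(void) {
    float a[N];
    int i;
    /* with default(none), any unlisted variable is a compile error */
    #pragma omp parallel for default(none) shared(a) private(i)
    for (i = 0; i < N; i++)
        a[i] = 2.0f * i;
    printf("a[10] = %g\n", a[10]);
    return 0;
}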
87. The Private Clause
Reproduces the variable for each thread
Variables are un-initialized; C++ object is default
constructed
Any value external to the parallel region is undefined
void work(float* a, float* b, float* c, int N) {
    float x, y; int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++) {
        x = a[i]; y = b[i];
        c[i] = x + y;
    }
}
88. Protect Shared Data
Must protect access to shared, modifiable data
float dot_prod(float* a, float* b, int N)
{
float sum = 0.0;
#pragma omp parallel for shared(sum)
for(int i=0; i<N; i++) {
#pragma omp critical
sum += a[i] * b[i];
}
return sum;
}
89. OpenMP* Critical Construct
#pragma omp critical [(lock_name)]
Defines a critical region on a structured block
float RES;
#pragma omp parallel
{
    float B;
    #pragma omp for
    for (int i = 0; i < niters; i++) {
        B = big_job(i);
        #pragma omp critical (RES_lock)
        consum(B, RES);
    }
}
Threads wait their turn; only one at a time calls consum(), thereby protecting RES from race conditions
Naming the critical construct (RES_lock) is optional
90. OpenMP* Reduction Clause
reduction (op : list)
The variables in “list” must be shared in the
enclosing parallel region
Inside parallel or work-sharing construct:
A PRIVATE copy of each list variable is created and
initialized depending on the “op”
These copies are updated locally by threads
At end of construct, local copies are combined through
“op” into a single value and combined with the value in
the original SHARED variable
91. Reduction Example
Local copy of sum for each thread
All local copies of sum added together and stored
in “global” variable
#pragma omp parallel for reduction(+:sum)
for(i=0; i<N; i++) {
sum += a[i] * b[i];
}
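Putting it together: the dot product from slide 88 rewritten with reduction instead of a critical section, as a runnable sketch (array contents are illustrative):

#include <stdio.h>
#define N 1000

float dot_prod(const float *a, const float *b, int n) {
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* each thread accumulates a private copy */
    return sum;
}

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    printf("dot = %g\n", dot_prod(a, b, N));  /* 1000 * 1 * 2 = 2000 */
    return 0;
}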
92. C/C++ Reduction Operations
A range of associative operands can be used with reduction
Initial values are the ones that make sense mathematically

Operand   Initial Value
+         0
*         1
-         0
^         0
&         ~0
|         0
&&        1
||        0
93. Which Schedule to Use

Schedule Clause   When To Use
STATIC            Predictable and similar work per iteration
DYNAMIC           Unpredictable, highly variable work per iteration
GUIDED            Special case of dynamic to reduce scheduling overhead
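As an illustrative sketch (work() is a hypothetical function with deliberately uneven cost), schedule(dynamic, 4) hands out chunks of four iterations as threads become free:

#include <stdio.h>

/* hypothetical work whose cost varies strongly with i */
long work(int i) {
    long s = 0;
    for (long k = 0; k < (i % 7) * 100000L; k++)
        s += k;
    return s;
}

int main(void) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < 64; i++)
        total += work(i);
    printf("total = %ld\n", total);
    return 0;
}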
95. Single Construct
Denotes block of code to be executed by only one thread
First thread to arrive is chosen
Implicit barrier at end
#pragma omp parallel
{
    DoManyThings();
    #pragma omp single
    {
        ExchangeBoundaries();
    } // threads wait here for single
    DoManyMoreThings();
}
96. Master Construct
Denotes block of code to be executed only by the master thread
No implicit barrier at end
#pragma omp parallel
{
    DoManyThings();
    #pragma omp master
    { // if not master skip to next stmt
        ExchangeBoundaries();
    }
    DoManyMoreThings();
}
97. Implicit Barriers
Several OpenMP* constructs have implicit
barriers
parallel
for
single
Unnecessary barriers hurt performance
Waiting threads accomplish no work!
Suppress implicit barriers, when safe, with the
nowait clause
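For instance, a sketch of nowait on two independent loops (arrays are illustrative); the first loop's barrier is suppressed, so threads proceed straight to the second loop:

#include <stdio.h>
#define N 100

int main(void) {
    float a[N], b[N];
    #pragma omp parallel
    {
        #pragma omp for nowait     /* safe: the second loop does not read a[] */
        for (int i = 0; i < N; i++)
            a[i] = 1.0f * i;

        #pragma omp for            /* keeps its implicit barrier */
        for (int i = 0; i < N; i++)
            b[i] = 2.0f * i;
    }
    printf("a[1] = %g, b[1] = %g\n", a[1], b[1]);
    return 0;
}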
98. Barrier Construct
Explicit barrier synchronization
Each thread waits until all threads arrive
#pragma omp parallel shared (A, B, C)
{
    DoSomeWork(A, B);
    printf("Processed A into B\n");
    #pragma omp barrier
    DoSomeWork(B, C);
    printf("Processed B into C\n");
}
99. Atomic Construct
Special case of a critical section
Applies only to simple update of memory location
#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
#pragma omp atomic
x[index[i]] += work1(i);
y[i] += work2(i);
}
100. OpenMP* API
Get the thread number within a team
int omp_get_thread_num(void);
Get the number of threads in a team
int omp_get_num_threads(void);
Usually not needed for OpenMP codes
Can lead to code not being serially consistent
Does have specific uses (debugging)
Must include a header file
#include <omp.h>
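A small usage sketch of these calls (mainly useful for debugging, as noted above):

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();     /* 0 .. team size - 1 */
        int nt = omp_get_num_threads();    /* team size */
        printf("thread %d of %d\n", id, nt);
    }
    return 0;
}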
102. Firstprivate Clause
Variables initialized from shared variable
C++ objects are copy-constructed
incr = 0;
#pragma omp parallel for firstprivate(incr)
for (i = 0; i <= MAX; i++) {
    if ((i % 2) == 0) incr++;
    A[i] = incr;
}
103. Lastprivate Clause
Variables update shared variable using value from last iteration
C++ objects are updated as if by assignment
void sq2(int n, double *lastterm)
{
    double x; int i;
    #pragma omp parallel
    #pragma omp for lastprivate(x)
    for (i = 0; i < n; i++) {
        x = a[i]*a[i] + b[i]*b[i];
        b[i] = sqrt(x);
    }
    *lastterm = x;
}
104. Threadprivate Clause
Preserves global scope for per-thread storage
Legal for namespace-scope and file-scope variables
Use copyin to initialize from master thread
struct Astruct A;
#pragma omp threadprivate(A)
…
#pragma omp parallel copyin(A)
do_something_to(&A);
…
#pragma omp parallel
do_something_else_to(&A);
Private copies of “A” persist between regions
105. Performance Issues
Idle threads do no useful work
Divide work among threads as evenly as possible
Threads should finish parallel tasks at same time
Synchronization may be necessary
Minimize time waiting for protected resources
106. Load Imbalance
Unequal work loads lead to idle threads and wasted time.
#pragma omp parallel
{
    #pragma omp for
    for ( ; ; ) {
    }
}
[Diagram: per-thread timeline showing busy vs. idle time.]
107. Synchronization
Lost time waiting for locks
#pragma omp parallel
{
    #pragma omp critical
    {
        ...
    }
    ...
}
[Diagram: per-thread timeline showing busy, idle, and in-critical time.]
108. Performance Tuning
Profilers use sampling to provide performance
data.
Traditional profilers are of limited use for tuning
OpenMP*:
Measure CPU time, not wall clock time
Do not report contention for synchronization objects
Cannot report load imbalance
Are unaware of OpenMP constructs
Programmers need profilers specifically designed for OpenMP.
109. Static Scheduling: Doing It By Hand
Must know:
Number of threads (Nthrds)
Each thread ID number (id)
Compute start and end iterations:
#pragma omp parallel
{
    int i, istart, iend;
    int id = omp_get_thread_num();
    int Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) {
        c[i] = a[i] + b[i];
    }
}
SMP: symmetric multi-processor
NUMA: Non Uniform Memory Access
Today we have examples of 8-core devices and have seen roadmaps indicating that by 2008 or '09, 16-core devices will exist. Intel last year demonstrated technology that they say will lead to an 80-core device in five years or so. It is probably a good bet that by 2020 we will have devices with 1K cores.
So what is the problem with moving to TLP architectures? Obviously the future of processing performance growth looks bright using this approach. Highly sophisticated runtime instruction-level-parallelism extraction is no longer required, so there is no longer a need to scrape the bottom of the barrel for a few percent better performance.
However, research in key areas is still critical:
Methods to reduce leakage currents as feature sizes continue to shrink
Power management techniques
Maybe asynchronous design
Off-chip memory and I/O bandwidth have become even more of a bottleneck, and improvements in this area are still needed
Silicon lasers for free-air optical interfaces, for example
But these aren't the big problems ahead for us.
The IBM Cell processor used in the PlayStation 3 has been shown to deliver an order-of-magnitude leap over many comparable technologies. This is due to some radical changes in micro-architecture:
Application-managed memory in the SPEs (no hardware cache); explicit data movement via DMA is required
SPEs are simple SIMD machines: no dedicated scalar pipeline, no sophisticated branch-prediction logic (mainly done with compile-time branch hints), very high-bandwidth interconnect, and very high chip I/O bandwidth (OK on memory bandwidth)
Heterogeneous processor core architecture (8 SPEs and 1 PPE, a PowerPC machine)
OpenMP* Pragma Syntax
For Fortran, you can also use other sentinels (!$OMP), but these must line up exactly in columns 1-5. Column 6 must be blank or contain a + indicating that this line is a continuation of the previous line.
C$OMP
*$OMP
!$OMP
Parallel Regions
Other Fortran Methods.
Fortran will make all nested loop indices private.
Data Scope Attributes
All data clauses apply to parallel regions and worksharing constructs except “shared,” which only applies to parallel regions.
Private Clause
For-loop iteration variable is PRIVATE by default.
Sections are distributed among the threads in the parallel team. Each section is executed only once and each thread may execute zero or more sections. It’s not possible to determine whether or not a section will be executed before another. Therefore, the output of one section should not serve as the input to another. Instead, the section that generates output should be moved before the sections construct.
Atomic Construct
Since index[i] can be the same for different i values, the update to x must be protected. Use of a critical section would serialize updates to x. Atomic protects individual elements of the x array, so that when concurrent instances of index[i] differ, updates can still be done in parallel.
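A runnable sketch of this point (the index contents are illustrative): atomic serializes only the colliding updates, so updates to distinct elements of x proceed in parallel.

#include <stdio.h>
#define N 8

int main(void) {
    float x[4] = {0};
    int index[N] = {0, 1, 2, 3, 0, 1, 2, 3};  /* repeats make protection necessary */

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp atomic                 /* protects x[index[i]] element-wise */
        x[index[i]] += 1.0f;
    }
    printf("x[0] = %g\n", x[0]);  /* each element incremented twice: 2 */
    return 0;
}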
Load Imbalance
Relieve load imbalance
Static even scheduling
Equal size iteration chunks
Based on runtime loop limits
Totally parallel scheduling
OpenMP* default
Dynamic and guided scheduling
The threads do some work then get the next chunk.
There is some cost in scheduling algorithm.
Synchronization
Methods for tuning synchronization:
Reduce lock contention
Different named critical sections
Introduce domain-specific locks
Merge parallel loops and remove barriers
Merge small critical sections
Move critical sections outside loops
Performance Tuning
Mention time inside synch constructs, and so on.