8a-1
Programming with Shared Memory
Part 1
Threads
Accessing shared data
Critical sections
ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010.
2
Shared memory multiprocessor
system
Any memory location is accessible by any of the
processors.
A single address space exists, meaning that each
memory location is given a unique address within a
single range of addresses.
Programming a shared memory multiprocessor
system can take advantage of data stored in the
shared memory, which is accessible by all processors
without having to send the data to a destination through
message passing.
3
Shared Memory Programming
Generally, shared memory programming more
convenient although it does require access to shared
data by different processors to be carefully controlled
by the programmer.
Shared memory systems have been around for a long
time, but with the advent of multi-core systems it has
become very important to be able to program for them.
4
Methods for Programming Shared
Memory Multiprocessors
• Using heavyweight processes.
• Using threads. Example Pthreads, Java threads
• Using a completely new programming language for parallel
programming - not popular. Example Ada.
• Modifying the syntax of an existing sequential programming
language to create a parallel programming language.
Example UPC
• Using an existing sequential programming language
supplemented with compiler directives and libraries for
specifying parallelism. Example OpenMP
We will look mostly at threads and OpenMP
5
Using Heavyweight Processes
Operating systems often based upon notion of a process.
Processor time is shared between processes, switching from
one process to another. Switching might occur at regular intervals
or when an active process becomes delayed.
Offers opportunity to de-schedule processes blocked from
proceeding for some reason, e.g. waiting for an I/O operation
to complete.
Concept could be used for parallel programming. Not much
used because of overhead but fork/join concepts used
elsewhere.
6
FORK-JOIN construct
7
UNIX System Calls
8
UNIX System Calls
SPMD model with different code for master process and
forked slave process.
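The slide's figure with the actual system-call code is not reproduced above. As an
illustration only (a minimal sketch of the fork/join pattern, not the slide's code):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      pid_t pid = fork();                /* FORK: create the slave (child) process */
      if (pid < 0) {
          perror("fork"); exit(1);
      } else if (pid == 0) {
          printf("slave process\n");     /* code executed only by the forked slave */
          exit(0);                       /* slave terminates */
      } else {
          printf("master process\n");    /* code executed only by the master */
          wait(NULL);                    /* JOIN: master waits for the slave to exit */
      }
      return 0;
  }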
9
Differences between a process and threads
10
Statement Execution Order
Single processor: Processes/threads typically executed until blocked.
Multiprocessor: Instructions of processes/threads interleaved in time.
Example
Process 1 Process 2
Instruction 1.1 Instruction 2.1
Instruction 1.2 Instruction 2.2
Instruction 1.3 Instruction 2.3
Many possible orderings, e.g.:
Instruction 1.1
Instruction 1.2
Instruction 2.1
Instruction 1.3
Instruction 2.2
Instruction 2.3
assuming instructions cannot be divided into smaller steps.
11
A sequence of instructions might perform a
specific task, for example print out a message.
If two processes were to print messages, using
interleaved instructions, the messages could
appear garbled - the individual characters of
each message could be interleaved if special
care is not taken in coding the print routines.
12
Thread-Safe Routines
Thread safe if they can be called from multiple threads
simultaneously and always produce correct results.
Standard I/O thread safe (prints messages without
interleaving the characters).
System routines that return time may not be thread safe.
Routines that access shared data may require special care to
be made thread safe.
8a-13
Re-ordering code also done intentionally by:
• Compilers to optimize performance
• Processors internally, again to optimize
performance
Compilers will re-order code prior to execution while
processors will re-order the code during execution.
In both cases the objective is to best utilize the
available computer resources and minimize any
waiting time.
14
Compiler/Processor Optimizations
Compiler and processor reorder instructions to improve
performance (execution speed).
Example
Suppose one had the code:
.
.
a = b + 5;
x = y * 4;
p = x + 9;
.
.
and the processor can perform, as is usual, multiple arithmetic
operations at the same time.
15
Compiler/Processor Optimizations
Can reorder to:
.
.
x = y * 4;
a = b + 5;
p = x + 9;
.
.
and still be logically correct. This gives the multiply operation a
longer time to complete before its result (x) is needed in the last
statement.
Very common for processors to execute machine instructions out
of order for increased speed.
16
Accessing Shared Data
Accessing shared data needs careful control.
Consider two processes each of which is to add one to a
shared data item, x.
Location x is read, x + 1 computed, and the result written
back to the location:
17
Conflict in accessing shared variable
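As an illustration of the conflict (my own sketch, not the slide's figure), two OpenMP
threads each repeatedly perform the unprotected read / add-one / write-back on a shared
x, and updates are frequently lost:

  /* compile with: cc -fopenmp race.c */
  #include <stdio.h>

  int main(void) {
      int x = 0;
      #pragma omp parallel num_threads(2)
      {
          for (int i = 0; i < 100000; i++)
              x = x + 1;           /* unprotected read-modify-write of shared x */
      }
      printf("x = %d\n", x);       /* correct answer is 200000; usually prints less */
      return 0;
  }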
18
Critical Section
A mechanism for ensuring that only one process accesses a
particular resource at a time.
critical section – a section of code for accessing the resource
Arrange that only one such critical section is executed at a
time.
This mechanism is known as mutual exclusion.
Concept also appears in operating systems.
19
Locks
Simplest mechanism for ensuring mutual exclusion of critical
sections.
A lock - a 1-bit variable that is a 1 to indicate that a process
has entered the critical section and a 0 to indicate that no
process is in the critical section.
Operates much like that of a door lock:
A process coming to the “door” of a critical section and
finding it open may enter the critical section, locking the door
behind it to prevent other processes from entering. Once the
process has finished the critical section, it unlocks the door
and leaves.
20
Control of critical sections through
busy waiting
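The busy-waiting figure is not reproduced above. As an illustration (my own sketch using
C11 atomics, which may differ from the primitive shown on the slide), a spin lock can be
built from an atomic test-and-set that loops until the "door" is found open:

  #include <stdatomic.h>

  atomic_flag lock = ATOMIC_FLAG_INIT;        /* clear (0): no process in critical section */

  void enter_critical(void) {
      while (atomic_flag_test_and_set(&lock))
          ;                                   /* busy wait: spin until the lock was 0 */
  }

  void leave_critical(void) {
      atomic_flag_clear(&lock);               /* unlock the door on leaving */
  }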
21
Critical Sections Serializing Code
High performance programs should have as few critical
sections as possible, as their use can serialize the
code.
Suppose all processes happen to come to their critical
sections together.
They will execute their critical sections one after the other.
In that situation, the execution time becomes almost that of
a single processor.
22
Illustration
23
Deadlock
Can occur with two processes when one requires a resource
held by the other, and this process requires a resource held by
the first process.
24
Deadlock (deadly embrace)
Deadlock can also occur in a circular fashion with several
processes having a resource wanted by another.
25
Program example
To sum the elements of an array, a[1000]:
int i, sum, a[1000];
sum = 0;
for (i = 0; i < 1000; i++)
sum = sum + a[i];
26
UNIX Processes
Calculation will be divided into two parts, one doing even i and
one doing odd i; i.e.,
Process 1 Process 2
sum1 = 0; sum2 = 0;
for (i = 0; i < 1000; i = i + 2) for (i = 1; i < 1000; i = i + 2)
sum1 = sum1 + a[i]; sum2 = sum2 + a[i];
Each process will add its result (sum1 or sum2) to an
accumulating result, sum :
sum = sum + sum1; sum = sum + sum2;
Sum will need to be shared and protected by a lock. Shared
data structure is created:
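The slide's shared data structure code itself is not shown above. The sketch below is my
own illustration of the idea (using System V shared memory and a process-shared POSIX
semaphore as the lock, which may differ from the slide's code): each process computes a
partial sum and adds it to the shared sum inside a locked critical section.

  /* compile with: cc sum2.c -pthread */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/shm.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <semaphore.h>

  #define N 1000

  struct shared { sem_t lock; int sum; };      /* the shared data structure */

  int main(void) {
      int a[N];
      for (int i = 0; i < N; i++) a[i] = i;    /* sample data */

      int shmid = shmget(IPC_PRIVATE, sizeof(struct shared), IPC_CREAT | 0600);
      struct shared *sh = shmat(shmid, NULL, 0);
      sh->sum = 0;
      sem_init(&sh->lock, 1, 1);               /* 1 => semaphore shared between processes */

      pid_t pid = fork();
      int start = (pid == 0) ? 1 : 0;          /* child sums odd i, parent sums even i */
      int partial = 0;
      for (int i = start; i < N; i += 2)
          partial += a[i];

      sem_wait(&sh->lock);                     /* critical section protecting sum */
      sh->sum += partial;
      sem_post(&sh->lock);

      if (pid == 0) { shmdt(sh); exit(0); }    /* slave finishes */
      wait(NULL);                              /* master waits for the slave */
      printf("sum = %d\n", sh->sum);
      shmdt(sh);
      shmctl(shmid, IPC_RMID, NULL);
      return 0;
  }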
27
ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010, Oct 27, 2010
Programming with Shared Memory
Introduction to OpenMP
28
OpenMP
An accepted standard developed in the late 1990s by a
group of industry specialists.
Consists of a small set of compiler directives, augmented
with a small set of library routines and environment variables
using the base languages Fortran and C/C++.
Several OpenMP compilers available.
29
Wikipedia OpenMP
http://en.wikipedia.org/wiki/File:OpenMP_language_extensions.svg
30
OpenMP
• Uses a thread-based shared memory programming
model
• OpenMP programs will create multiple threads
• All threads have access to global memory
• Data can be shared among all threads or private to one
thread
• Synchronization occurs but often implicit
31
OpenMP uses the “fork-join” model but is thread-based.
Initially, a single thread executes (the master thread).
parallel directive creates a team of threads with a specified
block of code executed by the multiple threads in parallel.
The exact number of threads in the team is determined in one
of several ways.
Other directives used within a parallel construct to specify
parallel for loops and different blocks of code for threads.
32
Figure: fork/join model – the master thread forks a team of multiple threads at each
parallel region, with synchronization (join) at the end of each parallel region.
33
For C/C++, the OpenMP directives contained in #pragma
statements.
Format:
#pragma omp directive_name ...
where omp is an OpenMP keyword.
May be additional parameters (clauses) after directive name
for different options.
Some directives require code to be specified in a structured
block that follows the directive; the directive and structured
block then form a “construct”.
34
Parallel Directive
#pragma omp parallel
structured_block
creates multiple threads, each one executing the specified
structured_block, (a single statement or a compound
statement created with { ...} with a single entry point and a
single exit point.)
Implicit barrier at end of construct.
Directive corresponds to forall construct.
35
Hello world example
#pragma omp parallel
{
printf("Hello World from thread %d of %dn",
omp_get_thread_num(),
omp_get_num_threads() );
}
Output from an 8-processor/core machine:
Hello World from thread 0 of 8
Hello World from thread 4 of 8
Hello World from thread 3 of 8
Hello World from thread 2 of 8
Hello World from thread 7 of 8
Hello World from thread 1 of 8
Hello World from thread 6 of 8
Hello World from thread 5 of 8
(Annotations from the slide: the #pragma line is the OpenMP directive for a parallel
region; the opening brace must be on a new line.)
36
Private and shared variables
Variables could be declared within each parallel region but
OpenMP provides private clause.
int myid;
…
#pragma omp parallel private(myid)
{
myid = omp_get_thread_num();
printf("Hello World from thread = %dn", myid);
}
(Each thread has a local copy of the variable myid.)
Also a shared clause available.
37
Example
#pragma omp parallel private(myid, threads)
{
myid = omp_get_thread_num(); // myid is private
threads = omp_get_num_threads(); // threads is private
arr[myid] = 10 * threads; // arr is global
}
38
Number of threads in a team
Established by either:
1. num_threads clause after the parallel directive, or
2. omp_set_num_threads() library routine being previously
called, or
3. Environment variable OMP_NUM_THREADS is defined
in order given or is system dependent if none of above.
Number of threads available can also be altered dynamically to
achieve best use of system resources.
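A small illustration of the first two mechanisms (my own sketch, not from the slides);
the third would be set in the shell, e.g. OMP_NUM_THREADS=4:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      omp_set_num_threads(8);              /* 2. library routine, called beforehand */

      #pragma omp parallel num_threads(4)  /* 1. clause takes precedence over the call */
      {
          #pragma omp single
          printf("team size = %d\n", omp_get_num_threads());   /* prints 4 */
      }
      return 0;
  }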
39
Work-Sharing
Three constructs in this classification:
sections
for
single
In all cases, there is an implicit barrier at end of construct
unless a nowait clause included, which overrides the barrier.
Note: These constructs do not start a new team of threads.
That done by an enclosing parallel construct.
40
Sections
The construct
#pragma omp sections
{
#pragma omp section
structured_block
.
.
.
#pragma omp section
structured_block
}
causes the structured blocks to be shared among the threads in the team.
The first section directive is optional.
(The blocks are executed by available threads.)
41
Example
#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
tid = omp_get_thread_num();
#pragma omp sections nowait
{
#pragma omp section
{
printf("Thread %d doing section 1n",tid);
for (i=0; i<N; i++) {
c[i] = a[i] + b[i];
printf("Thread %d: c[%d]= %fn",tid,i,c[i]);
}
}
#pragma omp section
{
printf("Thread %d doing section 2n",tid);
for (i=0; i<N; i++) {
d[i] = a[i] * b[i];
printf("Thread %d: d[%d]= %fn",tid,i,d[i]);
}
}
} /* end of sections */
} /* end of parallel section */
(One thread executes the first section; another thread executes the second.)
42
For Loop
#pragma omp for
for ( i = 0; …. )
causes the for loop to be divided into parts and the parts shared
among the threads in the team. The for loop must be of a simple form.
How the loop is divided can be specified by an additional “schedule”
clause.
43
Loop Scheduling and Partitioning
• Static
#pragma omp parallel for schedule (static,chunk_size)
Partitions loop iterations into equal sized chunks specified by
chunk_size. Chunks assigned to threads in round robin
fashion.
• Dynamic
#pragma omp parallel for schedule (dynamic,chunk_size)
Uses internal work queue. Chunk-sized block of loop assigned
to threads as they become available.
44
Example
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
tid = omp_get_thread_num();
if (tid == 0) {
nthreads = omp_get_num_threads();
printf("Number of threads = %dn", nthreads);
}
printf("Thread %d starting...n“,tid);
#pragma omp for schedule(dynamic,chunk)
for (i=0; i<N; i++) {
c[i] = a[i] + b[i];
printf("Thread %d: c[%d]= %fn",tid,i,c[i]);
}
} /* end of parallel section */
(The if (tid == 0) block is executed by one thread only; the iterations of the for loop
are shared among the team.)
45
Single
The directive
#pragma omp single
structured block
causes the structured block to be executed by one thread only.
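A short sketch (my own; do_work and do_more_work are placeholder functions) showing
single inside a parallel region; the implicit barrier at the end of single holds the
other threads until the one thread has finished:

  #pragma omp parallel
  {
      do_work(omp_get_thread_num());       /* every thread works (placeholder function) */

      #pragma omp single
      printf("phase finished\n");          /* executed by exactly one thread */

      /* implicit barrier here: all threads wait before continuing */
      do_more_work(omp_get_thread_num());  /* placeholder function */
  }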
46
Combined Parallel Work-sharing
Constructs
If a parallel directive is followed by a single for directive, it
can be combined into:
#pragma omp parallel for
<for loop>
with similar effects.
47
If a parallel directive is followed by a single sections directive,
it can be combined into
#pragma omp parallel sections
{
#pragma omp section
structured_block
#pragma omp section
structured_block
.
.
.
}
with similar effect. (In both cases, the nowait clause is not
allowed.)
48
Master Directive
The master directive:
#pragma omp master
structured_block
causes the master thread to execute the structured block.
It differs from the directives in the work-sharing group in that
there is no implied barrier at the end of the construct (nor at the
beginning). Other threads encountering this directive will
ignore it and the associated structured block, and will move
on.
49
• Guided
#pragma omp parallel for schedule (guided,chunk_size)
Similar to dynamic but chunk size starts large and gets smaller
to reduce time threads have to go to work queue.
chunk size = (number of iterations remaining) / (2 * number of threads)
• Runtime
#pragma omp parallel for schedule (runtime)
Uses OMP_SCHEDULE environment variable to specify which
of static, dynamic or guided should be used.
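For example (a sketch of the runtime case, not from the slides), the schedule is then
chosen when the program is run:

  #pragma omp parallel for schedule(runtime)
  for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];

  /* in the shell before running, e.g.:  OMP_SCHEDULE="guided,4" ./a.out */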
50
Reduction clause
Used to combine the results of the iterations into a single
value; cf. MPI_Reduce().
Can be used with parallel, for, and sections.
Example
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++ ) {
sum = sum + funct(k);
}
Private copy of sum created for each thread by the compiler.
Private copy will be added to sum at end.
This eliminates the need for a critical section here.
(In reduction(+:sum), + is the operation and sum is the variable.)
51
Private variables
private clause – creates private copies of variables for
each thread
firstprivate clause - as private clause but initializes each
copy to the values given immediately prior to parallel
construct.
lastprivate clause – as private but “the value of each
lastprivate variable from the sequentially last iteration of
the associated loop, or the lexically last section directive,
is assigned to the variable’s original object.”
52
Synchronization Constructs
Critical
The critical directive will only allow one thread to execute the
associated structured block. When one or more threads
reach the critical directive:
#pragma omp critical (name)
structured_block
they will wait until no other thread is executing the same
critical section (one with the same name), and then one
thread will proceed to execute the structured block.
name is optional. All critical sections with no name map to
one undefined name.
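A minimal sketch (my own; do_partial_work is a placeholder function) using a named
critical section to protect a shared accumulation:

  int sum = 0;
  #pragma omp parallel
  {
      int local = do_partial_work(omp_get_thread_num());   /* placeholder function */

      #pragma omp critical (sum_update)
      sum = sum + local;             /* only one thread at a time executes this */
  }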
53
Barrier
When a thread reaches the barrier
#pragma omp barrier
it waits until all threads have reached the barrier and then they
all proceed together.
There are restrictions on the placement of barrier directive in a
program. In particular, all threads must be able to reach the
barrier.
54
Atomic
The atomic directive
#pragma omp atomic
expression_statement
implements a critical section efficiently when the critical
section simply updates a variable (adds one, subtracts one,
or does some other simple arithmetic operation as defined
by expression_statement).
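For instance (a sketch, my own):

  int count = 0;
  #pragma omp parallel
  {
      #pragma omp atomic
      count++;          /* atomic update; cheaper than a full critical section */
  }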
55
Flush
A synchronization point which causes a thread to have a
“consistent” view of certain or all shared variables in memory.
All current read and write operations on variables allowed to
complete and values written back to memory but any memory
operations in code after flush are not started.
Format:
#pragma omp flush (variable_list)
Only applied to thread executing flush, not to all threads in team.
Flush occurs automatically at entry and exit of parallel and critical
directives, and at the exit of for, sections, and single (if a no-wait
clause is not present).
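A sketch of the usual flag-passing pattern (my own illustration, assuming a team of at
least two threads; not code from the slides):

  int data = 0, flag = 0;              /* both shared */
  #pragma omp parallel sections
  {
      #pragma omp section
      {                                /* producer */
          data = 42;
          #pragma omp flush(data)
          flag = 1;
          #pragma omp flush(flag)
      }
      #pragma omp section
      {                                /* consumer */
          int f;
          do {
              #pragma omp flush(flag)
              f = flag;
          } while (f == 0);
          #pragma omp flush(data)
          printf("read %d\n", data);   /* sees 42 */
      }
  }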
56
Ordered clause
Used in conjunction with for and parallel for directives to
cause an iteration to be executed in the order that it
would have occurred if written as a sequential loop.
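A small sketch (my own; compute is a placeholder function):

  #pragma omp parallel for ordered
  for (i = 0; i < N; i++) {
      int v = compute(i);              /* iterations run in parallel */
      #pragma omp ordered
      printf("%d: %d\n", i, v);        /* output appears in sequential loop order */
  }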
57
More information
Full information on OpenMP at
http://openmp.org/wp/
slides 8d-58
Programming with Shared Memory
Specifying parallelism
Performance issues
ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010.
59
We have seen OpenMP for specifying parallelism
The programmer decides what parts of the code
should be parallelized and inserts compiler
directives (pragmas).
Whatever programming environment we use in which
the programmer explicitly says what should be
done in parallel, the issue for the programmer is
deciding what can be done in parallel.
Let us use generic language constructs for
parallelism.
60
par Construct
For specifying concurrent statements:
par {
S1;
S2;
.
.
Sn;
}
Says one can execute
all statements S1 to Sn
simultaneously if
resources available, or
execute them in any
order and still get the
correct result
61
forall Construct
To start multiple similar processes together:
forall (i = 0; i < n; i++) {
S1;
S2;
.
.
Sm;
}
Says each iteration of
body can be executed
simultaneously if
resources available, or in
any order and still get the
correct result.
The statements of each
instance of body executed
in order given.
Each instance of the body
uses a different value of i.
62
Example
forall (i = 0; i < 5; i++)
a[i] = 0;
clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
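In OpenMP the same effect would be obtained with a parallel for (a direct translation,
shown only as an illustration):

  #pragma omp parallel for
  for (i = 0; i < 5; i++)
      a[i] = 0;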
63
Dependency Analysis
To identify which processes could be executed together.
Example
Can see immediately in the code
forall (i = 0; i < 5; i++)
a[i] = 0;
that every instance of the body is independent of other
instances and all instances can be executed simultaneously.
However, it may not be that obvious. Need algorithmic way
of recognizing dependencies, for a parallelizing compiler.
64
65
66
Can use Berstein’s conditions at:
• Machine instruction level inside processor – have logic
to detect if conditions satisfied (see computer
architecture course)
• At the process level to detect whether two processes
can be executed simultaneously (using the inputs and
outputs of processes).
• Can be extended to more than two processes but
number of conditions rises – need every input/output
combination checked.
For three statements, need how many conditions
checked?
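The figures on slides 64 and 65 are not reproduced above. As a reminder (the standard
statement of the conditions, not copied from those slides): two processes P1 and P2, with
input sets I1, I2 (locations read) and output sets O1, O2 (locations written), can be
executed concurrently if

  I1 ∩ O2 = ∅     I2 ∩ O1 = ∅     O1 ∩ O2 = ∅

For example, a = x + y and b = x + z give I1 = {x, y}, O1 = {a}, I2 = {x, z}, O2 = {b};
all three intersections are empty, so the two statements can be executed in parallel.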
67
Shared Memory Programming
Performance Issues
68
Performance issues with Threads
Program might actually go slower when parallelized.
Too many threads can significantly reduce the program
performance.
69
Reasons:
• Work split among too many threads gives each thread too
little work, so overhead of starting and terminating threads
swamps useful work.
• Too many concurrent threads incur overhead from having to
share fixed hardware resources
• OS typically schedules threads in round robin with a time-
slice
• Time-slicing incurs overhead
• Saving registers, effects on cache memory, virtual
memory management … .
• Waiting to acquire a lock.
• When a thread is suspended while holding a lock, all
threads waiting for the lock will have to wait for that thread to
re-start.
Source: Multi-core programming by S. Akhter and J. Roberts, Intel Press.
70
Some Strategies
• Limit number of runnable threads to number of hardware
threads. (See later we do not do this with GPUs)
• For an n-core machine (not hyper-threaded) have n runnable
threads.
• If hyper-threaded (with 2 virtual threads per core) double
this.
• Can have more threads in total but others may be blocked.
• Separate I/O threads from compute threads
• I/O threads wait for external events
• Never hard-code number of threads – leave as a tuning
parameter.
71
• Let OpenMP optimize number of threads
• Implement a thread pool
• Implement a work stealing approach in which each thread
has a work queue. Threads with no work take work
from other threads
72
Critical Sections Serializing Code
High performance programs should have as few critical
sections as possible, as their use can serialize the
code.
Suppose all processes happen to come to their critical
sections together.
They will execute their critical sections one after the other.
In that situation, the execution time becomes almost that of
a single processor.
73
Illustration
74
Shared Data in Systems with Caches
All modern computer systems have cache memory, high-
speed memory closely attached to each processor for holding
recently referenced data and code.
Figure: processors, each with its own cache memory, connected to main memory.
75
Cache coherence protocols
Update policy - copies of data in all caches are updated at
the time one copy is altered,
or
Invalidate policy - when one copy of data is altered, the same
data in any other cache is invalidated (by resetting a valid bit
in the cache). These copies are only updated when the
associated processor makes a reference to it.
76
False Sharing
Different parts of a block are required by different
processors, but not the same bytes.
If one processor writes to one part of the block, copies of the
complete block in other caches must be updated or invalidated,
although the actual data is not shared.
77
Solution for False Sharing
Compiler to alter the layout of the data stored in the
main memory, separating data only altered by one
processor into different blocks.
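A common hand-coded version of the same idea (a sketch assuming a 64-byte cache
line/block; not from the slides) is to pad each thread's data so it occupies its own
block:

  #define CACHE_LINE 64                 /* assumed block (cache line) size in bytes */
  #define NUM_THREADS 8                 /* example value */

  struct padded_counter {
      int value;
      char pad[CACHE_LINE - sizeof(int)];   /* keeps each counter in its own block */
  };

  struct padded_counter counts[NUM_THREADS];  /* counts[t].value written only by thread t */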
78
Sequential Consistency
Formally defined by Lamport (1979):
A multiprocessor is sequentially consistent if the result
of any execution is the same as if the operations of all
the processors were executed in some sequential
order, and the operations of each individual
processor occur in this sequence in the order
specified by its program.
i.e., the overall effect of a parallel program is not
changed by any arbitrary interleaving of instruction
execution in time.
79
Sequential Consistency
80
Writing a parallel program for a system which is
known to be sequentially consistent enables us to
reason about the result of the program.
81
Example
Process P1 Process P2
. .
data = new; .
flag = TRUE; .
. .
. while (flag != TRUE) { };
. data_copy = data;
. .
Expect data_copy to be set to new because we expect data =
new to be executed before flag = TRUE and while (flag !=
TRUE) { } to be executed before data_copy = data.
This ensures that process 2 reads the new data from process 1.
Process 2 will simply wait for the new data to be produced.
82
Program Order
Sequential consistency refers to “operations of each
individual processor .. occur in the order specified in its
program” or program order.
In previous figure, this order is that of the stored machine
instructions to be executed.
83
Compiler Optimizations
The order of execution is not necessarily the same as the
order of the corresponding high level statements in the
source program as a compiler may reorder statements for
improved performance.
In this case, term program order will depend upon context,
either the order in the source program or the order in the
compiled machine instructions.
84
High Performance Processors
Modern processors also usually reorder machine instructions
internally during execution for increased performance.
This does not stop a multiprocessor from being sequentially consistent, if the
processor only produces final results in program order (that is,
retires values to registers in program order, which most
processors do).
All multiprocessors will have the option of operating under the
sequential consistency model. However, it can severely limit
compiler optimizations and processor performance.
85
Example of Processor Re-ordering
Process P1 Process P2
. .
new = a * b; .
data = new; .
flag = TRUE; .
. .
. while (flag != TRUE) { };
. data_copy = data;
. .
Multiply machine instruction corresponding to new = a * b is issued
for execution.
Next instruction corresponding to data = new cannot be issued until
the multiply has produced its result.
However the following statement, flag = TRUE, is completely independent,
and a clever processor could start this operation before the multiply has
completed, leading to the sequence:
86
Process P1 Process P2
. .
new = a * b; .
flag = TRUE; .
data = new; .
. .
. while (flag != TRUE) { };
. data_copy = data;
. .
Now the while statement might occur before new is assigned to
data, and the code would fail.
All multiprocessors have the option of operating under the sequential
consistency model, i.e. not reordering instructions, forcing the
multiply instruction above to complete before starting
subsequent instructions that depend upon its result.
87
Relaxing Read/Write Orders
Processors may be able to relax the consistency in
terms of the order of reads and writes of one processor
with respect to those of another processor to obtain
higher performance, and provide instructions to enforce
consistency when needed.
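A modern analogue in C11 (my own illustration, not one of the processor instructions
listed next) uses explicit fences, together with an atomic flag, to restore the ordering
needed in the earlier data/flag example:

  #include <stdatomic.h>

  atomic_int flag = 0;                 /* the flag itself must be atomic */
  int data, data_copy;

  /* producer thread */
  data = 42;                                         /* new value */
  atomic_thread_fence(memory_order_release);         /* earlier writes complete first */
  atomic_store_explicit(&flag, 1, memory_order_relaxed);

  /* consumer thread */
  while (atomic_load_explicit(&flag, memory_order_relaxed) != 1) { }
  atomic_thread_fence(memory_order_acquire);         /* later reads not moved above the wait */
  data_copy = data;                                  /* guaranteed to see 42 */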
88
Examples
Alpha processors
Memory barrier (MB) instruction - waits for all previously issued
memory access instructions to complete before issuing any
new memory operations.
Write memory barrier (WMB) instruction - as MB but only on
memory write operations, i.e. waits for all previously issued
memory write access instructions to complete before issuing
any new memory write operations - which means memory
reads could be issued after a memory write operation but
overtake it and complete before the write operation.
89
SUN Sparc V9 processors
Memory barrier (MEMBAR) instruction with four bits for
variations. The write-to-read bit prevents any reads that follow it
from being issued before all writes that precede it have completed.
Other: Write-to-write, read-to-read, read-to-write.
IBM PowerPC processor
SYNC instruction - similar to Alpha MB instruction (check
differences)
90
Questions

Weitere ähnliche Inhalte

Ähnlich wie slides8 SharedMemory.ppt

distributed-systemsfghjjjijoijioj-chap3.pptx
distributed-systemsfghjjjijoijioj-chap3.pptxdistributed-systemsfghjjjijoijioj-chap3.pptx
distributed-systemsfghjjjijoijioj-chap3.pptxlencho3d
 
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfLec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfsamaghorab
 
Distributed system lectures
Distributed system lecturesDistributed system lectures
Distributed system lecturesmarwaeng
 
parallel Questions &amp; answers
parallel Questions &amp; answersparallel Questions &amp; answers
parallel Questions &amp; answersMd. Mashiur Rahman
 
2023comp90024_workshop.pdf
2023comp90024_workshop.pdf2023comp90024_workshop.pdf
2023comp90024_workshop.pdfLevLafayette1
 
Lecture 9 - Process Synchronization.pptx
Lecture 9 - Process Synchronization.pptxLecture 9 - Process Synchronization.pptx
Lecture 9 - Process Synchronization.pptxEhteshamulIslam1
 
parallel programming models
 parallel programming models parallel programming models
parallel programming modelsSwetha S
 
TCP Sockets Tutor maXbox starter26
TCP Sockets Tutor maXbox starter26TCP Sockets Tutor maXbox starter26
TCP Sockets Tutor maXbox starter26Max Kleiner
 
Cloud Module 3 .pptx
Cloud Module 3 .pptxCloud Module 3 .pptx
Cloud Module 3 .pptxssuser41d319
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
 
Computer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer systemComputer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer systemmohantysikun0
 
[iOS] Multiple Background Threads
[iOS] Multiple Background Threads[iOS] Multiple Background Threads
[iOS] Multiple Background ThreadsNikmesoft Ltd
 
ipc.pptx
ipc.pptxipc.pptx
ipc.pptxSuhanB
 
CS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfCS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfKishaKiddo
 

Ähnlich wie slides8 SharedMemory.ppt (20)

distributed-systemsfghjjjijoijioj-chap3.pptx
distributed-systemsfghjjjijoijioj-chap3.pptxdistributed-systemsfghjjjijoijioj-chap3.pptx
distributed-systemsfghjjjijoijioj-chap3.pptx
 
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfLec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdf
 
unit_1.pdf
unit_1.pdfunit_1.pdf
unit_1.pdf
 
Distributed system lectures
Distributed system lecturesDistributed system lectures
Distributed system lectures
 
Multithreading by rj
Multithreading by rjMultithreading by rj
Multithreading by rj
 
parallel Questions &amp; answers
parallel Questions &amp; answersparallel Questions &amp; answers
parallel Questions &amp; answers
 
2023comp90024_workshop.pdf
2023comp90024_workshop.pdf2023comp90024_workshop.pdf
2023comp90024_workshop.pdf
 
Lecture 9 - Process Synchronization.pptx
Lecture 9 - Process Synchronization.pptxLecture 9 - Process Synchronization.pptx
Lecture 9 - Process Synchronization.pptx
 
parallel programming models
 parallel programming models parallel programming models
parallel programming models
 
TCP Sockets Tutor maXbox starter26
TCP Sockets Tutor maXbox starter26TCP Sockets Tutor maXbox starter26
TCP Sockets Tutor maXbox starter26
 
Chapter 3 chapter reading task
Chapter 3 chapter reading taskChapter 3 chapter reading task
Chapter 3 chapter reading task
 
Cloud Module 3 .pptx
Cloud Module 3 .pptxCloud Module 3 .pptx
Cloud Module 3 .pptx
 
25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
Computer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer systemComputer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer system
 
[iOS] Multiple Background Threads
[iOS] Multiple Background Threads[iOS] Multiple Background Threads
[iOS] Multiple Background Threads
 
ipc.pptx
ipc.pptxipc.pptx
ipc.pptx
 
CS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfCS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdf
 
Chapter 6 os
Chapter 6 osChapter 6 os
Chapter 6 os
 
Distributed Elixir
Distributed ElixirDistributed Elixir
Distributed Elixir
 

Mehr von aminnezarat

ارائه ابزار.pptx
ارائه ابزار.pptxارائه ابزار.pptx
ارائه ابزار.pptxaminnezarat
 
00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdfaminnezarat
 
Smart Data Strategy EN (1).pdf
Smart Data Strategy EN (1).pdfSmart Data Strategy EN (1).pdf
Smart Data Strategy EN (1).pdfaminnezarat
 
06 hpc library_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
06 hpc library_fundamentals_of_parallelism_and_code_optimization-www.astek.ir06 hpc library_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
06 hpc library_fundamentals_of_parallelism_and_code_optimization-www.astek.iraminnezarat
 
05 mpi fundamentals_of_parallelism_and_code_optimization-www.astek.ir
05 mpi fundamentals_of_parallelism_and_code_optimization-www.astek.ir05 mpi fundamentals_of_parallelism_and_code_optimization-www.astek.ir
05 mpi fundamentals_of_parallelism_and_code_optimization-www.astek.iraminnezarat
 
04 memory traffic_fundamentals_of_parallelism_and_code_optimization-www.astek...
04 memory traffic_fundamentals_of_parallelism_and_code_optimization-www.astek...04 memory traffic_fundamentals_of_parallelism_and_code_optimization-www.astek...
04 memory traffic_fundamentals_of_parallelism_and_code_optimization-www.astek...aminnezarat
 
03 open mp_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
03 open mp_fundamentals_of_parallelism_and_code_optimization-www.astek.ir03 open mp_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
03 open mp_fundamentals_of_parallelism_and_code_optimization-www.astek.iraminnezarat
 
02 vectorization fundamentals_of_parallelism_and_code_optimization-www.astek.ir
02 vectorization fundamentals_of_parallelism_and_code_optimization-www.astek.ir02 vectorization fundamentals_of_parallelism_and_code_optimization-www.astek.ir
02 vectorization fundamentals_of_parallelism_and_code_optimization-www.astek.iraminnezarat
 
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.iraminnezarat
 
Machine learning and big-data-in-physics 13970711-Dr. Amin Nezarat
Machine learning and big-data-in-physics 13970711-Dr. Amin NezaratMachine learning and big-data-in-physics 13970711-Dr. Amin Nezarat
Machine learning and big-data-in-physics 13970711-Dr. Amin Nezarataminnezarat
 
Big data HPC Convergence-Dr. Amin-Nezarat-(aminnezarat@gmail.com)-2019
Big data HPC Convergence-Dr. Amin-Nezarat-(aminnezarat@gmail.com)-2019Big data HPC Convergence-Dr. Amin-Nezarat-(aminnezarat@gmail.com)-2019
Big data HPC Convergence-Dr. Amin-Nezarat-(aminnezarat@gmail.com)-2019aminnezarat
 
Camera ready-nash equilibrium-ngct2015-format
Camera ready-nash equilibrium-ngct2015-formatCamera ready-nash equilibrium-ngct2015-format
Camera ready-nash equilibrium-ngct2015-formataminnezarat
 
Data set cloudrank-d-hpca_tutorial
Data set cloudrank-d-hpca_tutorialData set cloudrank-d-hpca_tutorial
Data set cloudrank-d-hpca_tutorialaminnezarat
 

Mehr von aminnezarat (15)

ارائه ابزار.pptx
ارائه ابزار.pptxارائه ابزار.pptx
ارائه ابزار.pptx
 
00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf
 
Smart Data Strategy EN (1).pdf
Smart Data Strategy EN (1).pdfSmart Data Strategy EN (1).pdf
Smart Data Strategy EN (1).pdf
 
BASIC_MPI.ppt
BASIC_MPI.pptBASIC_MPI.ppt
BASIC_MPI.ppt
 
Chap2 GGKK.ppt
Chap2 GGKK.pptChap2 GGKK.ppt
Chap2 GGKK.ppt
 
06 hpc library_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
06 hpc library_fundamentals_of_parallelism_and_code_optimization-www.astek.ir06 hpc library_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
06 hpc library_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
 
05 mpi fundamentals_of_parallelism_and_code_optimization-www.astek.ir
05 mpi fundamentals_of_parallelism_and_code_optimization-www.astek.ir05 mpi fundamentals_of_parallelism_and_code_optimization-www.astek.ir
05 mpi fundamentals_of_parallelism_and_code_optimization-www.astek.ir
 
04 memory traffic_fundamentals_of_parallelism_and_code_optimization-www.astek...
04 memory traffic_fundamentals_of_parallelism_and_code_optimization-www.astek...04 memory traffic_fundamentals_of_parallelism_and_code_optimization-www.astek...
04 memory traffic_fundamentals_of_parallelism_and_code_optimization-www.astek...
 
03 open mp_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
03 open mp_fundamentals_of_parallelism_and_code_optimization-www.astek.ir03 open mp_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
03 open mp_fundamentals_of_parallelism_and_code_optimization-www.astek.ir
 
02 vectorization fundamentals_of_parallelism_and_code_optimization-www.astek.ir
02 vectorization fundamentals_of_parallelism_and_code_optimization-www.astek.ir02 vectorization fundamentals_of_parallelism_and_code_optimization-www.astek.ir
02 vectorization fundamentals_of_parallelism_and_code_optimization-www.astek.ir
 
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
 
Machine learning and big-data-in-physics 13970711-Dr. Amin Nezarat
Machine learning and big-data-in-physics 13970711-Dr. Amin NezaratMachine learning and big-data-in-physics 13970711-Dr. Amin Nezarat
Machine learning and big-data-in-physics 13970711-Dr. Amin Nezarat
 
Big data HPC Convergence-Dr. Amin-Nezarat-(aminnezarat@gmail.com)-2019
Big data HPC Convergence-Dr. Amin-Nezarat-(aminnezarat@gmail.com)-2019Big data HPC Convergence-Dr. Amin-Nezarat-(aminnezarat@gmail.com)-2019
Big data HPC Convergence-Dr. Amin-Nezarat-(aminnezarat@gmail.com)-2019
 
Camera ready-nash equilibrium-ngct2015-format
Camera ready-nash equilibrium-ngct2015-formatCamera ready-nash equilibrium-ngct2015-format
Camera ready-nash equilibrium-ngct2015-format
 
Data set cloudrank-d-hpca_tutorial
Data set cloudrank-d-hpca_tutorialData set cloudrank-d-hpca_tutorial
Data set cloudrank-d-hpca_tutorial
 

Kürzlich hochgeladen

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 

Kürzlich hochgeladen (16)

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 

slides8 SharedMemory.ppt

  • 1. 8a-1 Programming with Shared Memory Part 1 Threads Accessing shared data Critical sections ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010.
  • 2. 2 Shared memory multiprocessor system Any memory location can be accessible by any of the processors. A single address space exists, meaning that each memory location is given a unique address within a single range of addresses. Programming a shared memory multiprocessor system can take advantage of data stored in the shared memory, which is accessible by all processors without having to send the data to destination through message passing
  • 3. 3 Shared Memory Programming Generally, shared memory programming more convenient although it does require access to shared data by different processors to be carefully controlled by the programmer. Shared memory systems have been around for a long time but with the advent of multi-core systems, it has become very important to able to program for them
  • 4. 4 Methods for Programming Shared Memory Multiprocessors • Using heavyweight processes. • Using threads. Example Pthreads, Java threads • Using a completely new programming language for parallel programming - not popular. Example Ada. • Modifying the syntax of an existing sequential programming language to create a parallel programming language. Example UPC • Using an existing sequential programming language supplemented with compiler directives and libraries for specifying parallelism. Example OpenMP We will look mostly at threads and OpenMP
  • 5. 5 Using Heavyweight Processes Operating systems often based upon notion of a process. Processor time shares between processes, switching from one process to another. Might occur at regular intervals or when an active process becomes delayed. Offers opportunity to de-schedule processes blocked from proceeding for some reason, e.g. waiting for an I/O operation to complete. Concept could be used for parallel programming. Not much used because of overhead but fork/join concepts used elsewhere.
  • 8. 8 UNIX System Calls SPMD model with different code for master process and forked slave process.
  • 9. 9 Differences between a process and threads
  • 10. 10 Statement Execution Order Single processor: Processes/threads typically executed until blocked. Multiprocessor: Instructions of processes/threads interleaved in time. Example Process 1 Process 2 Instruction 1.1 Instruction 2.1 Instruction 1.2 Instruction 2.2 Instruction 1.3 Instruction 2.3 Many possible orderings, e.g.: Instruction 1.1 Instruction 1.2 Instruction 2.1 Instruction 1.3 Instruction 2.2 Instruction 2.3 assuming instructions cannot be divided into smaller steps.
  • 11. 11 A sequence of instructions might perform a specific task, for example print out a message. If two processes were to print messages, using interleaved instructions, the messages could appear garbled - the individual characters of each message could be interleaved if special care is not taken in coding the print routines.
  • 12. 12 Thread-Safe Routines Thread safe if they can be called from multiple threads simultaneously and always produce correct results. Standard I/O thread safe (prints messages without interleaving the characters). System routines that return time may not be thread safe. Routines that access shared data may require special care to be made thread safe.
  • 13. 8a-13 Re-ordering code also done intentionally by: • Compilers to optimize performance • Processors internally, again to optimize performance Compilers will re-order code prior to execution while processors will re-order the code during execution. In both cases the objective is to best utilize the available computer resources and minimize any waiting time.
  • 14. 14 Compiler/Processor Optimizations Compiler and processor reorder instructions to improve performance (execution speed). Example Suppose one had the code: . . a = b + 5; x = y * 4; p = x + 9; . . and the processor can perform, as is usual, multiple arithmetic operations at the same time.
  • 15. 15 Compiler/Processor Optimizations Can reorder to: . . x = y * 4; a = b + 5; p = x + 9; . . and still be logically correct. This gives the multiply operation longer time to complete before the result (x) is needed in the last statement. Very common for processors to execute machines instructions out of order for increased speed.
  • 16. 16 Accessing Shared Data Accessing shared data needs careful control. Consider two processes each of which is to add one to a shared data item, x. Location x is read, x + 1 computed, and the result written back to the location:
  • 17. 17 Conflict in accessing shared variable
  • 18. 18 Critical Section A mechanism for ensuring that only one process accesses a particular resource at a time. critical section – a section of code for accessing resource Arrange that only one such critical section is executed at a time. This mechanism is known as mutual exclusion. Concept also appears in an operating systems.
  • 19. 19 Locks Simplest mechanism for ensuring mutual exclusion of critical sections. A lock - a 1-bit variable that is a 1 to indicate that a process has entered the critical section and a 0 to indicate that no process is in the critical section. Operates much like that of a door lock: A process coming to the “door” of a critical section and finding it open may enter the critical section, locking the door behind it to prevent other processes from entering. Once the process has finished the critical section, it unlocks the door and leaves.
  • 20. 20 Control of critical sections through busy waiting
  • 21. 21 Critical Sections Serializing Code High performance programs should have as few as possible critical sections as their use can serialize the code. Suppose, all processes happen to come to their critical section together. They will execute their critical sections one after the other. In that situation, the execution time becomes almost that of a single processor.
  • 23. 23 Deadlock Can occur with two processes when one requires a resource held by the other, and this process requires a resource held by the first process.
  • 24. 24 Deadlock (deadly embrace) Deadlock can also occur in a circular fashion with several processes having a resource wanted by another.
  • 25. 25 Program example To sum the elements of an array, a[1000]: int sum, a[1000]; sum = 0; for (i = 0; i < 1000; i++) sum = sum + a[i];
  • 26. 26 UNIX Processes Calculation will be divided into two parts, one doing even i and one doing odd i; i.e., Process 1 Process 2 sum1 = 0; sum2 = 0; for (i = 0; i < 1000; i = i + 2) for (i = 1; i < 1000; i = i + 2) sum1 = sum1 + a[i]; sum2 = sum2 + a[i]; Each process will add its result (sum1 or sum2) to an accumulating result, sum : sum = sum + sum1; sum = sum + sum2; Sum will need to be shared and protected by a lock. Shared data structure is created:
  • 27. 27 ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010, Oct 27, 2010 Programming with Shared Memory Introduction to OpenMP
  • 28. 28 OpenMP An accepted standard developed in the late 1990s by a group of industry specialists. Consists of a small set of compiler directives, augmented with a small set of library routines and environment variables using the base language Fortran and C/C++. Several OpenMP compilers available.
  • 30. 30 OpenMP • Uses a thread-based shared memory programming model • OpenMP programs will create multiple threads • All threads have access to global memory • Data can be shared among all threads or private to one thread • Synchronization occurs but often implicit
  • 31. 31 OpenMP uses “fork-join” model but thread-based. Initially, a single thread executed by a master thread. parallel directive creates a team of threads with a specified block of code executed by the multiple threads in parallel. The exact number of threads in the team determined by one of several ways. Other directives used within a parallel construct to specify parallel for loops and different blocks of code for threads.
  • 32. 32 parallel region Multiple threads parallel region Master thread Fork/join model Synchronization
  • 33. 33 For C/C++, the OpenMP directives contained in #pragma statements. Format: #pragma omp directive_name ... where omp is an OpenMP keyword. May be additional parameters (clauses) after directive name for different options. Some directives require code to specified in a structured block that follows directive and then directive and structured block form a “construct”.
  • 34. 34 Parallel Directive #pragma omp parallel structured_block creates multiple threads, each one executing the specified structured_block, (a single statement or a compound statement created with { ...} with a single entry point and a single exit point.) Implicit barrier at end of construct. Directive corresponds to forall construct.
  • 35. 35 Hello world example #pragma omp parallel { printf("Hello World from thread %d of %dn", omp_get_thread_num(), omp_get_num_threads() ); } Output from an 8-processor/core machine: Hello World from thread 0 of 8 Hello World from thread 4 of 8 Hello World from thread 3 of 8 Hello World from thread 2 of 8 Hello World from thread 7 of 8 Hello World from thread 1 of 8 Hello World from thread 6 of 8 Hello World from thread 5 of 8 OpenMP directive for a parallel region Opening brace must be on a new line
  • 36. 36 Private and shared variables Variables could be declared within each parallel region but OpenMP provides private clause. int myid; … #pragma omp parallel private(myid) { myid = omp_get_thread_num(); printf("Hello World from thread = %dn", myid); } Each thread has a local variable tid Also a shared clause available.
  • 37. 37 Example #pragma omp parallel private(myid, threads) { myid = omp_get_thread_num(); // myid is private threads = omp_get_num_threads(); // threads is private arr[myid] = 10 * threads; // arr is global }
  • 38. 38 Number of threads in a team Established by either: 1. num_threads clause after the parallel directive, or 2. omp_set_num_threads() library routine being previously called, or 3. Environment variable OMP_NUM_THREADS is defined in order given or is system dependent if none of above. Number of threads available can also be altered dynamically to achieve best use of system resources.
  • 39. 39 Work-Sharing Three constructs in this classification: sections for single In all cases, there is an implicit barrier at end of construct unless a nowait clause included, which overrides the barrier. Note: These constructs do not start a new team of threads. That done by an enclosing parallel construct.
  • 40. 40 Sections The construct #pragma omp sections { #pragma omp section structured_block . . . #pragma omp section structured_block } cause structured blocks to be shared among threads in team. The first section directive optional. Blocks executed by available threads
  • 41. 41 Example #pragma omp parallel shared(a,b,c,d,nthreads) private(I,tid) { tid = omp_get_thread_num(); #pragma omp sections nowait { #pragma omp section { printf("Thread %d doing section 1n",tid); for (i=0; i<N; i++) { c[i] = a[i] + b[i]; printf("Thread %d: c[%d]= %fn",tid,i,c[i]); } } #pragma omp section { printf("Thread %d doing section 2n",tid); for (i=0; i<N; i++) { d[i] = a[i] * b[i]; printf("Thread %d: d[%d]= %fn",tid,i,d[i]); } } } /* end of sections */ } /* end of parallel section */ One thread does this Another thread does this
  • 42. 42 For Loop #pragma omp for for ( i = 0; …. ) causes for loop to be divided into parts and parts shared among threads in the team. for loop must be of a simple form. Loops divided can be specified by additional “schedule” clause. For loop of a simple form
  • 43. 43 Loop Scheduling and Partitioning • Static #pragma omp parallel for schedule (static,chunk_size) Partitions loop iterations into equal sized chunks specified by chunk_size. Chunks assigned to threads in round robin fashion. • Dynamic #pragma omp parallel for schedule (dynamic,chunk_size) Uses internal work queue. Chunk-sized block of loop assigned to threads as they become available.
  • 44. 44 Example #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid) { tid = omp_get_thread_num(); if (tid == 0) { nthreads = omp_get_num_threads(); printf("Number of threads = %d\n", nthreads); } printf("Thread %d starting...\n",tid); #pragma omp for schedule(dynamic,chunk) for (i=0; i<N; i++) { c[i] = a[i] + b[i]; printf("Thread %d: c[%d]= %f\n",tid,i,c[i]); } } /* end of parallel section */ The block guarded by if (tid == 0) is executed by one thread only; the for loop iterations are shared among the team.
  • 45. 45 Single The directive #pragma omp single structured_block causes the structured block to be executed by one thread only.
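  A short sketch (read_input() and compute() are placeholder routines):
    #pragma omp parallel
    {
      #pragma omp single
      read_input();                /* executed by exactly one (unspecified) thread */
      /* implicit barrier here unless nowait is given, so all threads see the input */
      compute(omp_get_thread_num());
    }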
  • 46. 46 Combined Parallel Work-sharing Constructs If a parallel directive is followed by a single for directive, it can be combined into: #pragma omp parallel for <for loop> with similar effects.
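  For example, the two forms below have the same effect (a sketch; the names are placeholders):
    #pragma omp parallel
    {
      #pragma omp for
      for (i = 0; i < N; i++) c[i] = a[i] + b[i];
    }
    /* combined, equivalent form */
    #pragma omp parallel for
    for (i = 0; i < N; i++) c[i] = a[i] + b[i];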
  • 47. 47 If a parallel directive is followed by a single sections directive, it can be combined into #pragma omp parallel sections { #pragma omp section structured_block #pragma omp section structured_block . . . } with similar effect. (In both cases, the nowait clause is not allowed.)
  • 48. 48 Master Directive The master directive: #pragma omp master structured_block causes the master thread to execute the structured block. It differs from the work-sharing constructs in that there is no implied barrier at the end of the construct (nor at the beginning). Other threads encountering this directive ignore it and the associated structured block, and move on.
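  A minimal sketch (do_work() is a placeholder routine):
    #pragma omp parallel
    {
      do_work(omp_get_thread_num());
      #pragma omp master
      printf("progress report printed by the master thread only\n");
      /* no implied barrier: the other threads do not wait here */
      do_work(omp_get_thread_num());
    }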
  • 49. 49 • Guided #pragma omp parallel for schedule (guided,chunk_size) Similar to dynamic, but the chunk size starts large and gets smaller, to reduce the number of times threads have to go to the work queue: chunk size = (number of iterations remaining) / (2 * number of threads) • Runtime #pragma omp parallel for schedule (runtime) Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used.
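  A sketch of runtime scheduling: the schedule is chosen by the user when the program is run, e.g. after export OMP_SCHEDULE="guided,4" in the shell (heavy_work() is a placeholder routine):
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++)
      c[i] = heavy_work(i);    /* per-iteration cost may vary widely */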
  • 50. 50 Reduction clause Used to combine the results of the iterations into a single value, c.f. MPI_Reduce(). Can be used with parallel, for, and sections. Example sum = 0; #pragma omp parallel for reduction(+:sum) for (k = 0; k < 100; k++ ) { sum = sum + funct(k); } A private copy of sum is created for each thread by the compiler. Each private copy is added to sum at the end. This eliminates the need for a critical section here. In reduction(+:sum), + is the operation and sum is the variable.
  • 51. 51 Private variables private clause – creates private copies of variables for each thread firstprivate clause - as private clause but initializes each copy to the values given immediately prior to parallel construct. lastprivate clause – as private but “the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable’s original object.”
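  A small sketch of firstprivate and lastprivate (the variable names and arrays a, n are arbitrary placeholders):
    int x = 10, last;
    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < n; i++) {
      last = a[i] + x;     /* each thread has private copies; x starts at 10 in every thread */
      a[i] = last;
    }
    /* after the loop, the original last holds the value computed in the sequentially last iteration (i = n-1) */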
  • 52. 52 Synchronization Constructs Critical The critical directive allows only one thread at a time to execute the associated structured block. When one or more threads reach the critical directive: #pragma omp critical name structured_block they will wait until no other thread is executing the same critical section (one with the same name), and then one thread will proceed to execute the structured block. name is optional. All critical sections with no name map to the same unnamed critical section.
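  A minimal sketch:
    int count = 0;
    #pragma omp parallel shared(count)
    {
      #pragma omp critical
      count = count + 1;       /* only one thread at a time executes this update */
    }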
  • 53. 53 Barrier When a thread reaches the barrier #pragma omp barrier it waits until all threads have reached the barrier and then they all proceed together. There are restrictions on the placement of the barrier directive in a program. In particular, all threads must be able to reach the barrier.
  • 54. 54 Atomic The atomic directive #pragma omp atomic expression_statement implements a critical section efficiently when the critical section simply updates a variable (adds one, subtracts one, or does some other simple arithmetic operation as defined by expression_statement).
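  A minimal sketch; atomic protects just the single update, which is typically cheaper than a critical section (hist and a are placeholder arrays):
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
      #pragma omp atomic
      hist[a[i]]++;            /* concurrent increments of the same bin are not lost */
    }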
  • 55. 55 Flush A synchronization point which causes a thread to have a "consistent" view of certain or all shared variables in memory. All current read and write operations on the variables are allowed to complete and values are written back to memory, but any memory operations in code after the flush are not started until after it. Format: #pragma omp flush (variable_list) Applies only to the thread executing the flush, not to all threads in the team. A flush occurs automatically at entry to and exit from parallel and critical directives, and at the exit of for, sections, and single (if a nowait clause is not present).
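  A classic producer/consumer sketch using flush to pass a value through shared memory, assuming a team of two threads with ids 0 and 1 (modern OpenMP code would normally use atomics instead):
    int data = 0, flag = 0;
    #pragma omp parallel num_threads(2) shared(data, flag)
    {
      if (omp_get_thread_num() == 0) {          /* producer */
        data = 42;
        #pragma omp flush(data)                 /* make data visible before the flag */
        flag = 1;
        #pragma omp flush(flag)
      } else {                                  /* consumer */
        do {
          #pragma omp flush(flag)
        } while (flag == 0);                    /* spin until the flag is seen */
        #pragma omp flush(data)
        printf("received %d\n", data);
      }
    }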
  • 56. 56 Ordered clause Used in conjunction with for and parallel for directives to cause an iteration to be executed in the order in which it would have occurred had the loop been written as a sequential loop.
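  A sketch: the ordered clause on the loop, together with an ordered region inside it, makes the printing occur in loop order even though the work runs in parallel (work() is a placeholder routine):
    #pragma omp parallel for ordered
    for (i = 0; i < N; i++) {
      int v = work(i);          /* executed in parallel */
      #pragma omp ordered
      printf("%d: %d\n", i, v); /* executed in sequential loop order */
    }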
  • 57. 57 More information Full information on OpenMP at http://openmp.org/wp/
  • 58. slides 8d-58 Programming with Shared Memory Specifying parallelism Performance issues ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010.
  • 59. 59 We have seen OpenMP for specifying parallelism. The programmer decides which parts of the code should be parallelized and inserts compiler directives (pragmas). Whatever programming environment is used in which the programmer explicitly says what should be done in parallel, the issue for the programmer is deciding what can be done in parallel. Let us use generic language constructs for parallelism.
  • 60. 60 par Construct For specifying concurrent statements: par { S1; S2; . . Sn; } Says one can execute all statements S1 to Sn simultaneously if resources are available, or execute them in any order, and still get the correct result.
  • 61. 61 forall Construct To start multiple similar processes together: forall (i = 0; i < n; i++) { S1; S2; . . Sm; } Says each iteration of the body can be executed simultaneously if resources are available, or in any order, and still get the correct result. The statements of each instance of the body are executed in the order given. Each instance of the body uses a different value of i.
  • 62. 62 Example forall (i = 0; i < 5; i++) a[i] = 0; clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
  • 63. 63 Dependency Analysis To identify which processes could be executed together. Example Can see immediately in the code forall (i = 0; i < 5; i++) a[i] = 0; that every instance of the body is independent of other instances and all instances can be executed simultaneously. However, it may not be that obvious. Need algorithmic way of recognizing dependencies, for a parallelizing compiler.
  • 64. 64 Bernstein's Conditions Two processes (or statements) P1 and P2, with input sets I1, I2 (variables read) and output sets O1, O2 (variables written), can be executed concurrently if: I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅.
  • 65. 65 If all three conditions hold, neither process reads a value the other writes and the two do not write to the same locations, so executing them in either order or simultaneously gives the same result.
  • 66. 66 Can use Bernstein's conditions at: • Machine instruction level inside processor – have logic to detect if conditions satisfied (see computer architecture course) • At the process level to detect whether two processes can be executed simultaneously (using the inputs and outputs of processes). • Can be extended to more than two processes but number of conditions rises – need every input/output combination checked. For three statements, how many conditions need to be checked?
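  As a small illustration (not from the original slides): for S1: a = x + y; and S2: b = x + z; the input and output sets are I1 = {x, y}, O1 = {a}, I2 = {x, z}, O2 = {b}. All three intersections I1 ∩ O2, I2 ∩ O1 and O1 ∩ O2 are empty, so S1 and S2 can be executed concurrently; replacing S2 with a = x + z would make O1 ∩ O2 = {a} and rule this out.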
  • 68. 68 Performance issues with Threads Program might actually go slower when parallelized. Too many threads can significantly reduce the program performance.
  • 69. 69 Reasons: • Work split among too many threads gives each thread too little work, so the overhead of starting and terminating threads swamps the useful work. • Too many concurrent threads incurs overhead from having to share fixed hardware resources • OS typically schedules threads in round robin with a time-slice • Time-slicing incurs overhead • Saving registers, effects on cache memory, virtual memory management … . • Waiting to acquire a lock. • When a thread is suspended while holding a lock, all threads waiting for the lock will have to wait for the thread to restart. Source: Multi-core programming by S. Akhter and J. Roberts, Intel Press.
  • 70. 70 Some Strategies • Limit the number of runnable threads to the number of hardware threads. (See later: we do not do this with GPUs.) • For an n-core machine (not hyper-threaded) have n runnable threads. • If hyper-threaded (with 2 virtual threads per core), double this. • Can have more threads in total but others may be blocked. • Separate I/O threads from compute threads • I/O threads wait for external events • Never hard-code the number of threads – leave it as a tuning parameter.
  • 71. 71 • Let OpenMP optimize the number of threads • Implement a thread pool • Implement a work-stealing approach in which each thread has a work queue; threads with no work take work from other threads' queues.
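  A minimal sketch of leaving the thread count as a tuning parameter rather than hard-coding it; the environment variable name NTHREADS is an arbitrary choice for illustration:
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
      char *s = getenv("NTHREADS");       /* hypothetical tuning variable */
      if (s != NULL)
        omp_set_num_threads(atoi(s));     /* otherwise let OpenMP pick a default */
      #pragma omp parallel
      {
        /* ... compute ... */
      }
      return 0;
    }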
  • 72. 72 Critical Sections Serializing Code High-performance programs should have as few critical sections as possible, as their use can serialize the code. Suppose all processes happen to reach their critical sections together. They will execute their critical sections one after the other. In that situation, the execution time becomes almost that of a single processor.
  • 74. 74 Shared Data in Systems with Caches All modern computer systems have cache memory, high-speed memory closely attached to each processor for holding recently referenced data and code. (Figure: each processor has its own cache memory, all connected to a shared main memory.)
  • 75. 75 Cache coherence protocols Update policy - copies of data in all caches are updated at the time one copy is altered, or Invalidate policy - when one copy of data is altered, the same data in any other cache is invalidated (by resetting a valid bit in the cache). These copies are only updated when the associated processor makes a reference to them.
  • 76. 76 False Sharing Different parts of a block are required by different processors, but not the same bytes. If one processor writes to one part of the block, copies of the complete block in other caches must be updated or invalidated even though the actual data is not shared.
  • 77. 77 Solution for False Sharing The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
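  A sketch of padding per-thread counters so each sits in its own cache block; the 64-byte line size is an assumption and is machine dependent:
    #include <omp.h>
    #define CACHE_LINE 64
    #define NUM_THREADS 8

    struct padded { long count; char pad[CACHE_LINE - sizeof(long)]; };
    struct padded counters[NUM_THREADS];   /* one counter per cache block */

    void count_events(long iters) {
      #pragma omp parallel num_threads(NUM_THREADS)
      {
        int id = omp_get_thread_num();
        for (long i = 0; i < iters; i++)
          counters[id].count++;            /* no false sharing between threads */
      }
    }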
  • 78. 78 Sequential Consistency Formally defined by Lamport (1979): A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program. i.e., the overall effect of a parallel program is not changed by any arbitrary interleaving of instruction execution in time.
  • 80. 80 Writing a parallel program for a system which is known to be sequentially consistent enables us to reason about the result of the program.
  • 81. 81 Example Process P1: data = new; flag = TRUE; Process P2: while (flag != TRUE) { }; data_copy = data; We expect data_copy to be set to new because we expect data = new to be executed before flag = TRUE, and while (flag != TRUE) { } to be executed before data_copy = data. This ensures that process P2 reads the new data from process P1. Process P2 will simply wait for the new data to be produced.
  • 82. 82 Program Order Sequential consistency refers to "operations of each individual processor … occur in the order specified in its program", or program order. In the previous example, this order is that of the stored machine instructions to be executed.
  • 83. 83 Compiler Optimizations The order of execution is not necessarily the same as the order of the corresponding high-level statements in the source program, as a compiler may reorder statements for improved performance. In this case, the term program order will depend upon context: either the order in the source program or the order in the compiled machine instructions.
  • 84. 84 High Performance Processors Modern processors also usually reorder machine instructions internally during execution for increased performance. This does not prevent a multiprocessor from being sequentially consistent, provided the processor only produces final results in program order (that is, retires values to registers in program order, which most processors do). All multiprocessors will have the option of operating under the sequential consistency model. However, it can severely limit compiler optimizations and processor performance.
  • 85. 85 Example of Processor Re-ordering Process P1: new = a * b; data = new; flag = TRUE; Process P2: while (flag != TRUE) { }; data_copy = data; The multiply machine instruction corresponding to new = a * b is issued for execution. The next instruction, corresponding to data = new, cannot be issued until the multiply has produced its result. However, the following statement, flag = TRUE, is completely independent, and a clever processor could start this operation before the multiply has completed, leading to the sequence:
  • 86. 86 Process P1: new = a * b; flag = TRUE; data = new; Process P2: while (flag != TRUE) { }; data_copy = data; Now the while statement might complete before new is assigned to data, and the code would fail. All multiprocessors have the option of operating under the sequential consistency model, i.e. not reordering instructions, forcing the multiply instruction above to complete before starting subsequent instructions that depend upon its result.
  • 87. 87 Relaxing Read/Write Orders Processors may relax the consistency in terms of the order of reads and writes of one processor with respect to those of another processor to obtain higher performance, and provide instructions to enforce consistency when needed.
  • 88. 88 Examples Alpha processors Memory barrier (MB) instruction - waits for all previously issued memory access instructions to complete before issuing any new memory operations. Write memory barrier (WMB) instruction - as MB but only on memory write operations, i.e. waits for all previously issued memory write access instructions to complete before issuing any new memory write operations - which means memory reads could be issued after a memory write operation, overtake it, and complete before the write operation.
  • 89. 89 SUN Sparc V9 processors Memory barrier (MEMBAR) instruction with four bits for variations. Write-to-read bit: prevents any reads that follow it being issued before all writes that precede it have completed. Others: write-to-write, read-to-read, read-to-write. IBM PowerPC processor SYNC instruction - similar to the Alpha MB instruction (check differences).
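  These machine-level barriers correspond to the portable fences now available in C11; a sketch of the earlier flag example written with release/acquire fences (not from the original slides):
    #include <stdatomic.h>

    int data;
    atomic_int flag = 0;

    void producer(void) {
      data = 42;
      atomic_thread_fence(memory_order_release);             /* like a write barrier */
      atomic_store_explicit(&flag, 1, memory_order_relaxed);
    }

    void consumer(void) {
      while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                                     /* spin until flag is seen */
      atomic_thread_fence(memory_order_acquire);              /* like a read barrier */
      /* data is now guaranteed to be 42 here */
    }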