Suppose we have two vehicles: a car that can carry two
persons at an average speed of 200 km/h, and a bus that can
carry 40 persons at a speed of 50 km/h. Find the latency and
throughput if each is used to carry a group of people over a
distance of 4500 km.
• Car: latency = 4500 / 200 = 22.5 hours; throughput = 2 / 22.5 ≈ 0.089 persons/hour
• Bus: latency = 4500 / 50 = 90 hours; throughput = 40 / 90 ≈ 0.444 persons/hour
• We can use parallelism to increase throughput by using
a larger number of lower-clocked processing units (as
in the GPU), which is well suited for computation-intensive
applications (applications that need a large number of
calculations, such as image processing applications).
• We can also target latency-sensitive applications, which
are applications that need a fast response (e.g.
Windows applications with many parts running together
at the same time). In this case we use more powerful
processing units (such as an Intel Core i7).
• Throughput and latency are often contradictory goals: optimizing for one can hurt the other.
WHY USE PARALLEL
1) The Real World is Massively Parallel:
In the natural world, many complex, interrelated events are
happening at the same time, yet within a temporal sequence.
Compared to serial computing, parallel computing is much better
suited for modeling, simulating and understanding complex, real-world phenomena.
2)SAVE TIME AND/OR MONEY:
• In theory, throwing more resources at a task will shorten its time to
completion, with potential cost savings.
• Parallel computers can be built from cheap, commodity components.
3) SOLVE LARGER / MORE COMPLEX PROBLEMS:
• Many problems are so large and/or complex that it is impractical
or impossible to solve them on a single computer, especially
given limited computer memory.
• Example: Web search engines/databases processing millions of
transactions per second
4) TAKE ADVANTAGE OF NON-LOCAL RESOURCES:
• Using compute resources on a wide area network, or even the
Internet, when local compute resources are scarce or insufficient.
We face the following limitations when designing a parallel program:
1. Amdahl's law.
2. Complexity.
3. Portability.
4. Resource requirements.
5. Scalability.
6. Parallel slowdown.
Amdahl’s law provides an upper bound on the speedup that can be
obtained by a parallel program
Let us suppose that a processor needs n time units to complete a task on its own,
and we have a program with a parallelizable fraction p of code and an
un-parallelizable fraction 1 − p (also written s).
Then the processing time on N processors is:
T = n·s + n·p/N = n(1 − p) + n·p/N
And hence the speedup ratio is calculated as the ratio between the
single-processor time Ts and the N-processor time Tp:
Speed up = Ts / Tp = n / (n(1 − p) + n·p/N) = 1 / ((1 − p) + p/N)
Introducing the number of processors performing the parallel
fraction of work, the relationship can be modeled by:
Speed up = 1 / ((1 − p) + p/N)
N = number of processors, S = serial fraction (1 − p), and p = parallel fraction
The costs of complexity are measured in every aspect of the
software development cycle:
• Data dependency
• Race conditions
Results from multiple use of the same location(s) in storage by different tasks. For example, a loop-carried dependency:
for (int i = 1; i < 100; i++)
    a[i] = a[i - 1] + 1;  /* iteration i needs the result of iteration i-1 */
How to Handle Data Dependencies:
• Distributed memory architectures - communicate required data at synchronization points.
• Shared memory architectures - synchronize read/write operations between tasks.
Thread A Thread B
1A: Read variable V 1B: Read variable V
2A: Add 1 to variable V 2B: Add 1 to variable V
3A: Write back to variable V 3B: Write back to variable V
If instruction 1B is executed between 1A and 3A, or if
instruction 1A is executed between 1B and 3B, the program
will produce incorrect data. This is known as a race
condition.
• Our program often requires "serialization" of segments of the code.
• Synchronization is the solution for the above problems.
• To implement synchronization we use:
• Barriers.
• Locks.
• Synchronous communication operations.
Although there are several standardized APIs, such as MPI,
POSIX threads, and OpenMP, portability issues with parallel
programs are not as serious as in years past. However...
Operating systems can play a key role in code portability issues.
Hardware architectures are characteristically highly variable and
can affect portability.
• The primary intent of parallel programming is to decrease
execution wall clock time, however in order to accomplish
this, more CPU time is required. For example, a parallel code
that runs in 1 hour on 8 processors actually uses 8 hours of CPU time.
• The amount of memory required can be greater for parallel
codes than serial codes, due to the need to replicate data and
for overheads associated with parallel support libraries and subsystems.
• For short running parallel programs, there can actually be a
decrease in performance compared to a similar serial
implementation. The overhead costs associated with setting
up the parallel environment, task creation, communications
and task termination can comprise a significant portion of the
total execution time for short runs.
Two types of scaling based on time to solution:
• Strong scaling: The total problem size stays fixed as more
processors are added.
• Weak scaling: The problem size per processor stays fixed as
more processors are added.
Hardware factors play a significant role in scalability. Examples:
• Memory-CPU bus bandwidth
• Amount of memory available on any given machine or set of machines
• Processor clock speed
• Not all parallelization results in speed-up.
• Once a task is split up into multiple threads, those threads
may spend a large amount of time communicating with each
other, resulting in degradation of system performance.
• This is known as parallel slowdown.