A presentation that introduces the basic concepts of parallel computing and gives some details on General Purpose GPU computing using the CUDA architecture.
2. General Concepts
GPU Programming
CA Parallel implementation
What is parallel computing?
Simultaneous use of multiple computing resources to solve a single
computational problem.
The computing resources can be:
A single computer with multiple processors.
A number of computers connected to a network.
A combination of both.
Benefits of parallel computing:
The computational load is broken into discrete pieces of work
that can be processed simultaneously.
The total simulation time can be much shorter when multiple computing
resources are used.
Parallel Computing: Perspectives for more efficient hydrological modeling
3. Parallel Computer Classification
Flynn's taxonomy: a widely used classification.
◦ Classifies along two independent dimensions: Instruction and Data.
◦ Each dimension can have two possible states: Single or Multiple.
SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data
4. MIMD: Multiple Instruction, Multiple Data
The most common type of parallel computer (most modern computers fall into this category).
Consists of a collection of fully independent processing units or cores, each having its own control unit and its own ALU.
Execution can be synchronous or asynchronous, as the processors can operate at their own pace.
[Figure 2.3: A shared-memory system (labels: CPU, Interconnect, Memory).]
[Figure 2.4: A distributed-memory system (labels: CPU, Memory, Interconnect).]
5. Parallelism: An everyday example
Task parallelism: the ability to execute different tasks within a
problem at the same time.
Data parallelism: the ability to execute parts of the same task on
different data at the same time.
As an analogy, think about a farmer who hires workers to
pick apples from his trees:
Worker = hardware (processing element).
Trees = task.
Apples = data.
6. Parallelism: Sequential approach
The sequential approach would be to have one worker pick all of
the apples from each tree.
7. Parallelism: More workers
Data-parallel hardware: the workers work on the same tree, which allows
each task to be completed more quickly.
How many workers should there be per tree?
What if some trees have few apples, while others have many?
8. Parallelism: More workers
Task parallelism: each worker picks apples from a different tree.
Although each task takes the same time as in the sequential version,
many tasks are accomplished in parallel.
What if there are only a few densely populated trees?
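The two worker arrangements from the apple-picking analogy can be sketched in CUDA. This is only an illustration under stated assumptions, not code from the presentation: the kernel names, the helper `pickApples`, and the data layout (a flat `ripe` array of 0/1 flags plus `treeStart` offsets) are all hypothetical. In the first kernel each thread is a worker that handles a whole tree; in the second, all threads of a block share one tree.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical data layout: ripe[] holds one 0/1 flag per apple, and the apples
// of tree t occupy the index range [treeStart[t], treeStart[t + 1]).

// One worker per tree: each thread picks a whole tree on its own.
__global__ void oneWorkerPerTree(const int* ripe, const int* treeStart,
                                 int numTrees, int* picked)
{
    int tree = blockIdx.x * blockDim.x + threadIdx.x;
    if (tree >= numTrees) return;
    int count = 0;
    for (int a = treeStart[tree]; a < treeStart[tree + 1]; ++a)
        count += ripe[a];
    picked[tree] = count;
}

// Many workers per tree: the threads of block t share the apples of tree t.
__global__ void manyWorkersPerTree(const int* ripe, const int* treeStart,
                                   int* picked)
{
    int tree = blockIdx.x;
    int count = 0;
    for (int a = treeStart[tree] + threadIdx.x; a < treeStart[tree + 1];
         a += blockDim.x)
        count += ripe[a];
    atomicAdd(&picked[tree], count);  // combine the workers' partial counts
}

// Host helper: returns the number of ripe apples picked from each tree.
std::vector<int> pickApples(const std::vector<int>& ripe,
                            const std::vector<int>& treeStart,
                            bool shareTrees)
{
    int numTrees = static_cast<int>(treeStart.size()) - 1;
    if (numTrees <= 0) return std::vector<int>();
    std::vector<int> picked(numTrees, 0);

    int *dRipe, *dStart, *dPicked;
    cudaMalloc(&dRipe, ripe.size() * sizeof(int));
    cudaMalloc(&dStart, treeStart.size() * sizeof(int));
    cudaMalloc(&dPicked, numTrees * sizeof(int));
    cudaMemcpy(dRipe, ripe.data(), ripe.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dStart, treeStart.data(), treeStart.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(dPicked, 0, numTrees * sizeof(int));  // needed by atomicAdd

    const int threads = 64;
    if (shareTrees)
        manyWorkersPerTree<<<numTrees, threads>>>(dRipe, dStart, dPicked);
    else
        oneWorkerPerTree<<<(numTrees + threads - 1) / threads, threads>>>(
            dRipe, dStart, numTrees, dPicked);

    cudaMemcpy(picked.data(), dPicked, numTrees * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(dRipe);
    cudaFree(dStart);
    cudaFree(dPicked);
    return picked;
}
```

Both variants return the same counts. They differ in load balance: with one worker per tree, a few densely populated trees keep a few threads busy while the rest idle, whereas sharing a tree among a block spreads that work but leaves most threads idle on sparsely populated trees.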
9. Algorithm Decomposition
Most engineering problems are nontrivial, so it is crucial to
have more formal concepts for determining parallelism.
The concept of decomposition:
Task decomposition: dividing the algorithm into individual tasks
that are functionally independent.
◦ Tasks may have dependencies on other tasks: if the input of task B depends on the output of task A, then task B is dependent on task A.
◦ Tasks that don’t have dependencies (or whose dependencies are completed) can be executed at any time to achieve parallelism.
◦ Task dependency graphs are used to describe the relationships between tasks.
Data decomposition: dividing a data set into discrete chunks that
can be processed in parallel.
[Figure: two task dependency graphs. In the first, B is dependent on A. In the second, A and B are independent of each other, and C is dependent on A and B.]
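The dependency graph in which C depends on two independent tasks A and B can be expressed with CUDA streams and events. The sketch below is an assumption-laden illustration: the three kernels and the helper `runTaskGraph` are hypothetical, and the work they do is arbitrary. A and B are launched in separate streams so they may overlap, and C is held back until both have finished.

```cuda
#include <cuda_runtime.h>

// Hypothetical tasks, chosen only to make the dependencies visible.
__global__ void taskA(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = 2.0f * i;
}

__global__ void taskB(float* b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 1.0f;
}

// Task C reads the outputs of both A and B.
__global__ void taskC(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Runs A and B in separate streams, then C once both are done.
// Returns the last element of c, which equals 2 * (n - 1) + 1. Requires n >= 1.
float runTaskGraph(int n)
{
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);
    cudaEvent_t bDone;
    cudaEventCreate(&bDone);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // A and B are independent, so they go to different streams.
    taskA<<<blocks, threads, 0, streamA>>>(a, n);
    taskB<<<blocks, threads, 0, streamB>>>(b, n);
    cudaEventRecord(bDone, streamB);

    // C is queued behind A in streamA and additionally waits for B's event.
    cudaStreamWaitEvent(streamA, bDone, 0);
    taskC<<<blocks, threads, 0, streamA>>>(a, b, c, n);
    cudaStreamSynchronize(streamA);

    float last = 0.0f;
    cudaMemcpy(&last, c + (n - 1), sizeof(float), cudaMemcpyDeviceToHost);

    cudaEventDestroy(bDone);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return last;
}
```

Placing C in the same stream as A gives the A-to-C ordering for free, so only the B-to-C edge needs an explicit event.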
10. Why GPU Programming?
A quiet revolution and potential build-up:
◦ Calculation: TFLOPS vs. 100 GFLOPS.
◦ Memory bandwidth: ~10x.
◦ GPU in every PC: massive volume and potential impact.
[Figure 1.1: Enlarging performance gap between GPUs (many-core) and CPUs (multi-core). Courtesy: John Owens.]
Parallel programming is easier than ever because it can be done on
relatively low-end PCs.
Devices such as the Nvidia Tesla C1060 card, which is built on the GT200 GPU, contain 240
cores, each of which is highly multithreaded.
11. GPU vs CPU
GPU: few instructions but very fast execution. Uses very fast
GDDR3 RAM. Most die area is used for ALUs, and the caches are
relatively small.
CPU: many instructions but slower execution. Uses slower DDR2
or DDR3 RAM (but it has direct access to more memory than
GPUs). Most die area is used for memory cache, and there are
relatively few transistors for ALUs.
12. GPU is fast
13. CUDA: Compute Unified Device Architecture
CUDA program: consists of phases that are executed on either
the host (CPU) or a device (GPU).
No data parallelism = the code is executed on the host.
Data parallelism = the code is executed on the device.
Data-parallel portions of an application are expressed as device
kernels, which run on the device.
GPU kernels are written using the Single Program Multiple Data
(SPMD) programming model.
SPMD executes multiple instances of the same program
independently, where each program works on a different portion of
the data.
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions
threadID: 0 1 2 3 4 5 6 7
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
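The three-line fragment above can be completed into a full kernel and launch. This is a minimal sketch under assumptions: the slide does not define `func`, so squaring is used as a stand-in, and the names `applyFunc` and `applyFuncOnDevice` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Stand-in for the slide's func(); the real function is not given.
__device__ float func(float x) { return x * x; }

// Every thread runs this same code (SPMD) on its own element.
__global__ void applyFunc(const float* input, float* output, int n)
{
    // Global thread ID built from the block index and the index in the block.
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {  // control decision: surplus threads do nothing
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}

// Host side: copy the data in, launch enough threads to cover it, copy back.
std::vector<float> applyFuncOnDevice(const std::vector<float>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up
    applyFunc<<<blocks, threads>>>(dIn, dOut, n);

    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

The bounds check matters because the launch rounds the thread count up to a whole number of blocks, so some threads may have an ID past the end of the array.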
14. CUDA: Compute Unified Device Architecture
A CUDA kernel is executed by an array of threads.
Each thread has an ID, which is used to compute memory addresses and make
control decisions.
CUDA threads are organized into multiple blocks.
Threads within a block cooperate via shared memory, atomic operations
and barrier synchronization.
[Figure 2-1: Grid of Thread Blocks. A grid containing Block (0, 0) to Block (2, 1), with Block (1, 1) expanded into Thread (0, 0) to Thread (3, 2).]
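The three cooperation mechanisms named on this slide can be seen together in a block-wise sum. The sketch below is illustrative only: `blockSum`, `sumOnDevice` and the block size of 256 are assumptions. Threads of one block exchange values through shared memory, use `__syncthreads()` as the barrier, and one thread per block adds the block's result to the total with an atomic operation.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed block size; must be a power of two

__global__ void blockSum(const int* data, int n, int* total)
{
    __shared__ int partial[kBlockSize];  // visible to this block only

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? data[i] : 0;
    __syncthreads();  // barrier: all loads are done before anyone reads

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Blocks cannot share memory with each other, so they meet in global
    // memory through an atomic add.
    if (tid == 0)
        atomicAdd(total, partial[0]);
}

int sumOnDevice(const std::vector<int>& values)
{
    int n = static_cast<int>(values.size());
    if (n == 0) return 0;

    int *dData, *dTotal;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dData, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(int));

    int blocks = (n + kBlockSize - 1) / kBlockSize;
    blockSum<<<blocks, kBlockSize>>>(dData, n, dTotal);

    int total = 0;
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dTotal);
    return total;
}
```

The example uses a one-dimensional grid for brevity. With a two-dimensional grid like the one in Figure 2-1, the same pattern applies with `blockIdx.y` and `threadIdx.y` added to the index computation.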
15. CUDA memory types
Global memory: large, but with low bandwidth compared with on-chip memory.
Reads and writes are fastest when they are coalesced.
Texture memory: cache optimized for 2D spatial
patterns.
Constant memory: slow, but cached (8 KB cache).
Shared memory: fast, but it can be used only by the
threads of the same block.
Registers: 32768 32-bit
registers per multiprocessor.
[Figure 4-2: Hardware Model. A set of SIMT multiprocessors with on-chip shared memory. Labels: Device, Multiprocessor 1 to N, Processor 1 to M, Registers, Instruction Unit, Shared Memory, Constant Cache, Texture Cache, Device Memory.]
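As a small illustration of two of these memory types, the sketch below keeps a handful of coefficients in constant memory and reads the input from global memory in a coalesced pattern. The names `coeffs`, `evalPoly` and `evalPolyOnDevice`, and the choice of a quadratic polynomial, are assumptions made for the example.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kNumCoeffs = 3;

// Constant memory: read-only in kernels, cached, and set from the host.
__constant__ float coeffs[kNumCoeffs];

// Evaluates coeffs[0] + coeffs[1] * x + coeffs[2] * x^2 for every element.
__global__ void evalPoly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Consecutive threads read consecutive addresses of x, so the
        // global-memory accesses can be coalesced.
        float xi = x[i];
        // All threads read the same coefficients, which suits the constant cache.
        y[i] = coeffs[0] + xi * (coeffs[1] + xi * coeffs[2]);
    }
}

std::vector<float> evalPolyOnDevice(const std::vector<float>& x,
                                    const float (&c)[kNumCoeffs])
{
    int n = static_cast<int>(x.size());
    std::vector<float> y(n);
    if (n == 0) return y;

    // Copy the coefficients into the __constant__ symbol.
    cudaMemcpyToSymbol(coeffs, c, kNumCoeffs * sizeof(float));

    float *dX, *dY;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));
    cudaMemcpy(dX, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    evalPoly<<<(n + threads - 1) / threads, threads>>>(dX, dY, n);

    cudaMemcpy(y.data(), dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    return y;
}
```

Constant memory works best when every thread in a warp reads the same address, as here. Per-thread data that varies with the thread ID belongs in global or shared memory instead.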
16. CA Parallel implementation
A parallel version of the Cellular Automata algorithm for variably
saturated flow in soils was developed with the CUDA API.
The infiltration experiment of Vauclin et al. (1979) was chosen as a
benchmark test for the accuracy and the speed of the algorithm.
[Figure: water depth (m) versus distance (m) at t = 2 hrs, 3 hrs, 4 hrs and 8 hrs, compared with experimental data.]
17. Why is parallel code important?
In real-case scenarios, where the 3-D simulation of large areas is
needed, the grid sizes are very large.
In natural hazard assessment, the simulations must be fast in order
to be useful (the prediction must come before the actual event!).
Fast simulations make it easier to calibrate the model parameters
and to investigate the physical phenomena more efficiently.
The natural parallelism inherent in the CA concept makes the
parallel implementation of the algorithm easier.
18. Technical details
Difficulties
The most challenging issue was the irregular geometry of the
domain, which made it more difficult to exploit locality in
the thread computations and to use the shared memory.
The cell values were stored in a 1D array, and for each cell the
indexes of its neighboring cells were also stored.
Code structure
Simulation constants are stored in the constant memory.
Soil properties for each soil class are stored in the texture memory.
Atomic operations are used to check for convergence at
every iteration.
The shared memory is used to accelerate the atomic operations and
the block’s memory accesses.
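The storage scheme and convergence check described above might look roughly like the following. This is not the authors' code: the names `caStep` and `iterateUntilConverged`, the limit of four neighbors per cell, the use of -1 for a missing neighbor, and the update rule (averaging a cell with its neighbors) are all placeholders for illustration. The texture-memory lookup of soil properties is left out.

```cuda
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr int kMaxNeighbors = 4;  // assumption: at most 4 neighbors per cell
__constant__ float kTolerance;    // simulation constant in constant memory

// One step on an irregular domain. Cell values live in a 1D array, and
// neighbors[] holds kMaxNeighbors indexes per cell (-1 = no neighbor).
__global__ void caStep(const float* oldVal, float* newVal,
                       const int* neighbors, int numCells, int* notConverged)
{
    __shared__ int blockFlag;  // per-block convergence flag
    if (threadIdx.x == 0) blockFlag = 0;
    __syncthreads();

    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < numCells) {
        // Placeholder rule: average of the cell and its existing neighbors.
        float sum = oldVal[cell];
        int count = 1;
        for (int k = 0; k < kMaxNeighbors; ++k) {
            int nb = neighbors[cell * kMaxNeighbors + k];
            if (nb >= 0) { sum += oldVal[nb]; ++count; }
        }
        float updated = sum / count;
        newVal[cell] = updated;
        if (fabsf(updated - oldVal[cell]) > kTolerance)
            atomicOr(&blockFlag, 1);  // atomic on fast shared memory
    }
    __syncthreads();

    // At most one atomic on global memory per block.
    if (threadIdx.x == 0 && blockFlag)
        atomicOr(notConverged, 1);
}

// Repeats caStep until no cell changes by more than tolerance.
// Returns the number of steps taken; values holds the final state.
int iterateUntilConverged(std::vector<float>& values,
                          const std::vector<int>& neighbors,
                          float tolerance, int maxIters)
{
    int numCells = static_cast<int>(values.size());
    if (numCells == 0) return 0;

    float *dOld, *dNew;
    int *dNb, *dFlag;
    cudaMalloc(&dOld, numCells * sizeof(float));
    cudaMalloc(&dNew, numCells * sizeof(float));
    cudaMalloc(&dNb, neighbors.size() * sizeof(int));
    cudaMalloc(&dFlag, sizeof(int));
    cudaMemcpy(dOld, values.data(), numCells * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dNb, neighbors.data(), neighbors.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(kTolerance, &tolerance, sizeof(float));

    const int threads = 128;
    const int blocks = (numCells + threads - 1) / threads;
    int iters = 0, flag = 1;
    while (flag && iters < maxIters) {
        cudaMemset(dFlag, 0, sizeof(int));
        caStep<<<blocks, threads>>>(dOld, dNew, dNb, numCells, dFlag);
        cudaMemcpy(&flag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
        std::swap(dOld, dNew);  // the new state becomes the old one
        ++iters;
    }

    cudaMemcpy(values.data(), dOld, numCells * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dOld);
    cudaFree(dNew);
    cudaFree(dNb);
    cudaFree(dFlag);
    return iters;
}
```

Collecting the flag in shared memory first means that the many per-thread atomics stay on-chip and only one atomic per block reaches global memory, which is one plausible reading of how shared memory can accelerate the atomic operations.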
19. Results of the numerical tests
Nvidia Quadro 2000:
192 CUDA cores.
1 GB of GDDR5 memory.
[Figure: two plots against the number of cells (1,000 to 10,000,000). One shows speed (cells/sec) for the CPU and the GPU; the other shows the speed-up.]