The document discusses optimizations for memory and communication in massively parallel computing. It recommends caching data in faster shared memory to reduce loads and stores to global device memory. This can improve performance by avoiding non-coalesced global memory accesses. The document provides an example of coalescing writes for a matrix transpose by first loading data into shared memory and then writing columns of the tile to global memory in contiguous addresses.
16. Review
PC Architecture
[Diagram: PC architecture block diagram with bus bandwidths. Legible labels: 8 GB/s; 3+ Gb/s; 160+ GB/s to VRAM; 25+ GB/s]
modified from Matthew Bolitho
17. Review
The PCI-“not-so”-e Bus
• PCIe bus is slow
• Try to minimize/group transfers
• Use pinned memory on host whenever possible
• Try to perform copies asynchronously (e.g. Streams)
• Use “Zero-Copy” when appropriate
• Examples in the SDK (e.g. bandwidthTest)
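The advice above on pinned memory, grouped transfers, and asynchronous copies can be sketched as follows. This is a minimal illustration rather than the SDK's bandwidthTest: the kernel `scale2` and the helper `run_async_copy_demo` are hypothetical names introduced here, and the 256-thread block size is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element in place.
__global__ void scale2(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Hypothetical helper: returns 0 on success, nonzero on a CUDA error or a wrong result.
int run_async_copy_demo(int n) {
    const size_t bytes = (size_t)n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    cudaStream_t stream = nullptr;
    int bad = 0;

    // Pinned (page-locked) host memory lets the copy run asynchronously.
    bad |= cudaMallocHost((void **)&h, bytes) != cudaSuccess;
    bad |= cudaMalloc((void **)&d, bytes) != cudaSuccess;
    bad |= cudaStreamCreate(&stream) != cudaSuccess;

    if (!bad) {
        for (int i = 0; i < n; ++i) h[i] = (float)i;

        // One large transfer each way instead of many small ones,
        // queued with the kernel on the same stream.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        scale2<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

        // The host is free to do other work here; wait before reading h.
        bad |= cudaStreamSynchronize(stream) != cudaSuccess;
        for (int i = 0; i < n && !bad; ++i) bad |= h[i] != 2.0f * i;
    }

    if (stream) cudaStreamDestroy(stream);
    if (d) cudaFree(d);
    if (h) cudaFreeHost(h);
    return bad;
}
```

Calling `run_async_copy_demo(1 << 20)` from `main` exercises the whole path. With several streams, the copy for one chunk can overlap the kernel for another, which this single-stream sketch does not show.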
30. Memory Coalescing
GPU memory controller granularity is 64 or 128 bytes
Loads must also be 64 or 128 byte aligned
Suppose a thread loads a float (4 bytes)
Controller loads 64 bytes, throws 60 bytes away
31. Memory Coalescing
Memory controller is actually more intelligent
Consider half-warp (16 threads)
Suppose each thread reads consecutive float
Memory controller will perform one 64 byte load
This is known as coalescing
Make threads read consecutive locations
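To make "threads read consecutive locations" concrete, the sketch below contrasts a coalesced read pattern with a strided one. The kernel names, the helper `run_copy_check`, and the block size of 256 are assumptions made for illustration. The helper only checks correctness; the performance difference would have to be measured on the target GPU.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: thread k of a half-warp reads the k-th consecutive float.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read floats `stride` elements apart,
// so one half-warp touches many separate memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((long long)i * stride) % n];
}

// Hypothetical helper: returns 0 if both kernels produce the expected values.
int run_copy_check(int n, int stride) {
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    const size_t bytes = (size_t)n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return 1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int blocks = (n + 255) / 256;
    int bad = 0;

    copy_coalesced<<<blocks, 256>>>(d_in, d_out, n);
    bad |= cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) != cudaSuccess;
    for (int i = 0; i < n && !bad; ++i) bad |= h_out[i] != h_in[i];

    copy_strided<<<blocks, 256>>>(d_in, d_out, n, stride);
    bad |= cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) != cudaSuccess;
    for (int i = 0; i < n && !bad; ++i)
        bad |= h_out[i] != h_in[((long long)i * stride) % n];

    cudaFree(d_in);
    cudaFree(d_out);
    return bad;
}
```

Both kernels write `out[i]` contiguously; only the read pattern differs, which isolates the effect of non-coalesced loads when the two are timed.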
34. Memory Coalescing
GT200 has hardware coalescer
Inspects memory requests from each half-warp
Determines minimum set of transactions which are
64 or 128 bytes long
64 or 128 byte aligned
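A rough host-side model can help when reasoning about how many transactions an access pattern needs. The sketch below is a deliberate simplification, not the GT200 coalescer's actual algorithm: it only counts how many distinct aligned 64-byte segments one half-warp touches, and it assumes the array itself starts on a 64-byte boundary. Real hardware may serve two adjacent segments with a single 128-byte transaction. The function name `segments_touched` is hypothetical.

```cuda
#include <cstddef>
#include <set>

// Hypothetical helper (host code): number of distinct aligned 64-byte
// segments touched by one half-warp, where thread k reads the float at
// element index (first + k * stride).
int segments_touched(size_t first, size_t stride) {
    const size_t kSegmentBytes = 64;  // granularity quoted on the slides
    const int kHalfWarp = 16;         // threads inspected together

    std::set<size_t> segments;
    for (int k = 0; k < kHalfWarp; ++k) {
        size_t byte_addr = (first + k * stride) * sizeof(float);
        segments.insert(byte_addr / kSegmentBytes);
    }
    return (int)segments.size();
}
```

Under this model, consecutive aligned floats (`first = 0, stride = 1`) touch one segment, the same pattern shifted by one float touches two, and a stride of 16 floats touches sixteen.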
37. Coalescing Summary
Coalescing dramatically speeds global memory access
Strive for perfect coalescing:
Align starting address (may require padding)
A warp should access within a contiguous region
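One way to get aligned row starts for a 2D array is to let the runtime choose the padding with `cudaMallocPitch`. The sketch below assumes a row-major float matrix whose width is not a multiple of the half-warp size; `fill_rows` and `run_pitch_demo` are hypothetical names, and the 64-thread block is an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each row begins at base + y * pitch_bytes, so row starts stay aligned
// even when `width` is not a multiple of the half-warp size.
__global__ void fill_rows(float *base, size_t pitch_bytes, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y;
    if (x < width && y < height) {
        float *row = (float *)((char *)base + (size_t)y * pitch_bytes);
        row[x] = (float)(y * width + x);
    }
}

// Hypothetical helper: returns 0 if the padded array round-trips correctly.
int run_pitch_demo(int width, int height) {
    float *d = nullptr;
    size_t pitch = 0;  // padded row size in bytes, chosen by the runtime
    const size_t row_bytes = (size_t)width * sizeof(float);
    if (cudaMallocPitch((void **)&d, &pitch, row_bytes, height) != cudaSuccess) return 1;

    dim3 block(64), grid((width + 63) / 64, height);
    fill_rows<<<grid, block>>>(d, pitch, width, height);

    // cudaMemcpy2D skips the padding while copying back to a dense host array.
    std::vector<float> h((size_t)width * height);
    int bad = cudaMemcpy2D(h.data(), row_bytes, d, pitch, row_bytes, height,
                           cudaMemcpyDeviceToHost) != cudaSuccess;
    for (size_t i = 0; i < h.size() && !bad; ++i) bad |= h[i] != (float)i;

    cudaFree(d);
    return bad;
}
```

For example, `run_pitch_demo(100, 37)` allocates rows of 100 floats; the returned pitch is at least 400 bytes, and the extra bytes per row are the padding the slide refers to.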
40. Shared Memory
SMs can access global memory (gmem) at 80+ GiB/sec
but have hundreds of cycles of latency
Each SM has 16 KiB ‘shared’ memory
Essentially user-managed cache
Speed comparable to registers
Accessible to all threads in a block
Reduces load/stores to device memory
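The matrix transpose is a common illustration of shared memory as a user-managed cache: a block loads a tile with coalesced reads, then writes the tile's columns to contiguous global addresses. The sketch below is one possible version under stated assumptions: a 16x16 tile, a row-major float matrix, and hypothetical names `transpose_tiled` and `run_transpose_check`. The extra padding column in the tile is there to avoid shared memory bank conflicts.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16

// Transposes a row-major (rows x cols) matrix `in` into (cols x rows) `out`.
__global__ void transpose_tiled(const float *in, float *out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column of padding

    int x = blockIdx.x * TILE + threadIdx.x;  // column in `in`
    int y = blockIdx.y * TILE + threadIdx.y;  // row in `in`
    if (x < cols && y < rows) tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    // Swap the block indices so consecutive threads write consecutive
    // addresses in `out`, reading a column of the tile.
    x = blockIdx.y * TILE + threadIdx.x;  // column in `out`
    y = blockIdx.x * TILE + threadIdx.y;  // row in `out`
    if (x < rows && y < cols) out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical helper: returns 0 if out[c][r] == in[r][c] for every element.
int run_transpose_check(int rows, int cols) {
    const size_t count = (size_t)rows * cols, bytes = count * sizeof(float);
    std::vector<float> h_in(count), h_out(count);
    for (size_t i = 0; i < count; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return 1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, rows, cols);

    int bad = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) != cudaSuccess;
    for (int r = 0; r < rows && !bad; ++r)
        for (int c = 0; c < cols && !bad; ++c)
            bad |= h_out[(size_t)c * rows + r] != h_in[(size_t)r * cols + c];

    cudaFree(d_in);
    cudaFree(d_out);
    return bad;
}
```

A naive transpose would make either the reads or the writes strided in global memory. Staging through the tile moves the strided access into shared memory, where it is far cheaper.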
55. Shared Memory Banks
Shared memory divided into 16 ‘banks’
Shared memory is (almost) as fast as registers (...)
Exception is in case of bank conflicts
[Diagram: successive 4-byte words map to successive banks. Bank 0 holds words 0 and 16, Bank 1 holds words 1 and 17, and so on up to Bank 15, which holds words 15 and 31.]
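The bank layout above implies that word `w` lives in bank `w % 16`. A small host-side sketch can estimate how badly a strided access pattern conflicts. It is a simplified model, assuming 16 banks of 4-byte words, a half-warp of 16 threads, and a stride of at least 1; the name `conflict_degree` is hypothetical.

```cuda
// Hypothetical helper (host code): the largest number of threads in one
// half-warp that land on the same bank when thread k accesses the 4-byte
// word at index k * stride. A result of 1 means no bank conflict.
// Assumes stride >= 1.
int conflict_degree(int stride) {
    const int kBanks = 16;     // banks shown on the slide
    const int kHalfWarp = 16;  // threads that access shared memory together

    int hits[kBanks] = {0};
    int worst = 0;
    for (int k = 0; k < kHalfWarp; ++k) {
        int bank = (k * stride) % kBanks;  // successive words, successive banks
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

In this model a stride of 1 is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 16 sends all sixteen threads to one bank. A stride of 17 is conflict-free again, which is why padding a 16-wide shared array to 17 columns is a common remedy.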
89. Accelerating GPU computation through mixed-precision methods
Michael Clark
Harvard-Smithsonian Center for Astrophysics
Harvard University
SC’10
90. ... too much ?
[Word cloud: bank conflicts, coalescing, caching, mixed precision, partition camping, clamping, broadcasting, streams, zero-copy]