2. Scope
This presentation covers the main features of the
Fermi, Fermi-refresh & Kepler architectures
The overview is from a compute perspective,
and as such graphics features are not discussed
(PolyMorph Engine, Raster Engine, ROPs, etc.)
5. GF100 SM
SM - Streaming Multiprocessor
32 “CUDA cores”, organized into two clusters, 16 cores each
A warp is 32 threads – a 16-core cluster needs two ALU cycles per warp
NVIDIA's solution: the ALU clock is double the core clock
4 SFUs (accelerate transcendental functions)
16 Load / Store units
Dual Warp scheduler – execute two warps concurrently
Note the bottlenecks on the LD/ST & SFU units – an architectural decision
Each SM can hold up to 48 warps, divided into up to 8 blocks
Hold “in-flight” warps to hide latency
Typically, the number of blocks is lower;
for example, 24 warps per block = 2 blocks per SM
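The warp-timing and residency arithmetic above can be sketched in a few lines of Python. This is a hedged back-of-the-envelope check, not NVIDIA's occupancy model: the helper name is mine, and it considers only the warp- and block-count caps from the slide (real occupancy is also limited by registers and shared memory).

```python
# Back-of-the-envelope GF100 SM numbers, taken from the slides:
# 32 cores in two 16-wide clusters, ALU clock at 2x core clock,
# up to 48 resident warps in up to 8 resident blocks.

WARP_SIZE = 32
CLUSTER_WIDTH = 16
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8

# A 32-thread warp on a 16-wide cluster needs two ALU cycles;
# with the ALU clocked at 2x the core clock, that is one core cycle.
alu_cycles_per_warp = WARP_SIZE // CLUSTER_WIDTH      # -> 2
core_cycles_per_warp = alu_cycles_per_warp / 2        # -> 1.0

def resident_blocks(threads_per_block):
    """Blocks fitting on one SM, warp/block caps only (hypothetical
    helper; ignores the register and shared-memory limits)."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceil division
    return min(MAX_WARPS_PER_SM // warps_per_block, MAX_BLOCKS_PER_SM)

# The slide's example: 24 warps per block -> 2 blocks per SM.
print(resident_blocks(24 * WARP_SIZE))   # 2
print(resident_blocks(256))              # 8 warps/block -> 6 blocks
```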
6. Packing it all together
GPC – Graphics Processing Cluster
Four SMs
Transparent to compute usage
7. Packing it all together
Four GPCs
768 KB L2 cache shared between all SMs
Supports L2-only or L1&L2 caching
384-bit GDDR5
GigaThread Scheduler
Schedule thread blocks to SMs
Concurrent Kernel Execution – separate kernels per SM
9. Fermi GF104 SM
Changes from GF100 SM:
48 “CUDA cores”, organized into three clusters of 16 cores each
8 SFUs instead of 4
The rest remains the same (32K 32-bit registers, 64K L1/Shared, etc.)
Wait a sec… three clusters, but still scheduling only two warps?
An under-utilization study of GF100 led to a scheduling redesign –
next slide…
10. Instruction Level Parallelism (ILP)
GF100:
Two warp schedulers feed two clusters of cores
Memory access or SFU access leads to under-utilization of a core cluster
GF104:
Adopts the ILP idea from the CPU world – issues two instructions per clock
Adds a third cluster for balanced utilization
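The scheduling change above boils down to issue bandwidth. A minimal sketch of that arithmetic, assuming the slides' figures (GF100: two single-issue schedulers for two core clusters; GF104: two dual-issue schedulers for three clusters) – the "issue slots vs. execution pipes" framing and the function name are mine:

```python
# Issue slots available per clock on one SM.
def issue_slots(schedulers, instructions_per_clock):
    return schedulers * instructions_per_clock

gf100 = issue_slots(2, 1)   # 2 instructions/clock for 2 core clusters
gf104 = issue_slots(2, 2)   # 4 instructions/clock for 3 core clusters

# On GF100, any cycle spent issuing to LD/ST or SFU starves a core
# cluster. On GF104, 4 issue slots can keep 3 core clusters busy and
# still feed LD/ST or SFU -- the extra cluster balances utilization.
print(gf100, gf104)   # 2 4
```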
12. Meet GK104 SMX
192 “CUDA Cores”
Organized into 6 clusters of 32 cores each
No more “dual-clocked ALU”
16 Load/Store units
16 SFUs
64K 32-bit registers
Same 64K L1/Shared
Same dual-issue warp scheduling, now with four schedulers:
Execute 4 warps concurrently
Issue two instructions per cycle
Each SMX can hold up to 64 warps,
divided into up to 16 blocks
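The same residency arithmetic for an SMX, using the caps on this slide (64 warps, 16 blocks). As before, this is a hedged sketch with a made-up helper name, ignoring the register and shared-memory limits that also cap occupancy:

```python
# GK104 SMX residency caps, per the slide.
WARP_SIZE = 32
MAX_WARPS_PER_SMX = 64
MAX_BLOCKS_PER_SMX = 16

def resident_blocks_smx(threads_per_block):
    """Blocks fitting on one SMX, warp/block caps only."""
    warps = -(-threads_per_block // WARP_SIZE)   # ceil division
    return min(MAX_WARPS_PER_SMX // warps, MAX_BLOCKS_PER_SMX)

print(resident_blocks_smx(128))   # 4 warps/block -> 16 blocks (block cap)
print(resident_blocks_smx(256))   # 8 warps/block -> 8 blocks (warp cap)
```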
13. From GF104 to GK104
Look at half of an SMX:
Same:
Two warp schedulers
Two dispatch units per scheduler
32K register file
6 rows of cores
1 row of load/store units
1 row of SFUs
Different:
On SMX, a row of cores is 16 wide vs. 8 on SM
On SMX, a row of SFUs is 16 wide vs. 8 on SM
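A quick arithmetic check of this comparison, using only the row counts and widths from the slide: the same 6 rows of cores, with each row twice as wide on the SMX.

```python
# Core counts implied by the slide's layout comparison.
ROWS_OF_CORES = 6
sm_cores = ROWS_OF_CORES * 8          # GF104 SM: 48 cores
half_smx_cores = ROWS_OF_CORES * 16   # half an SMX: 96 cores
smx_cores = 2 * half_smx_cores        # full GK104 SMX: 192 cores
print(sm_cores, half_smx_cores, smx_cores)   # 48 96 192
```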
14. Packing it all together
Four GPCs, each with two SMXs
512 KB L2 cache shared between the SMXs
L1 is no longer used for CUDA
256-bit GDDR5
GigaThread Scheduler
Dynamic Parallelism
15. GK104 vs. GF104
Kepler has fewer “multiprocessors”
(8 vs. 16)
Less flexible in executing different kernels concurrently
Each “multiprocessor” is stronger:
Issues twice the warps (6 vs. 3)
Twice the register file
Executes a warp in a single cycle
More SFUs
10x faster atomic operations
But:
SMX holds 64 warps vs. 48 for SM – less latency hiding per core cluster
L1/Shared memory stayed the same size – and is totally bypassed in CUDA/OpenCL
Memory BW did not scale as compute/cores did (192GB/Sec, same as in GF110)
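The last point can be made concrete as bytes of bandwidth per peak FLOP. A hedged sketch: core counts and the 192 GB/s figure are from these slides; the shader clocks (1.544 GHz for a reference GTX 580/GF110, 1.006 GHz for a reference GTX 680/GK104) and the 2-FLOPs-per-cycle FMA assumption are mine.

```python
# Peak single-precision throughput, assuming 2 FLOPs/cycle/core (FMA).
def peak_gflops(cores, ghz):
    return cores * ghz * 2

def bytes_per_flop(bw_gb_s, cores, ghz):
    return bw_gb_s / peak_gflops(cores, ghz)

gf110 = bytes_per_flop(192, 512, 1.544)    # ~0.12 B/FLOP
gk104 = bytes_per_flop(192, 1536, 1.006)   # ~0.06 B/FLOP

# Same 192 GB/s now feeds roughly twice the FLOPS, so each FLOP
# gets about half the bytes -- kernels become more bandwidth-bound.
print(round(gf110, 3), round(gk104, 3))   # 0.121 0.062
```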
16. GK110 SMX
Tesla only (no GeForce version)
Very similar to GK104 SMX
Additional double-precision units, otherwise the same
18. Improved scheduling 1 – Hyper-Q
Scenario: multiple CPU processes send work to the GPU
On Fermi, time division between processes
On Kepler, simultaneous processing from multiple processes
19. Improved scheduling 2
A new age in GPU programmability:
moving from a master-slave pattern to a self-feeding GPU – kernels launching kernels (Dynamic Parallelism)