2. Scope
This presentation covers the main features of the
Fermi, Fermi-refresh & Kepler architectures
The overview is from a compute perspective,
and as such graphics features are not discussed
(PolyMorph Engine, Raster Engine, ROPs, etc.)
5. GF100 SM
SM - Streaming Multiprocessor
32 “CUDA cores”, organized into two clusters, 16 cores each
A warp is 32 threads – a 16-core cluster needs two ALU cycles per warp
NVIDIA's solution: the ALU clock is double the core clock
4 SFUs (accelerate transcendental functions)
16 Load / Store units
Dual Warp scheduler – execute two warps concurrently
Note the bottlenecks on the LD/ST & SFU units – an architectural decision
Each SM can hold up to 48 warps, divided into up to 8 blocks
Hold “in-flight” warps to hide latency
Typically, the number of blocks is lower;
for example, 24 warps per block = 2 blocks per SM
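The warp-timing and residency arithmetic above can be sketched in a few lines of Python. This is a hedged back-of-the-envelope check, not NVIDIA's occupancy model: the helper name is mine, and it considers only the warp- and block-count caps from the slide (real occupancy is also limited by registers and shared memory).

```python
# Back-of-the-envelope GF100 SM numbers, taken from the slides:
# 32 cores in two 16-wide clusters, ALU clock at 2x core clock,
# up to 48 resident warps in up to 8 resident blocks.

WARP_SIZE = 32
CLUSTER_WIDTH = 16
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8

# A 32-thread warp on a 16-wide cluster needs two ALU cycles;
# with the ALU clocked at 2x the core clock, that is one core cycle.
alu_cycles_per_warp = WARP_SIZE // CLUSTER_WIDTH      # -> 2
core_cycles_per_warp = alu_cycles_per_warp / 2        # -> 1.0

def resident_blocks(threads_per_block):
    """Blocks fitting on one SM, warp/block caps only (hypothetical
    helper; ignores the register and shared-memory limits)."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceil division
    return min(MAX_WARPS_PER_SM // warps_per_block, MAX_BLOCKS_PER_SM)

# The slide's example: 24 warps per block -> 2 blocks per SM.
print(resident_blocks(24 * WARP_SIZE))   # 2
print(resident_blocks(256))              # 8 warps/block -> 6 blocks
```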
6. Packing it all together
GPC – Graphics Processing Cluster
Four SMs
Transparent to compute usage
7. Packing it all together
Four GPCs
768 KB L2 cache shared between all SMs
Supports L2-only or L1&L2 caching
384-bit GDDR5
GigaThread Scheduler
Schedule thread blocks to SMs
Concurrent Kernel Execution – separate kernels per SM
9. Fermi GF104 SM
Changes from GF100 SM:
48 “CUDA cores”, organized into three clusters of 16 cores each
8 SFUs instead of 4
The rest remains the same (32K 32-bit registers, 64K L1/Shared, etc.)
Wait a sec… three clusters, but still scheduling only two warps?
An under-utilization study of GF100 led to a scheduling redesign –
next slide…
10. Instruction Level Parallelism (ILP)
GF100:
Two warp schedulers feed two clusters of cores
Memory access or SFU access leads to under-utilization of a core cluster
GF104:
Adopts the ILP idea from the CPU world – issues two instructions per clock
Adds a third cluster for balanced utilization
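The scheduling change above boils down to issue bandwidth. A minimal sketch of that arithmetic, assuming the slides' figures (GF100: two single-issue schedulers for two core clusters; GF104: two dual-issue schedulers for three clusters) – the "issue slots vs. execution pipes" framing and the function name are mine:

```python
# Issue slots available per clock on one SM.
def issue_slots(schedulers, instructions_per_clock):
    return schedulers * instructions_per_clock

gf100 = issue_slots(2, 1)   # 2 instructions/clock for 2 core clusters
gf104 = issue_slots(2, 2)   # 4 instructions/clock for 3 core clusters

# On GF100, any cycle spent issuing to LD/ST or SFU starves a core
# cluster. On GF104, 4 issue slots can keep 3 core clusters busy and
# still feed LD/ST or SFU -- the extra cluster balances utilization.
print(gf100, gf104)   # 2 4
```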
12. Meet GK104 SMX
192 “CUDA Cores”
Organized into 6 clusters of 32 cores each
No more “dual-clocked ALU”
16 Load/Store units
16 SFUs
64K 32-bit registers
Same 64K L1/Shared
Same dual-issue warp scheduling, now with four schedulers:
Execute 4 warps concurrently
Issue two instructions per cycle
Each SMX can hold up to 64 warps,
divided into up to 16 blocks
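The same residency arithmetic for an SMX, using the caps on this slide (64 warps, 16 blocks). As before, this is a hedged sketch with a made-up helper name, ignoring the register and shared-memory limits that also cap occupancy:

```python
# GK104 SMX residency caps, per the slide.
WARP_SIZE = 32
MAX_WARPS_PER_SMX = 64
MAX_BLOCKS_PER_SMX = 16

def resident_blocks_smx(threads_per_block):
    """Blocks fitting on one SMX, warp/block caps only."""
    warps = -(-threads_per_block // WARP_SIZE)   # ceil division
    return min(MAX_WARPS_PER_SMX // warps, MAX_BLOCKS_PER_SMX)

print(resident_blocks_smx(128))   # 4 warps/block -> 16 blocks (block cap)
print(resident_blocks_smx(256))   # 8 warps/block -> 8 blocks (warp cap)
```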
13. From GF104 to GK104
Look at half of an SMX:
Same:
Two warp schedulers
Two dispatch units per scheduler
32K register file
6 rows of cores
1 row of load/store units
1 row of SFUs
Different:
On SMX, a row of cores is 16 wide vs. 8 on SM
On SMX, a row of SFUs is 16 wide vs. 8 on SM
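A quick arithmetic check of this comparison, using only the row counts and widths from the slide: the same 6 rows of cores, with each row twice as wide on the SMX.

```python
# Core counts implied by the slide's layout comparison.
ROWS_OF_CORES = 6
sm_cores = ROWS_OF_CORES * 8          # GF104 SM: 48 cores
half_smx_cores = ROWS_OF_CORES * 16   # half an SMX: 96 cores
smx_cores = 2 * half_smx_cores        # full GK104 SMX: 192 cores
print(sm_cores, half_smx_cores, smx_cores)   # 48 96 192
```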
14. Packing it all together
Four GPCs, each with two SMXs
512 KB L2 cache shared between the SMXs
L1 is no longer used for CUDA
256-bit GDDR5
GigaThread Scheduler
Dynamic Parallelism
15. GK104 vs. GF104
Kepler has fewer “multiprocessors”
(8 vs. 16)
Less flexible in executing different kernels concurrently
Each “multiprocessor” is stronger:
Issues twice the warps (6 vs. 3)
Twice the register file
Executes a warp in a single cycle
More SFUs
10x faster atomic operations
But:
SMX holds 64 warps vs. 48 for SM – less latency hiding per core cluster
L1/Shared memory stayed the same size – and is totally bypassed in CUDA/OpenCL
Memory BW did not scale as compute/cores did (192GB/Sec, same as in GF110)
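The last point can be made concrete as bytes of bandwidth per peak FLOP. A hedged sketch: core counts and the 192 GB/s figure are from these slides; the shader clocks (1.544 GHz for a reference GTX 580/GF110, 1.006 GHz for a reference GTX 680/GK104) and the 2-FLOPs-per-cycle FMA assumption are mine.

```python
# Peak single-precision throughput, assuming 2 FLOPs/cycle/core (FMA).
def peak_gflops(cores, ghz):
    return cores * ghz * 2

def bytes_per_flop(bw_gb_s, cores, ghz):
    return bw_gb_s / peak_gflops(cores, ghz)

gf110 = bytes_per_flop(192, 512, 1.544)    # ~0.12 B/FLOP
gk104 = bytes_per_flop(192, 1536, 1.006)   # ~0.06 B/FLOP

# Same 192 GB/s now feeds roughly twice the FLOPS, so each FLOP
# gets about half the bytes -- kernels become more bandwidth-bound.
print(round(gf110, 3), round(gk104, 3))   # 0.121 0.062
```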
16. GK110 SMX
Tesla only (no GeForce version)
Very similar to GK104 SMX
Additional double-precision units, otherwise the same
18. Improved scheduling 1 – Hyper-Q
Scenario: multiple CPU processes send work to the GPU
On Fermi, time division between processes
On Kepler, simultaneous processing from multiple processes
19. Improved scheduling 2
A new age in GPU programmability:
moving from a master-slave pattern to a self-feeding GPU – kernels launching kernels (Dynamic Parallelism)