GPU Programming
12. The Gap Between CPU and GPU ref: Tesla GPU Computing Brochure
13. GPU Has 10x Comp Density Given the same chip area, the achievable performance of a GPU is 10x higher than that of a CPU.
14. Evolution of Intel Pentium Pentium I Pentium II Pentium III Pentium IV Chip area breakdown Q: What can you observe? Why?
15. Extrapolation of Single Core CPU If we extrapolate the trend, in a few generations the Pentium would look like this. Of course, we know it did not happen. Q: What happened instead? Why?
16. Evolution of Multi-core CPUs Penryn Bloomfield Gulftown Beckton Chip area breakdown Q: What can you observe? Why?
17. Let's Take a Closer Look Less than 10% of the total chip area is used for actual execution. Q: Why?
18. The Memory Hierarchy Notes on energy at 45nm: a 64-bit Int ADD takes about 1 pJ; a 64-bit FP FMA takes about 200 pJ. It seems we cannot further increase the computational density.
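A quick sanity check using the slide's own figures: at 200 pJ per 64-bit FP FMA, sustaining 10^12 FMAs per second would dissipate 200e-12 J x 1e12 /s = 200 W on arithmetic alone, before counting any data movement.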
19. The Brick Wall -- UC Berkeley's View Power Wall : power expensive, transistors free Memory Wall : Memory slow, multiplies fast ILP Wall : diminishing returns on more ILP HW David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link
20. The Brick Wall -- UC Berkeley's View Power Wall : power expensive, transistors free Memory Wall : Memory slow, multiplies fast ILP Wall : diminishing returns on more ILP HW Power Wall + Memory Wall + ILP Wall = Brick Wall David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link
21. How to Break the Brick Wall? Hint: how can we exploit the parallelism inside the application?
22. Step 1: Trade Latency for Throughput Hide the memory latency through fine-grained interleaved threading.
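A minimal CUDA sketch of this idea (the kernel and launch configuration are illustrative, not taken from the deck): launch far more threads than there are PEs, so that whenever one warp stalls on a memory load, the hardware switches to another ready warp.

    // Memory-bound kernel: the loads of x[i] and y[i] stall the issuing warp;
    // the scheduler hides that latency by running other warps meanwhile.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
    // Host side: oversubscribe the machine with threads.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);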
27. Fine-Grained Interleaved Threading Pros: reduced cache size, no branch predictor, no OOO scheduler. Cons: register pressure, thread scheduler, requires huge parallelism. (Figure: without and with fine-grained interleaved threading.)
28. HW Support The register file supports zero-overhead context switches between interleaved threads.
30. Step 2: Single Instruction Multiple Data CPU uses short SIMD: SSE has 4 data lanes (vector width of 4). GPU uses wide SIMD: 8/16/24/... data lanes, i.e., 8/16/24/... processing elements (PEs).
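For the CPU side, a minimal sketch with real SSE intrinsics (plain C; the function name is illustrative): one 4-wide instruction processes four floats at once, which is exactly the "vector width of 4" above.

    #include <xmmintrin.h>  // SSE intrinsics

    // Add b into a, four floats per SSE instruction (short SIMD).
    void add4(float *a, const float *b, int n) {
        for (int i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);           // load 4 floats
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(a + i, _mm_add_ps(va, vb));  // 4 adds at once
        }
        // (A scalar tail loop would handle the n % 4 leftover elements.)
    }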
33. Example of SIMT Execution Assume 32 threads are grouped into one warp.
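A hedged sketch of what SIMT looks like to the programmer (kernel name and output arrays are illustrative): every thread runs the same kernel, and the hardware groups 32 consecutive threads into a warp that executes in lockstep.

    // Each thread derives its warp and lane from its global thread ID.
    __global__ void simt_demo(int *warp_of, int *lane_of, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            warp_of[tid] = tid / 32;   // which warp this thread belongs to
            lane_of[tid] = tid % 32;   // which SIMD lane within the warp
        }
    }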
34. Step 3: Simple Core The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core. Lightweight PE: Fused Multiply-Add (FMA). SFU: Special Function Unit.
35. NVIDIA's Motivation for Simple Cores "This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." --Bill Dally, NVIDIA
36. Review: How Do We Reach Here? NVIDIA Fermi, 512 Processing Elements (PEs)
91. Sandy Bridge's New CPU-GPU Interface ref: "Intel's Sandy Bridge Architecture Exposed", from AnandTech ( link )
92. Sandy Bridge's New CPU-GPU Interface ref: "Intel's Sandy Bridge Architecture Exposed", from AnandTech ( link )
Editor's Notes
NVIDIA planned to put 512 PEs into a single GPU, but the GTX 480 turned out to have 480 PEs.
A GPU can achieve 10x the performance of a CPU.
Notice that third place is PowerXCell. Rmax is the measured Linpack benchmark performance; Rpeak is the raw peak performance of the machine.
This gap is narrowed by multi-core CPUs.
Comparing raw performance is less interesting.
The area breakdown is an approximation, but it is good enough to see the trend.
The size of the L3 cache in high-end and low-end CPUs is quite different.
This breakdown is also an approximation.
Numbers are based on Intel Nehalem at 45nm and on a presentation by Bill Dally.
More registers are required to store the contexts of threads.
Hiding memory latency by multi-threading. The Cell uses a relatively static approach: the overlap of computation and DMA transfer is explicitly specified by the programmer.
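A rough CUDA analogue of that explicit overlap (a sketch under assumed names, not the Cell's API): the programmer explicitly issues the copy of one chunk in a different stream than the computation on another, so transfer and compute proceed concurrently.

    __global__ void process(float *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;
    }

    // h_a/h_b should be pinned host buffers for the copies to truly overlap.
    void run_two_chunks(float *h_a, float *h_b, float *d_a, float *d_b, int n) {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);
        size_t bytes = n * sizeof(float);
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
        process<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);  // compute chunk 0 ...
        cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
        process<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);  // ... while chunk 1 copies
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }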
Fine-grained multi-threading can keep the PEs busy even if the program has little ILP.
The cache can still help.
The address assignment and translation are done dynamically by hardware.
The vector core should be larger than a scalar core.
From scalar to vector.
From vector to threads.
Warps can be grouped at run time by hardware; in this case, warp formation is transparent to the programmer.
The NVIDIA Fermi PE can execute both integer and floating-point operations.
We have ignored some architectural features of Fermi. Notably, the interconnection network is not discussed here.
These features are summarized in the paper by Michael Garland and David Kirk.
The vector program uses SSE as an example. However, "incps" is not a real SSE instruction; it is used here to represent incrementing the vector (the real equivalent would be an addps with a vector of ones).
Each thread uses its ID to locate its working data set.
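That idiom in one sketch (the grid-stride form is an assumption, not necessarily the slide's code): each thread computes a unique global ID and uses it to pick the elements it owns.

    __global__ void scale(float *data, float s, int n) {
        // The global ID locates this thread's data; the stride lets any
        // grid size cover all n elements.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] *= s;
    }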
The scheduler tries to maintain load balancing among SMs.
Numbers are taken from an old paper on the G80 architecture, but they should be similar for the GF100 architecture.
The old architecture has 16 banks.
It is a trend to use threads to hide the vector width; OpenCL applies the same programming model.
It is arguable whether working on threads is more productive.
This example assumes the two warp schedulers are decoupled. It is possible that they are coupled together, at the cost of hardware complexity.
Assume the register file has one read port. The register file may need two read ports to support instructions with three source operands, e.g., the Fused Multiply-Add (FMA).
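For reference, the three-source-operand case in a minimal device sketch (names illustrative): FMA reads a, b, and c in a single instruction, which is what pressures the register-file read ports.

    __global__ void fma_demo(float *d, const float *a, const float *b,
                             const float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d[i] = fmaf(a[i], b[i], c[i]);  // d = a*b + c, single rounding
    }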
5-issue VLIW.
The atomic unit is helpful in voting operations, e.g., building a histogram.
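A minimal sketch of that voting pattern (the kernel name and 256-bin layout are assumptions): each thread atomically votes its input byte into a histogram bin.

    __global__ void histogram(const unsigned char *in, int n, unsigned int *bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[in[i]], 1u);  // bins[256], zeroed before launch
    }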
The figure is taken from the 8800 GPU. See the paper by Samuel Williams for more detail.
The number was obtained on the 8800 GPU.
Latency hiding is addressed in the PhD thesis of Samuel Williams.