GPU Programming
12. The Gap Between CPU and GPU ref: Tesla GPU Computing Brochure
13. GPU Has 10x Comp Density Given the same chip area, the achievable performance of a GPU is 10x higher than that of a CPU.
14. Evolution of Intel Pentium Pentium I Pentium II Pentium III Pentium IV Chip area breakdown Q: What can you observe? Why?
15. Extrapolation of Single Core CPU If we extrapolate the trend, in a few generations the Pentium would look like this. Of course, we know it did not happen. Q: What happened instead? Why?
16. Evolution of Multi-core CPUs Penryn Bloomfield Gulftown Beckton Chip area breakdown Q: What can you observe? Why?
17. Let's Take a Closer Look Less than 10% of the total chip area is used for actual execution. Q: Why?
18. The Memory Hierarchy Notes on energy at 45nm: a 64-bit Int ADD takes about 1 pJ; a 64-bit FP FMA takes about 200 pJ. It seems we cannot further increase the computational density.
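A quick sanity check using the slide's own figures: at 200 pJ per 64-bit FP FMA, sustaining 10^12 FMAs per second would dissipate 200e-12 J x 1e12 /s = 200 W on arithmetic alone, before counting any data movement.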
19. The Brick Wall -- UC Berkeley's View Power Wall : power expensive, transistors free Memory Wall : Memory slow, multiplies fast ILP Wall : diminishing returns on more ILP HW David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link
20. The Brick Wall -- UC Berkeley's View Power Wall : power expensive, transistors free Memory Wall : Memory slow, multiplies fast ILP Wall : diminishing returns on more ILP HW Power Wall + Memory Wall + ILP Wall = Brick Wall David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link
21. How to Break the Brick Wall? Hint: how can we exploit the parallelism inside the application?
22. Step 1: Trade Latency for Throughput Hide the memory latency through fine-grained interleaved threading.
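A minimal CUDA sketch of this idea (the kernel and launch configuration are illustrative, not taken from the deck): launch far more threads than there are PEs, so that whenever one warp stalls on a memory load, the hardware switches to another ready warp.

    // Memory-bound kernel: the loads of x[i] and y[i] stall the issuing warp;
    // the scheduler hides that latency by running other warps meanwhile.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
    // Host side: oversubscribe the machine with threads.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);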
27. Fine-Grained Interleaved Threading Pros: reduced cache size, no branch predictor, no OOO scheduler. Cons: register pressure, thread scheduler, requires huge parallelism. (Figure: without and with fine-grained interleaved threading.)
28. HW Support The register file supports zero-overhead context switches between interleaved threads.
30. Step 2: Single Instruction Multiple Data CPU uses short SIMD: SSE has 4 data lanes (vector width of 4). GPU uses wide SIMD: 8/16/24/... data lanes, i.e., 8/16/24/... processing elements (PEs).
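For the CPU side, a minimal sketch with real SSE intrinsics (plain C; the function name is illustrative): one 4-wide instruction processes four floats at once, which is exactly the "vector width of 4" above.

    #include <xmmintrin.h>  // SSE intrinsics

    // Add b into a, four floats per SSE instruction (short SIMD).
    void add4(float *a, const float *b, int n) {
        for (int i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);           // load 4 floats
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(a + i, _mm_add_ps(va, vb));  // 4 adds at once
        }
        // (A scalar tail loop would handle the n % 4 leftover elements.)
    }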
33. Example of SIMT Execution Assume 32 threads are grouped into one warp.
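A hedged sketch of what SIMT looks like to the programmer (kernel name and output arrays are illustrative): every thread runs the same kernel, and the hardware groups 32 consecutive threads into a warp that executes in lockstep.

    // Each thread derives its warp and lane from its global thread ID.
    __global__ void simt_demo(int *warp_of, int *lane_of, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            warp_of[tid] = tid / 32;   // which warp this thread belongs to
            lane_of[tid] = tid % 32;   // which SIMD lane within the warp
        }
    }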
34. Step 3: Simple Core The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core. Lightweight PE: Fused Multiply-Add (FMA). SFU: Special Function Unit.
35. NVIDIA's Motivation for Simple Cores "This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." --Bill Dally, NVIDIA
36. Review: How Do We Reach Here? NVIDIA Fermi, 512 Processing Elements (PEs)
91. Sandy Bridge's New CPU-GPU Interface ref: "Intel's Sandy Bridge Architecture Exposed", from AnandTech ( link )
92. Sandy Bridge's New CPU-GPU Interface ref: "Intel's Sandy Bridge Architecture Exposed", from AnandTech ( link )
Editor's Notes
NVIDIA planned to put 512 PEs into a single GPU, but the GTX 480 turned out to have 480 PEs.
A GPU can achieve 10x the performance of a CPU.
Notice that third place is PowerXCell. Rmax is the measured Linpack benchmark performance; Rpeak is the raw peak performance of the machine.
This gap is narrowed by multi-core CPUs.
Comparing raw performance is less interesting.
The area breakdown is an approximation, but it is good enough to see the trend.
The size of the L3 cache in high-end and low-end CPUs is quite different.
This breakdown is also an approximation.
Numbers are based on Intel Nehalem at 45nm and on a presentation by Bill Dally.
More registers are required to store the contexts of threads.
Hiding memory latency by multi-threading. The Cell uses a relatively static approach: the overlap of computation and DMA transfer is explicitly specified by the programmer.
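A rough CUDA analogue of that explicit overlap (a sketch under assumed names, not the Cell's API): the programmer explicitly issues the copy of one chunk in a different stream than the computation on another, so transfer and compute proceed concurrently.

    __global__ void process(float *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;
    }

    // h_a/h_b should be pinned host buffers for the copies to truly overlap.
    void run_two_chunks(float *h_a, float *h_b, float *d_a, float *d_b, int n) {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);
        size_t bytes = n * sizeof(float);
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
        process<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);  // compute chunk 0 ...
        cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
        process<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);  // ... while chunk 1 copies
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }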
Fine-grained multi-threading can keep the PEs busy even if the program has little ILP.
The cache can still help.
The address assignment and translation are done dynamically by hardware.
The vector core should be larger than a scalar core.
From scalar to vector.
From vector to threads.
Warps can be grouped at run time by hardware; in this case, warp formation is transparent to the programmer.
The NVIDIA Fermi PE can execute both integer and floating-point operations.
We have ignored some architectural features of Fermi. Notably, the interconnection network is not discussed here.
These features are summarized in the paper by Michael Garland and David Kirk.
The vector program uses SSE as an example. However, "incps" is not a real SSE instruction; it is used here to represent incrementing the vector (the real equivalent would be an addps with a vector of ones).
Each thread uses its ID to locate its working data set.
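That idiom in one sketch (the grid-stride form is an assumption, not necessarily the slide's code): each thread computes a unique global ID and uses it to pick the elements it owns.

    __global__ void scale(float *data, float s, int n) {
        // The global ID locates this thread's data; the stride lets any
        // grid size cover all n elements.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] *= s;
    }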
The scheduler tries to maintain load balancing among SMs.
Numbers are taken from an old paper on the G80 architecture, but they should be similar for the GF100 architecture.
The old architecture has 16 banks.
It is a trend to use threads to hide the vector width; OpenCL applies the same programming model.
It is arguable whether working on threads is more productive.
This example assumes the two warp schedulers are decoupled. It is possible that they are coupled together, at the cost of hardware complexity.
Assume the register file has one read port. The register file may need two read ports to support instructions with three source operands, e.g., the Fused Multiply-Add (FMA).
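For reference, the three-source-operand case in a minimal device sketch (names illustrative): FMA reads a, b, and c in a single instruction, which is what pressures the register-file read ports.

    __global__ void fma_demo(float *d, const float *a, const float *b,
                             const float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d[i] = fmaf(a[i], b[i], c[i]);  // d = a*b + c, single rounding
    }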
5-issue VLIW.
The atomic unit is helpful in voting operations, e.g., building a histogram.
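A minimal sketch of that voting pattern (the kernel name and 256-bin layout are assumptions): each thread atomically votes its input byte into a histogram bin.

    __global__ void histogram(const unsigned char *in, int n, unsigned int *bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[in[i]], 1u);  // bins[256], zeroed before launch
    }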
The figure is taken from the 8800 GPU. See the paper by Samuel Williams for more detail.
The number was obtained on the 8800 GPU.
Latency hiding is addressed in the PhD thesis of Samuel Williams.