This document summarizes a presentation about using CUDA (Compute Unified Device Architecture) to accelerate lattice quantum chromodynamics (QCD) calculations. CUDA is used to parallelize the computation across many GPU threads: each thread processes one lattice site, reading its neighboring sites and links one at a time. Initially, each thread required 1.4 KB of fast local storage, which limited occupancy; this was reduced to about 0.2 KB, and occupancy was further improved by keeping per-thread data in registers instead of shared memory, which requires expanding all loops explicitly. The resulting kernel reaches up to 82 gigabytes per second of memory bandwidth on a GTX 280, about 20 times faster than CPU implementations. Memory access patterns, float4 arrays, and textures were also tuned to improve bandwidth utilization.
7. I’ll discuss Quantum ChromoDynamics
Although it’s “standard”, these equations are hard to solve
Big questions:
why do quarks appear in groups?
physics during the big bang?
9. Quantum ChromoDynamics
The theory of nuclear interactions
(bound by “gluons”)
Extremely difficult:
Must work at the level of fields, not particles
Calculation is quantum mechanical
10. Lattice QCD:
Solving Quantum Chromodynamics by Computer
Discretize space and time
(place the quarks and gluons on a 4D lattice)
Spacetime = 3+1 dimensions
32^4 ∼ 10^6 lattice sites
Quarks live on sites (24 floats each)
Gluons live on links (18 floats each)
Total system size: 4 bytes/float × 32^4 sites × (24 + 4 × 18) floats ∼ 384 MB
12. Lattice QCD:
The inner loop requires repeatedly solving a linear equation in which the matrix DW, built from the gluon links, acts on the quark field
DW is a sparse matrix with only nearest-neighbor couplings
DW needs to be fast!
21.–25. Computing DW at a site, one neighbor at a time:
Step 1: Read neighbor site
Step 2: Read neighbor link
Step 3: Accumulate into the result
Step 4: Read the next neighbor site
Step 5: Read the next neighbor link
Step 6: Accumulate into the result
(and so on around the remaining neighbors)
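As a rough sketch of this access pattern (not the presentation's actual kernel; the data layout, the neighbor table, and the placeholder multiply-accumulate helper are all assumptions made for illustration), a one-thread-per-site CUDA kernel might look like this:

// Hypothetical one-thread-per-site stencil kernel, sketched for illustration.
// SITE_FLOATS = 24 (quark spinor), LINK_FLOATS = 18 (SU(3) link), 8 neighbors in 4D.
#define SITE_FLOATS 24
#define LINK_FLOATS 18
#define NDIRS 8

// Placeholder for the real spinor/link algebra, kept generic so the sketch is
// self-contained; the actual DW multiplication is more involved.
__device__ void mult_link_add(const float *link, const float *site, float *acc)
{
    for (int i = 0; i < SITE_FLOATS; i++)
        acc[i] += link[i % LINK_FLOATS] * site[i];
}

__global__ void dw_kernel(const float *sites, const float *links,
                          const int *neighbor, float *out, int nsites)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per lattice site
    if (s >= nsites) return;

    float acc[SITE_FLOATS] = {0.0f};                 // accumulator (the result spinor)

    for (int d = 0; d < NDIRS; d++) {
        float nbr[SITE_FLOATS];                      // Step 1: read the neighbor site
        float lnk[LINK_FLOATS];                      // Step 2: read the neighbor link
        int n = neighbor[s * NDIRS + d];
        for (int i = 0; i < SITE_FLOATS; i++) nbr[i] = sites[n * SITE_FLOATS + i];
        for (int i = 0; i < LINK_FLOATS; i++) lnk[i] = links[(s * NDIRS + d) * LINK_FLOATS + i];
        mult_link_add(lnk, nbr, acc);                // Step 3: accumulate into the result
    }

    for (int i = 0; i < SITE_FLOATS; i++)            // write the result back to global memory
        out[s * SITE_FLOATS + i] = acc[i];
}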
28. Reminder -- each multiprocessor has:
16 kb shared memory
16 k registers
1024 active threads (max)
High occupancy (roughly 25% or so) is needed for maximum performance
29.–32. DW : does it fit onto the GPU?
Each thread requires 1.4 kb → 0.2 kb of fast local memory
(per neighbor: 24 → 12 floats for the site and 18 floats for the link, plus 24 floats for the accumulator)
An MP has 16 kb of shared memory
Threads/MP = 16 / 0.2 = 80 → 64 (multiple of 64 only)
MP occupancy = 64/1024 = 6%
33. 6% occupancy sounds pretty bad!
(image: Andreas Kuehn / Getty)
34.–35. How can we get better occupancy?
Reminder -- each multiprocessor has:
16 kb shared memory
16 k registers (= 64 kb of memory)
1024 active threads (max)
Each thread requires 0.2 kb of fast local memory
Occupancy > 25% needs more fast local memory than the 16 kb of shared memory; the registers hold 64 kb
36. Registers as data
(possible because there is no inter-thread communication)
Instead of shared memory, the per-thread working data is allocated in registers
37. Registers as data
Registers can't be indexed, so all loops must be EXPLICITLY expanded
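A minimal sketch of the idea (the variable names are made up for illustration, not taken from the talk): an indexed local array may be spilled to slow local memory, so each element becomes a separately named variable and the loop over elements is written out explicitly.

// Hypothetical illustration of "registers as data" for a small per-thread vector.
// Indexed version: if the index is not resolvable at compile time, the array may
// end up in (slow) local memory instead of registers.
__device__ void axpy_indexed(float a, const float *x, float *y)
{
    for (int i = 0; i < 6; i++)
        y[i] += a * x[i];
}

// Register version: every element is a named variable and the loop is
// EXPLICITLY expanded, so the compiler can keep everything in registers.
__device__ void axpy_registers(float a,
                               float x0, float x1, float x2,
                               float x3, float x4, float x5,
                               float &y0, float &y1, float &y2,
                               float &y3, float &y4, float &y5)
{
    y0 += a * x0;  y1 += a * x1;  y2 += a * x2;
    y3 += a * x3;  y4 += a * x4;  y5 += a * x5;
}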
39. Performance Results:
44 Gigabytes/sec (Tesla C870)
82 Gigabytes/sec (GTX 280)
(90 Gflop/s)
(completely bandwidth limited)
For comparison:
twice as fast as a Cell implementation (arXiv:0804.3654)
20 times faster than CPU implementations
40. GB/s vs Occupancy
(bar chart: achieved bandwidth vs. occupancy, Tesla C870 from 0% to ≥25% with a y-axis up to 45 GB/s, GTX 280 from 0% to ≥19% with a y-axis up to 85 GB/s)
Surprise! Very robust to low occupancy
43. When memory access isn’t perfectly coalesced
Sometimes float4 arrays can hide latency
A float4 global memory read corresponds to a single CUDA instruction
If coalescing is missed, at least 4x the data is transferred per instruction
(diagram: adjacent threads 0, 1, 2 each reading one float4)
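For example (a sketch with assumed names, not code from the talk), reading a float4 element issues a single 16-byte load per thread:

// Hypothetical sketch: one float4 load per thread.
__global__ void sum_float4(const float4 *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 v = in[i];               // typically a single vector load (ld.global.v4.f32)
    out[i] = v.x + v.y + v.z + v.w;
}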
44. When memory access isn’t perfectly coalesced
Binding the array to a texture can help
A texture fetch also corresponds to a single CUDA instruction
This makes use of the texture cache and can reduce the penalty for nearly coalesced accesses
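As a sketch of the technique using the texture-reference API of that CUDA generation (since removed from recent toolkits; the names here are assumptions, not the presentation's code), a linear array of float4 data can be bound to a 1D texture and fetched through the texture cache:

// Hypothetical sketch, legacy texture-reference API (CUDA toolkits of that era).
texture<float4, 1, cudaReadModeElementType> spinor_tex;

__global__ void sum_via_texture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 v = tex1Dfetch(spinor_tex, i);   // read goes through the texture cache
    out[i] = v.x + v.y + v.z + v.w;
}

// Host side, before the kernel launch:
//   cudaBindTexture(0, spinor_tex, dev_spinors, n * sizeof(float4));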
45. Regarding textures, there are two kinds of memory:
Linear array
Can be modified in a kernel
Can only be bound to a 1D texture
“CUDA array”
Can’t be modified in a kernel
Gets reordered for 2D/3D locality
Allows various hardware features
46. When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve
This gives 2D locality
(Z-curve illustration: Wikipedia image)
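As an illustration of what such a reordering looks like (the exact layout the hardware uses is not documented; this is only the generic Z-curve idea), a Morton index interleaves the bits of x and y so that nearby (x, y) pairs land at nearby addresses:

// Hypothetical sketch: Morton (Z-curve) index for 16-bit x, y coordinates.
__host__ __device__ unsigned int morton2d(unsigned int x, unsigned int y)
{
    unsigned int idx = 0;
    for (int b = 0; b < 16; b++) {
        idx |= ((x >> b) & 1u) << (2 * b);       // even bits come from x
        idx |= ((y >> b) & 1u) << (2 * b + 1);   // odd bits come from y
    }
    return idx;
}
// e.g. morton2d(2, 3) == 14: points that are close in (x, y) stay close in the 1D index.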
47. Warnings:
The effectiveness of float4 and textures depends
on the CUDA hardware and driver (!)
Certain “magic” access patterns are many
times faster than others
Testing appears to be necessary
48. Memory bandwidth test
Simple kernel
Memory access completely coalesced
Should be optimal
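A minimal sketch of such a test (the sizes, names, and timing details are assumptions): a plain copy kernel in which consecutive threads touch consecutive float4 elements, timed with CUDA events to report an effective GB/s figure.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical bandwidth test: perfectly coalesced copy, one float4 per thread.
__global__ void copy_kernel(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                       // 16M float4 = 256 MB per array
    float4 *in, *out;
    cudaMalloc(&in, n * sizeof(float4));
    cudaMalloc(&out, n * sizeof(float4));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    double gb = 2.0 * n * sizeof(float4) / 1e9;  // bytes read + bytes written
    printf("effective bandwidth: %.1f GB/s\n", gb / (ms / 1e3));
    return 0;
}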
53. CUDA Compiler
CUDA C code → PTX code → CUDA machine code
(LOTS of optimization here)
Use the unofficial CUDA disassembler to view the CUDA machine code (CUDA disassembly)
54. CUDA Disassembler (decuda)
foo.cu
Compile and save cubin file
Disassemble
57. The compiler is very aggressive in optimization. It will
group memory loads together to minimize latency
(snippet from LQCD)
Notice: each thread reads 20 floats!