The document discusses irregular parallelism on GPUs and presents several algorithms and data structures for handling irregular workloads efficiently in parallel. It covers sparse matrix-vector multiplication using different sparse matrix formats. It also discusses compositing of fragments in parallel and presents a nested data parallel approach. The document describes challenges with parallel hashing and presents a two-level hashing scheme. It analyzes parallel task queues and work stealing techniques for load balancing irregular work. Throughout, it focuses on managing communication in addition to computation for optimal parallel performance.
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)
1. Irregular Parallelism on the GPU:
Algorithms and Data Structures
John Owens
Associate Professor, Electrical and Computer Engineering
Institute for Data Analysis and Visualization
SciDAC Institute for Ultrascale Visualization
University of California, Davis
2. Outline
• Irregular parallelism
• Get a new data structure! (sparse matrix-vector)
• Reuse a good data structure! (composite)
• Get a new algorithm! (hash tables)
• Convert serial to parallel! (task queue)
• Concentrate on communication, not computation! (MapReduce)
• Broader thoughts
4. Sparse Matrix-Vector Multiply: What’s Hard?
• Dense approach is wasteful
• Unclear how to map work to parallel processors
• Irregular data access
“Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors”, Bell & Garland, SC ’09
5. Sparse Matrix Formats
(Figure: a spectrum of formats from structured to unstructured: Diagonal (DIA), ELLPACK (ELL), Compressed Row (CSR), Hybrid (HYB), Coordinate (COO).)
6. Diagonal Matrices
(Figure: a matrix with diagonals at offsets −2, 0, 1. We’ve recently done work on tridiagonal systems—PPoPP ’10.)
• Diagonals should be mostly populated
• Map one thread per row
• Good parallel efficiency
• Good memory behavior [column-major storage]
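A minimal CUDA sketch of this mapping, assuming the usual DIA layout (one thread per row; diagonals stored column-major so data[d * num_rows + row] is the element of diagonal d in that row). Names here are illustrative, not the paper's code:

```cpp
// DIA SpMV sketch: one thread per row; column-major diagonal storage
// makes the data[] reads coalesced across neighboring threads.
__global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                         const int* offsets,   // diagonal offsets (e.g., -2, 0, 1)
                         const float* data,    // num_diags x num_rows, column-major
                         const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    float sum = 0.0f;
    for (int d = 0; d < num_diags; ++d) {
        int col = row + offsets[d];
        if (col >= 0 && col < num_cols)
            sum += data[d * num_rows + row] * x[col];
    }
    y[row] = sum;
}
```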
8. Irregular Matrices: COO
(Figure: an example 6×6 sparse matrix; COO stores one (row, column, value) triple per nonzero—no padding!)
• General format; insensitive to sparsity pattern, but ~3x slower than ELL
• Assign one thread per element; combine the results from all elements in a row to get each output element
• Requires a segmented reduction and communication between threads
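For illustration only, here is a deliberately simplified COO sketch that maps one thread per nonzero but combines per-row products with atomicAdd instead of the segmented reduction described above; it shows the work assignment, not the combining strategy used in the paper.

```cpp
// Simplified COO SpMV: one thread per (row, col, val) triple.
// y must be zeroed before launch; atomicAdd stands in for the
// segmented reduction used in the real implementation.
__global__ void spmv_coo_atomic(int num_nonzeros,
                                const int* rows, const int* cols,
                                const float* vals,
                                const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_nonzeros)
        atomicAdd(&y[rows[i]], vals[i] * x[cols[i]]);
}
```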
11. SpMV: Summary
• Ample parallelism for large matrices
• Structured matrices (dense, diagonal): straightforward
• Sparse matrices: Tradeoff between parallel efficiency and
load balance
• Take-home message: Use data structure appropriate to
your matrix
• Insight: Irregularity is manageable if you regularize the
common case
13. The Programmable Pipeline
Raster pipeline: Input Assembly → Vertex Shading → Tess. Shading → Geom. Shading → Raster → Frag. Shading → Compose
Reyes pipeline: Split → Dice → Shading → Sampling → Composition
Ray tracing pipeline: Ray Generation → Ray Traversal → Ray-Primitive Intersection → Shading
14. Composition
(Figure: samples/fragments → Fragment Generation → Fragments (position, depth, color) → Composite → Subpixels (position, color) → Filter → Final pixels.)
Anjul Patney, Stanley Tzeng, and John D. Owens. Fragment-Parallel Composite and Filter. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering), 29(4):1251–1258, June 2010.
23. Scalar Hashing: Parallel Problems
(Figure: keys hashing into the table; different keys need different numbers of probes.)
• Construction and Lookup
• Variable time/work per entry
• Construction
• Synchronization / shared access to data structure
24. Parallel Hashing: The Problem
• Hash tables are good for sparse data.
• Input: Set of key-value pairs to place in the hash table
• Output: Data structure that allows:
• Determining if key has been placed in hash table
• Given the key, fetching its value
• Could also:
• Sort key-value pairs by key (construction)
• Binary-search sorted list (lookup)
• Recalculate at every change
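As a point of comparison, the sort-plus-binary-search alternative above is a few lines of Thrust. This is a sketch with made-up data, not a tuned implementation:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>

int main()
{
    int hk[4] = {42, 7, 19, 3}, hv[4] = {0, 1, 2, 3};
    thrust::device_vector<int> keys(hk, hk + 4), vals(hv, hv + 4);

    // "Construction": sort the key-value pairs by key.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // "Lookup": binary-search each query key in the sorted keys
    // (O(log n) per query, vs. O(1) expected for a hash table).
    int hq[2] = {19, 42};
    thrust::device_vector<int> queries(hq, hq + 2), pos(2);
    thrust::lower_bound(keys.begin(), keys.end(),
                        queries.begin(), queries.end(), pos.begin());
    // pos[i] is the index of queries[i] in the sorted key array.
    return 0;
}
```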
25. Parallel Hashing: What We Want
• Fast construction time
• Fast access time
• O(1) for any element, O(n) for n elements in parallel
• Reasonable memory usage
• Algorithms and data structures may sit at different places in
this space
• Perfect spatial hashing has good lookup times and memory
usage but is very slow to construct
26. Level 1: Distribute into buckets
(Figure: each key is hashed to a bucket id; atomic adds accumulate bucket sizes and give each key its local offset; a scan of the bucket sizes (e.g., 8, 5, 6, 8 → 0, 8, 13, 19) yields global offsets, and the data is distributed into buckets.)
Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and
Nina Amenta. Real-Time Parallel Hashing on the GPU. Siggraph Asia 2009 (ACM TOG 28(5)), pp. 154:1–154:9, Dec. 2009.
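A minimal sketch of the level-1 counting step (hash constants and names are illustrative, not the paper's code): each key atomically bumps its bucket's counter, which doubles as the key's local offset; an exclusive scan of bucket_sizes then gives the global offsets shown above.

```cpp
// Level-1 distribution, counting phase.
__global__ void count_buckets(int n, const unsigned int* keys, int num_buckets,
                              unsigned int* bucket_sizes,  // pre-zeroed
                              unsigned int* bucket_of,     // bucket id per key
                              unsigned int* local_offset)  // offset within bucket
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int b = (keys[i] * 2654435761u) % num_buckets;  // placeholder hash
    bucket_of[i] = b;
    local_offset[i] = atomicAdd(&bucket_sizes[b], 1u);
}
```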
27. Parallel Hashing: Level
• FKS (Level 1) is good for a coarse categorization
• Possible performance issue: atomics
• Bad for a fine categorization
• Space requirement for n elements to (probabilistically) guarantee no collisions is O(n²)
29. Cuckoo Hashing Construction
(Figure: two hash functions h1, h2 index two tables T1, T2; an inserted key evicts any occupant of its slot, and the evicted key reinserts itself using its other hash function.)
• Lookup procedure: in parallel, for each element:
• Calculate h1 & look in T1;
• Calculate h2 & look in T2; still O(1) lookup
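A minimal lookup sketch for the two-table case (hash constants and the EMPTY sentinel are illustrative): each thread makes at most two probes, so lookup stays O(1).

```cpp
#define EMPTY 0xFFFFFFFFu

__device__ unsigned int h1(unsigned int k, unsigned int size) { return (k * 2654435761u) % size; }
__device__ unsigned int h2(unsigned int k, unsigned int size) { return (k * 40503u + 2531011u) % size; }

__global__ void cuckoo_lookup(int n, const unsigned int* queries,
                              const unsigned int* T1_keys, const unsigned int* T1_vals,
                              const unsigned int* T2_keys, const unsigned int* T2_vals,
                              unsigned int table_size, unsigned int* out_vals)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int k = queries[i];

    unsigned int s1 = h1(k, table_size);                      // first probe: T1
    if (T1_keys[s1] == k) { out_vals[i] = T1_vals[s1]; return; }

    unsigned int s2 = h2(k, table_size);                      // second probe: T2
    out_vals[i] = (T2_keys[s2] == k) ? T2_vals[s2] : EMPTY;   // EMPTY = not found
}
```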
30. Cuckoo Construction Mechanics
• Level 1 created buckets of no more than 512 items
• Average: 409; probability of overflow: < 10⁻⁶
• Level 2: Assign each bucket to a thread block, construct
cuckoo hash per bucket entirely within shared memory
• Semantic: Multiple writes to same location must have one and
only one winner
• Our implementation uses 3 tables of 192 elements each (load
factor: 71%)
• What if it fails? New hash functions & start over.
32. Why did we pick this implementation?
• Alternative: single stage, cuckoo hash only
• Problem 1: Uncoalesced reads/writes to global memory
(and may require atomics)
• Problem 2: Penalty for failure is large
33. But it turns out …
(Thanks to expert reviewer and current collaborator Vasily Volkov!)
• … this simpler approach is faster.
• Single hash table in global memory (1.4x data size),
4 different hash functions into it
• For each thread:
• Atomically exchange my item with item in hash table
• If my new item is empty, I’m done
• If my new item used hash function n, use hash function
n+1 to reinsert
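A sketch of that insertion loop, assuming each key and value are packed into one 64-bit word so the swap is a single atomicExch. Hash constants, the EMPTY sentinel, and the retry limit are illustrative choices, not the exact published code.

```cpp
#define NUM_HASHES 4
#define EMPTY 0xFFFFFFFFFFFFFFFFull

__device__ unsigned int hash_n(int n, unsigned int key, unsigned int size)
{
    const unsigned int a[NUM_HASHES] = {2654435761u, 40503u, 334214459u, 2246822519u};
    return (a[n] * key) % size;
}

__global__ void cuckoo_insert(int n, const unsigned int* keys, const unsigned int* vals,
                              unsigned long long* table, unsigned int table_size,
                              int* failed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned long long entry = ((unsigned long long)keys[i] << 32) | vals[i];
    int h = 0;
    for (int iter = 0; iter < 100; ++iter) {       // retry limit is illustrative
        unsigned int slot = hash_n(h, (unsigned int)(entry >> 32), table_size);
        entry = atomicExch(&table[slot], entry);   // swap my entry into the table
        if (entry == EMPTY) return;                // evicted an empty slot: done
        // Evicted a real entry: find which hash placed it there,
        // then reinsert it with the next hash function.
        unsigned int ekey = (unsigned int)(entry >> 32);
        int placed = 0;
        for (int j = 0; j < NUM_HASHES; ++j)
            if (hash_n(j, ekey, table_size) == slot) { placed = j; break; }
        h = (placed + 1) % NUM_HASHES;
    }
    *failed = 1;  // give up: caller picks new hash functions and rebuilds
}
```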
34. Hashing: Big Ideas
• Classic serial hashing techniques are a poor fit for a GPU.
• Serialization, load balance
• Solving this problem required a different algorithm
• Both hashing algorithms were new to the parallel literature
• Hybrid algorithm was entirely new
36. Representing Smooth Surfaces
• Polygonal Meshes
• Lack view adaptivity
• Inefficient storage/transfer
• Parametric Surfaces
• Collections of smooth patches
• Hard to animate
• Subdivision Surfaces
• Widely popular, easy for modeling
• Flexible animation
[Boubekeur and Schlick 2007]
37. Parallel Traversal
Anjul Patney and John D. Owens. Real-Time Reyes-
Style Adaptive Surface Subdivision. ACM Transactions
on Graphics, 27(5):143:1–143:8, December 2008.
38. Cracks & T-Junctions
(Figure: uniform vs. adaptive refinement; adaptive refinement can introduce cracks and T-junctions along patch boundaries.)
Previous work was post-process after all steps; ours was
inherent within subdivision process
41. Static Task List
(Figure: SMs consume a static input task list through an atomic pointer and write to an output list; processing the output requires restarting the kernel.)
Daniel Cederman and Philippas Tsigas. On Dynamic Load Balancing on Graphics Processors. Graphics Hardware 2008, June 2008.
42. Private Work Queue Approach
• Allocate private work queue of tasks per
core
• Each core can add to or remove work from
its local queue
• Cores mark self as idle if {queue
exhausts storage, queue is empty}
• Cores periodically check global idle
counter
• If global idle counter reaches threshold, rebalance work
gProximity: Fast Hierarchy Operations on GPU Architectures, Lauterbach, Mo, and Manocha, EG ’10
43. Work Stealing & Donating
(Figure: each SM owns a deque with a lock and I/O; SMs steal from or donate to each other’s deques.)
• Cederman and Tsigas: Stealing == best performance and
scalability (follows Arora CPU-based work)
• We showed how to do this with multiple kernels in an
uberkernel and persistent-thread programming style
• We added donating to minimize memory usage
44. Ingredients for Our Scheme
Implementation questions that we need to address:
• What is the proper granularity for tasks? → Warp-Size Work Granularity
• How many threads to launch? → Persistent Threads
• How to avoid global synchronizations? → Uberkernels
• How to distribute tasks evenly? → Task Donation
45. Warp Size Work Granularity
• Problem: We want to emulate task level parallelism on
the GPU without loss in efficiency.
• Solution: we choose block sizes of 32 threads / block.
• Removes messy synchronization barriers.
• Can view each block as a MIMD thread. We call these blocks
processors
(Figure: many 32-thread “processors” P mapped onto each physical processor.)
46. Uberkernel Processor Utilization
• Problem: Want to eliminate global kernel barriers for
better processor utilization
• Uberkernels pack multiple execution routes into one
kernel.
(Figure: two separate kernels for Pipeline Stage 1 and Pipeline Stage 2 vs. a single uberkernel containing Stage 1 and Stage 2, with data flowing between stages inside the kernel.)
47. Persistent Thread Scheduler Emulation
• Problem: if the input is irregular, how many threads do we launch?
• Launch enough to fill the GPU, and keep them alive so
they keep fetching work.
Life of a persistent thread: Spawn → Fetch Data → Process Data → Write Output → (fetch again …) → Death
How do we know when to stop? When there is no more work left.
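A minimal persistent-threads sketch combining the ideas above (illustrative, not the paper's code): launch only as many 32-thread blocks as the GPU can keep resident; each block loops, grabbing tasks from a global queue until the queue is drained, and an uberkernel-style switch dispatches each task to its stage.

```cpp
struct Task { int stage; int item; };

__global__ void persistent_uberkernel(const Task* queue, int* queue_head,
                                      int queue_size, float* out)
{
    __shared__ int task_idx;
    while (true) {
        if (threadIdx.x == 0)
            task_idx = atomicAdd(queue_head, 1);   // this "processor" grabs the next task
        __syncthreads();
        if (task_idx >= queue_size) return;        // no work left: persistent thread dies

        Task t = queue[task_idx];
        if (threadIdx.x == 0) {
            switch (t.stage) {                     // uberkernel: one kernel, many routes
                case 0: out[t.item] = 1.0f; break; // placeholder stage-1 work
                case 1: out[t.item] = 2.0f; break; // placeholder stage-2 work
            }
        }
        __syncthreads();                           // keep the block converged before looping
    }
}
// Launched as roughly: persistent_uberkernel<<<num_SMs * blocks_per_SM, 32>>>(...)
```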
48. Task Donation
• When a bin is full, processor can give work to someone
else.
(Figure: a processor whose bin is full donates its work to someone else’s bin.)
• Smaller memory usage
• More complicated
49. Queuing Schemes
(Figure: work processed and idle time per processor for four queuing schemes: (a) Block Queuing, (b) Distributed Queuing, (c) Task Stealing, (d) Task Donating. Gray bars: load balance (work processed per processor); black line: idle time (idle iterations per processor).)
Stanley Tzeng, Anjul Patney, and John D. Owens. Task Management for Irregular-Parallel Workloads on the GPU. In Proceedings of High Performance Graphics 2010, pages 29–37, June 2010.
50. Smooth Surfaces, High Detail
16 samples per pixel
>15 frames per second on GeForce GTX280
53. Why MapReduce?
• Simple programming model
• Parallel programming model
• Scalable
• Previous GPU work: neither multi-GPU nor out-of-core
54. Block Diagram
(Figure: GPMR block diagram; the stages are described in the paper below.)
Jeff A. Stuart and John D. Owens. Multi-GPU MapReduce on GPU Clusters. In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium, May 2011.
55. Keys to Performance
• Process data in chunks
• More efficient transmission & computation
• Also allows out of core
• Overlap computation and communication
• Accumulate
• Partial Reduce
56. Partial Reduce
• User-specified function combines pairs with the same key
• Each map chunk spawns a partial reduction on that chunk
• Only good when cost of reduction is less than cost of transmission
• Good for a ~larger number of keys
• Reduces GPU→CPU bandwidth
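One generic way to get this effect with Thrust, sketched under the assumption that sorting then reducing by key is acceptable (GPMR's actual partial-reduce hook is user-supplied, so this is only an illustration):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Combine same-key pairs on the GPU before sending them to the host.
void partial_reduce(thrust::device_vector<int>& keys,
                    thrust::device_vector<float>& vals,
                    thrust::device_vector<int>& out_keys,
                    thrust::device_vector<float>& out_vals)
{
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
    out_keys.resize(keys.size());
    out_vals.resize(vals.size());
    // reduce_by_key sums runs of equal keys; it returns iterators just past
    // the last unique key / reduced value actually written.
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      out_keys.begin(), out_vals.begin());
    out_keys.resize(ends.first - out_keys.begin());
    out_vals.resize(ends.second - out_vals.begin());
}
```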
57. Accumulate
• Mapper has explicit knowledge about the nature of its key-value pairs
• Each map chunk accumulates its key-value outputs into GPU-resident accumulated key-value outputs
• Good for a small number of keys
• Also reduces GPU→CPU bandwidth
58. Benchmarks—Which
• Matrix Multiplication (MM)
• Word Occurrence (WO)
• Sparse-Integer Occurrence (SIO)
• Linear Regression (LR)
• K-Means Clustering (KMC)
• (Volume Renderer—presented
@ MapReduce ’10)
Jeff A. Stuart, Cheng-Kai Chen, Kwan-Liu Ma, and John D. Owens. Multi-GPU Volume Rendering using MapReduce. In HPDC '10:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing / MAPREDUCE '10: The First
International Workshop on MapReduce and its Applications, pages 841–848, June 2010
59. Benchmarks—Why
• Needed to stress aspects of GPMR
• Unbalanced work (WO)
• Multiple emits/Non-uniform number of emits (LR, KMC, WO)
• Sparsity of keys (SIO)
• Accumulation (WO, LR, KMC)
• Many key-value pairs (SIO)
• Compute Bound Scalability (MM)
61. Benchmarks—Results

Speedup vs. CPU       MM        KMC      LR      SIO     WO
1-GPU Speedup         162.712   2.991    1.296   1.450   11.080
4-GPU Speedup         559.209   11.726   4.085   2.322   18.441
TABLE 2: Speedup for GPMR over Phoenix on our large (second-biggest) input data from our first set. The exception is MM, for which we use our small input set (Phoenix required almost twenty seconds to multiply two 1024 × 1024 matrices).

Speedup vs. GPU       MM        KMC       WO
1-GPU Speedup         2.695     37.344    3.098
4-GPU Speedup         10.760    129.425   11.709
TABLE 3: Speedup for GPMR over Mars on 4096 × 4096 Matrix Multiplication, an 8M-point K-Means Clustering, and a 512 MB Word Occurrence. These sizes represent the largest problems that can meet the in-core memory requirements of Mars.
65. MapReduce Summary
• Time is right to explore diverse programming models on
GPUs
• Few threads -> many, heavy threads -> light
• Lack of {bandwidth, GPU primitives for communication,
access to NIC} are challenges
• What happens if GPUs get a lightweight serial processor?
• Future hybrid hardware is exciting!
67. Nested Data Parallelism
How do we generalize this? How do we abstract it? What are design alternatives?
(Figure: the sparse-matrix example and the subpixel/pixel composite example both reduce to segmented scans and segmented reductions over irregular segments.)
68. Hashing
Tradeoffs between construction time, access time, and space? Incremental data structures?
69. Work Stealing & Donating
Generalize and abstract task queues? [structural pattern?] Persistent threads / other sticky CUDA constructs? Whole pipelines? Priorities / real-time constraints?
(Figure: per-SM deques with locks and I/O, as before.)
• Cederman and Tsigas: Stealing == best performance and scalability
• We showed how to do this with an uberkernel and persistent-thread programming style
• We added donating to minimize memory usage
70. Horizontal vs. Vertical Development
(Figure: software stacks. CPU: Applications → Higher-Level Libraries → Primitive Libraries → Programming System Primitives → Hardware. GPU (usual): each application sits directly on programming-system primitives and hardware; little code reuse! GPU (our goal): applications share algorithm/data-structure libraries (domain-specific, algorithm, data structure, etc.) on top of programming-system primitives.) [CUDPP library]
71. Energy Cost per Operation
Energy Cost of Operations
Operation Energy (pJ)
64b Floating FMA (2 ops) 100
64b Integer Add 1
Write 64b DFF 0.5
Read 64b Register (64 x 32 bank) 3.5
Read 64b RAM (64 x 2K) 25
Read tags (24 x 2K) 8
Move 64b 1mm 6
Move 64b 20mm 120
Move 64b off chip 256
Read 64b from DRAM 2000
From Bill Dally talk, September 2009
74. The Programmable Pipeline
Bricks & mortar: how do we allow programmers to build stages without worrying about assembling them together?
Raster pipeline: Input Assembly → Vertex Shading → Tess. Shading → Geom. Shading → Raster → Frag. Shading → Compose
Reyes pipeline: Split → Dice → Shading → Sampling → Composition
Ray tracing pipeline: Ray Generation → Ray Traversal → Ray-Primitive Intersection → Shading
75. The Programmable Pipeline
(The same three pipelines as above, annotated with [Irregular] Parallelism and Heterogeneity.)
What does it look like when you build applications with GPUs as the primary processor?
76. Owens Group Interests
• Irregular parallelism
• Fundamental data structures and algorithms
• Computational patterns for parallelism
• Optimization: self-tuning and tools
• Abstract models of GPU computation
• Multi-GPU computing: programming models and abstractions
77. Thanks to ...
• Nathan Bell, Michael Garland, David Luebke, Sumit Gupta,
Dan Alcantara, Anjul Patney, and Stanley Tzeng for helpful
comments and slide material.
• Funding agencies: Department of Energy (SciDAC Institute
for Ultrascale Visualization, Early Career Principal
Investigator Award), NSF, Intel Science and Technology
Center for Visual Computing, LANL, BMW, NVIDIA, HP, UC
MICRO, Microsoft, ChevronTexaco, Rambus
78. “If you were plowing a field, which
would you rather use? Two strong
oxen or 1024 chickens?”
—Seymour Cray
80. Parallel Prefix Sum (Scan)
• Given an array A = [a0, a1, …, an-1]
and a binary associative operator ⊕ with identity I,
• scan(A) = [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]
• Example: if ⊕ is addition, then scan on the set
• [3 1 7 0 4 1 6 3]
• returns the set
• [0 3 4 11 11 15 16 22]
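For reference, the example above is exactly what Thrust's exclusive_scan computes (a minimal sketch):

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

int main()
{
    int h[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    thrust::device_vector<int> d(h, h + 8);
    // Exclusive scan with +: out[i] = a[0] + ... + a[i-1], out[0] = 0.
    thrust::exclusive_scan(d.begin(), d.end(), d.begin());
    for (int i = 0; i < 8; ++i) printf("%d ", (int)d[i]);  // 0 3 4 11 11 15 16 22
    printf("\n");
    return 0;
}
```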
81. Segmented Scan
• Example: if ⊕ is addition, then scan on the set
• [3 1 7 | 0 4 1 | 6 3]
• returns the set
• [0 3 4 | 0 0 4 | 0 6]
• Same computational complexity as scan, but additionally
have to keep track of segments (we use head flags to mark
which elements are segment heads)
• Useful for nested data parallelism (quicksort)
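A small sketch of the same segmented-scan example using Thrust; Thrust expresses segments as per-element keys rather than the head flags described above, so the three segments become keys 0, 1, 2.

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

int main()
{
    int hv[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    int hk[8] = {0, 0, 0, 1, 1, 1, 2, 2};   // segment id per element (from head flags)
    thrust::device_vector<int> vals(hv, hv + 8), keys(hk, hk + 8);
    thrust::device_vector<int> out(8);
    // Exclusive scan that restarts at every new key (segment boundary).
    thrust::exclusive_scan_by_key(keys.begin(), keys.end(),
                                  vals.begin(), out.begin());
    for (int i = 0; i < 8; ++i) printf("%d ", (int)out[i]);  // 0 3 4 0 0 4 0 6
    printf("\n");
    return 0;
}
```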