This paper presents algorithms for performing database joins on a GPU. It is interesting work that may eventually lead to full database systems implemented on GPGPU platforms such as NVIDIA's CUDA.
3. Introduction
• Utilizing hardware features of the GPU
– Massive thread parallelism
– Fast inter-processor communication
– High memory bandwidth
– Coalesced access
9. Algorithms on GPU
• Tips for algorithm design
– Use the inherent concurrency.
– Keep SIMD nature in mind.
– Algorithms should be side-effect free.
– Memory properties:
• High memory bandwidth.
• Coalesced access (for spatial locality)
• Cache in local memory (for temporal locality)
• Access memory via indices and offsets.
11. Design and Implementation
• A complete set of parallel primitives:
– Map, scatter, gather, prefix scan, split, and
sort
• Low synchronization overhead.
• Scalable to hundreds of processors.
• Applicable to joins as well as other relational query
operators.
15. Prefix Scan
• A prefix scan applies a binary operator on the input
of size n and produces an output of size n.
• Ex: Prefix sum: cumulative sum of all elements to
the left of the current element.
– Exclusive (used in paper)
– Inclusive
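Both variants in a sequential Python sketch (the paper's GPU implementation computes these in parallel; function names are mine):

```python
def exclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    """Exclusive scan: out[i] combines all elements strictly left of i."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

def inclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    """Inclusive scan: out[i] also includes xs[i] itself."""
    return [op(e, x) for e, x in zip(exclusive_scan(xs, op, identity), xs)]

# exclusive_scan([3, 1, 4, 1, 5]) -> [0, 3, 4, 8, 9]
# inclusive_scan([3, 1, 4, 1, 5]) -> [3, 4, 8, 9, 14]
```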
22. Sort
• Bitonic sort
– Uses sorting networks, O(N log² N).
• Quick sort
– Partition using a random pivot until each partition fits in local memory.
– Sort each partition using bitonic sort.
– Partitioning can be parallelized using the split primitive.
– Complexity is O(N log N).
– 30% faster than bitonic sort in experiments.
– The paper therefore uses quicksort as its sort primitive.
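The bitonic building block can be sketched as follows. This is a sequential CPU sketch (function name and the power-of-two length restriction are mine, not the paper's): every compare-exchange in the inner loop is independent, which is why the GPU can run each pass as N parallel threads, for O(N log² N) total work.

```python
def bitonic_sort(a):
    """In-place bitonic sorting network; len(a) must be a power of two."""
    n = len(a)
    k = 2
    while k <= n:                      # size of bitonic sequences being merged
        j = k // 2
        while j >= 1:                  # compare-exchange distance
            for i in range(n):         # on a GPU: one thread per index i
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[l]) == ascending:
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2
    return a
```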
27. NINLJ on GPU
• Block nested-loop join
• Uses the map primitive on both relations
– Partition R into R’ blocks and S into S’ blocks.
– Create one thread group per (R’, S’) block pair.
– A thread in a thread group processes one tuple from R’ and matches it against all tuples from S’.
– All tuples in S’ are kept in the local cache.
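A sequential sketch of this blocking scheme (names and signature are illustrative; on the GPU each block pair maps to one thread group and each R’ tuple to one thread):

```python
def block_nlj(R, S, r_block, s_block, pred):
    """Block nested-loop join: iterate (R', S') block pairs; match every
    R' tuple against every S' tuple with an arbitrary predicate."""
    out = []
    for rb in range(0, len(R), r_block):
        R_blk = R[rb:rb + r_block]            # R' block
        for sb in range(0, len(S), s_block):
            S_blk = S[sb:sb + s_block]        # S' block, held in local memory
            for r in R_blk:                   # one GPU thread per R' tuple
                for s in S_blk:
                    if pred(r, s):
                        out.append((r, s))
    return out
```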
28. B+ Tree vs CSS Tree
• A B+-tree imposes
– Memory stalls when traversed (no spatial locality).
– Difficulty batching multiple searches (loses temporal locality).
• CSS-Tree (Cache optimized search tree)
– One dimensional array where nodes are indexed.
– Replaces traversal with computation.
– Can also perform parallel key lookups.
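The "replaces traversal with computation" idea can be illustrated with a binary, pointer-free search tree stored level-order in a flat array; a real CSS-tree generalizes this to wide nodes sized to cache lines. The sketch below (names mine) computes each child index arithmetically instead of chasing pointers:

```python
def level_order_layout(sorted_keys):
    """Place sorted keys into level-order (BFS) positions of a complete
    binary search tree stored in a flat array -- no child pointers."""
    n = len(sorted_keys)
    out = [None] * n
    it = iter(sorted_keys)

    def fill(i):                      # in-order walk over BFS indices
        if i < n:
            fill(2 * i + 1)
            out[i] = next(it)
            fill(2 * i + 2)

    fill(0)
    return out

def lookup(tree, key):
    """Descend by arithmetic alone: child index = 2*i + 1 (+1 if right)."""
    i = 0
    while i < len(tree):
        if key == tree[i]:
            return True
        i = 2 * i + 1 + (key > tree[i])
    return False
```

Because each node's position is computed, many independent lookups can run in parallel threads with no pointer-induced stalls.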
29. Indexed Nested Loop Join (INLJ)
• Uses Map primitive on outer relation
• Uses CSS tree for index.
• For each block in the outer relation R
– Start at the root node and compute the child at the next level.
• Within a node, binary search is shown to be better than sequential search.
– Descend until the data node is reached.
• Upper-level nodes are cached in local memory since they are the most frequently accessed.
30. Sort Merge Join
• Sort the relations R, S using the sort primitive
• Merge phase
– Break S into chunks (S’) of size M.
– Use the first and last key values of each S’ chunk to partition R into the same number of chunks.
– Merge all chunk pairs in parallel using map.
• Each thread group handles one chunk pair.
• Each thread matches one tuple of R against S’ using binary search.
• The chunk size M is chosen so that S’ fits in local memory.
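The merge phase above can be sketched sequentially like this (chunking parameter and names are mine; on the GPU each chunk pair is one thread group and each R tuple one thread):

```python
import bisect

def sort_merge_join(R, S, chunk):
    """Sort both inputs, cut S into local-memory-sized chunks, partition
    R to each chunk's key range, and probe the chunk via binary search."""
    R, S = sorted(R), sorted(S)
    out = []
    for lo in range(0, len(S), chunk):
        S_chunk = S[lo:lo + chunk]                    # S' in local memory
        r_lo = bisect.bisect_left(R, S_chunk[0])      # R partition covering
        r_hi = bisect.bisect_right(R, S_chunk[-1])    # this chunk's key range
        for r in R[r_lo:r_hi]:                        # one thread per R tuple
            i = bisect.bisect_left(S_chunk, r)
            while i < len(S_chunk) and S_chunk[i] == r:
                out.append((r, S_chunk[i]))
                i += 1
    return out
```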
31. Hash Join
• Uses split primitive on both relations
• Developed a parallel version of radix hash join
– Partitioning
• Split R and S into the same number of partitions, so S
partitions fit into the local memory
– Matching
• Choose the smaller of each R and S partition pair as the inner partition, to be loaded into local memory.
• The larger partition is used as the outer partition.
• Each tuple of the outer partition searches the inner partition for matches.
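A sketch of both phases on integer keys (an illustration of the idea, not the paper's code). The partitioning step shows the count / prefix-sum / scatter pattern of the split primitive; matching joins partition pairs with the smaller side in the inner, local-memory role:

```python
def radix_partition(keys, bits):
    """Split keys into 2**bits partitions by their low-order bits."""
    nparts = 1 << bits
    counts = [0] * nparts                 # phase 1: histogram
    for k in keys:
        counts[k & (nparts - 1)] += 1
    starts, acc = [], 0                   # phase 2: exclusive prefix sum
    for c in counts:
        starts.append(acc)
        acc += c
    out, cursor = [None] * len(keys), starts[:]
    for k in keys:                        # phase 3: scatter to slots
        p = k & (nparts - 1)
        out[cursor[p]] = k
        cursor[p] += 1
    return out, starts + [len(keys)]

def radix_hash_join(R, S, bits=4):
    """Join matching partitions; the smaller side of each pair is the
    inner (local-memory) side, the larger side is probed from outside."""
    Rp, Rb = radix_partition(R, bits)
    Sp, Sb = radix_partition(S, bits)
    out = []
    for p in range(1 << bits):
        r_part, s_part = Rp[Rb[p]:Rb[p + 1]], Sp[Sb[p]:Sb[p + 1]]
        inner_is_r = len(r_part) <= len(s_part)
        inner = r_part if inner_is_r else s_part
        outer = s_part if inner_is_r else r_part
        index = {}                        # inner partition in local memory
        for k in inner:
            index.setdefault(k, []).append(k)
        for k in outer:                   # one thread per outer tuple
            for m in index.get(k, []):
                out.append((m, k) if inner_is_r else (k, m))
    return out
```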
32. Lock-Free Scheme for Result
Output
• Problems
– The join result size is unknown in advance, and the maximum possible size (|R| x |S|) may not fit in memory.
– Concurrent writes to a shared output buffer are not atomic.
33. Lock-Free Scheme for Result Output (cont.)
• Solution: Three-phase scheme
– Each thread counts the number of join results it will produce.
– Compute a prefix sum on the counts to get an array of write locations and the total number of results generated by the join.
– The host code allocates result memory on the device.
– Re-run the join, with each thread writing its results at its precomputed location.
• The join effectively runs twice; that is acceptable because GPUs are fast.
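The three phases can be sketched sequentially (the per-slice loops stand in for GPU threads; names and signature are mine):

```python
def lock_free_join(R, S, pred, nthreads=4):
    """Three-phase lock-free result output for a nested-loop join."""
    step = -(-len(R) // nthreads)                 # ceil division
    slices = [R[i * step:(i + 1) * step] for i in range(nthreads)]
    # Phase 1: each thread only counts its results -- no writes yet.
    counts = [sum(1 for r in sl for s in S if pred(r, s)) for sl in slices]
    # Phase 2: exclusive prefix sum -> disjoint write offsets; the grand
    # total tells the host exactly how much device memory to allocate.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    out = [None] * total
    # Phase 3: re-run the join; each thread writes only inside its own
    # reserved range, so no locks or atomics are needed.
    for sl, off in zip(slices, offsets):
        for r in sl:
            for s in S:
                if pred(r, s):
                    out[off] = (r, s)
                    off += 1
    return out
```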
36. Workload
• R and S tables with 2 integer columns.
• SELECT R.rid, S.rid FROM R, S WHERE <predicate>
• SELECT R.rid, S.rid FROM R, S WHERE R.rid=S.rid
• SELECT R.rid, S.rid FROM R, S WHERE R.rid <= S.rid AND S.rid <= R.rid + k
• Tested on all combinations:
– Fix R, vary S; all values uniformly distributed. |R| = 1M.
– Performance impact of varying join selectivity. |R| = |S| = 16M.
– Non-uniform data distributions, also with varying join selectivity. |R| = |S| = 16M.
• Also tested with columns as strings.
37. Implementation Details on CPU
• Highly optimized primitives and join
algorithms matching hardware architecture
• Tuned for cache performance.
• Compiled programs using MSVC 8.0 with full optimizations.
• Used OpenMP for threading.
• 2-6X faster than their sequential counterparts.
38. Implementation Details on GPU
• CUDA parameters
– Number of thread groups (128)
– Number of threads for each thread group (64)
– Block size is 4 MB (for transfers from main memory to device memory)
44. CUDA vs. DirectX10
• DirectX10 is difficult to program because data must be stored as textures.
• NINLJ and INLJ have similar performance.
• HJ and SMJ are slower because of texture
decoding.
• Summary: low level primitives on GPGPU
are better than graphics primitives on
GPU.
45. Criticisms
• Applications of skew handling are unclear.
• Primitives are sufficient to implement the
given joins, but they do not prove the set
of primitives to be minimal.
46. Limitations and future research
directions
• Lack of synchronization mechanisms for
handling read/write conflicts on GPU.
• More primitives.
• More open GPGPU hardware spec for
optimizations.
• Power consumption on GPU.
• Lack of support for complex data types.
• On GPU in-memory database.
• Automatic detection of thread groups and
number of threads using program analysis
techniques.
47. Conclusion
• GPU-based primitives and join algorithms achieve a speedup of 2-7X over optimized CPU-based counterparts.
• NINLJ: 7.0X; INLJ: 6.1X; SMJ: 2.4X; HJ: 1.9X.
51. Skew Handling
• Skew in the data results in imbalanced partition sizes in partition-based algorithms (SMJ and HJ).
• Solution
– Identify partitions that do not fit into local memory.
– Decompose those partitions into multiple chunks, each at most the size of local memory.
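A minimal sketch of this rebalancing step, assuming `local_mem` is a tuple capacity (an illustrative unit, not the paper's parameter):

```python
def decompose_skewed(partitions, local_mem):
    """Cut any partition larger than local memory into local-memory-sized
    chunks so that every matching task fits in local memory again."""
    tasks = []
    for part in partitions:
        if len(part) <= local_mem:
            tasks.append(part)
        else:
            tasks.extend(part[i:i + local_mem]
                         for i in range(0, len(part), local_mem))
    return tasks
```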
52. Implementation Details on GPU
• CUDA parameters
– Number of threads for each thread group
– Number of thread groups
• DirectX10
– Join algorithms implemented using
programmable pipeline
• Vertex shader, geometry shader, and pixel shader