Presentation, HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU," by Mayank Daga and Mark Nutter at the AMD Developer Summit (APU13) Nov. 11-13.
2. RELEVANCE OF B+ TREE SEARCHES
- B+ Trees are a special case of B Trees
  - Fundamental data structure used in several popular database management systems (B Tree / B+ Tree variants): MongoDB, MySQL, CouchDB, SQLite
- High-throughput, read-only index searches are gaining traction in
  - Video-copy detection
  - Audio search
  - Online Transaction Processing (OLTP) benchmarks
- Increase in memory capacity allows many database tables to reside in memory
  - Brings computational performance to the forefront
2 | APU '13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public
3. A REAL-WORLD USE-CASE OF READ-ONLY SEARCHES
Mobile (app on a smartphone):
Step 1. Record audio
Step 2. Generate audio fingerprint
Step 3. Send search request to the server (thousands of clients send requests)

Server:
Step 1. Receive search requests
Step 2. Query the database (music library of millions of songs)
Step 3. Return search results to the client

[Figure: a B+ tree index over the song database, with keys 1-8 and leaves d1-d8]
4. DATABASE PRIMITIVES ON ACCELERATORS
- Discrete graphics processing units (dGPUs) provide a compelling mix of
  - Performance per Watt
  - Performance per Dollar
- dGPUs have been used to accelerate critical database primitives
  - scan
  - sort
  - join
  - aggregation
  - B+ Tree searches?
5. B+ TREE SEARCHES ON ACCELERATORS
- B+ Tree searches present significant challenges
  - Irregular representation in memory
    - An artifact of malloc() and new
  - Today's dGPUs do not have a direct mapping to the CPU virtual address space
    - Indirect links need to be converted to relative offsets
    - The tree must be copied to the dGPU, so one is always bound by the amount of GPU device memory
6. OUR SOLUTION
- Accelerated B+ Tree searches on a fused CPU+GPU processor (or APU [1])
  - Eliminates data copies by combining x86 CPU and vector GPU cores on the same silicon die
- Developed a memory allocator that forms a regular representation of the tree in memory
  - The fundamental data structure is not altered
  - Merely parts of its layout are changed
[1] www.hsafoundation.com
7. OUTLINE
- Motivation and Contribution
- Background
  - AMD APU Architecture
  - B+ Trees
- Approach
  - Transforming the Memory Layout
  - Eliminating the Divergence
- Results
  - Performance
  - Analysis
- Summary and Next Steps
8. AMD APU ARCHITECTURE
[Figure: block diagram of the AMD 2nd Gen. A-series APU; x86 cores and GPU vector cores share system memory (host memory and GPU frame-buffer) through the DRAM controllers, System Request Interface (SRI), crossbar, link controller, FCL, and RMB. UNB = Unified Northbridge, MCT = Memory Controller]
- The APU includes dedicated IOMMUv2 hardware, which
  - Provides a direct mapping between the GPU and CPU virtual address (VA) spaces
  - Enables the GPU to access system memory
  - Enables the GPU to track whether pages are resident in memory
- Today, GPU cores can access the VA space at a granularity of contiguous chunks
9. B+ TREES
[Figure: an example B+ tree with keys 1-8 in leaves d1-d8]
- A B+ Tree ...
  - is a dynamic, multi-level index
  - is efficient for retrieval of data stored in a block-oriented context
  - has a high fan-out to reduce disk I/O operations
- The order (b) of a B+ Tree measures the capacity of its nodes
- The number of children (m) in an internal node satisfies
  - ceil(b/2) <= m <= b
  - The root node can have as few as two children
- Number of keys in an internal node = (m - 1)
10. APPROACH FOR PARALLELIZATION
- Fine-grained (accelerate a single query)
  - Replace the binary search in each node with a K-ary search
  - Maximum performance improvement = log(k)/log(2)
  - Results in poor occupancy of the GPU cores
- Coarse-grained (perform many queries in parallel)
  - Enables data parallelism
  - Increases memory bandwidth with parallel reads
  - Increases throughput (transactions per second for OLTP)
11. TRANSFORMING THE MEMORY LAYOUT
[Figure: a single contiguous buffer holding nodes w/ metadata, then keys, then values]
- Metadata
  - Number of keys in a node
  - Offset to keys/values in the buffer
  - Offset to the first child node
  - Whether a node is a leaf
- Pass a pointer to this memory buffer to the accelerator
12. ELIMINATING THE DIVERGENCE
- Each work-item/thread executes a single query
- May increase divergence within a wavefront
  - Every query may follow a different path in the B+ Tree
[Figure: work-items WI-1 and WI-2 taking different paths through the example tree with leaves d1-d8]
- Sort the keys to be searched
  - Increases the chance that work-items within a wavefront follow similar paths in the B+ Tree
  - We use radix sort [1] to sort the keys on the GPU
[1] D. G. Merrill and A. S. Grimshaw, "Revisiting sorting for GPGPU stream architectures," in Proceedings of the 19th Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT '10). New York, NY, USA: ACM, 2010.
13. IMPACT OF DIVERGENCE IN B+ TREE SEARCHES
[Figure: impact of divergence for 16K-128K queries across B+ tree orders 8-128, on the AMD Radeon HD 7660 and the AMD Phenom II X6 1090T]
- Impact of divergence on GPU: 3.7-fold (average)
- Impact of divergence on CPU: 1.8-fold (average)
14. OUTLINE
- Motivation and Contribution
- Background
  - AMD APU Architecture
  - B+ Trees
- Approach
  - Transforming the Memory Layout
  - Eliminating the Divergence
- Results
  - Performance
  - Analysis
- Summary and Next Steps
15. EXPERIMENTAL SETUP
[Figure: frequency histogram of the search keys, normally distributed over the 4 million (4194304) record IDs; table columns: E ID (PKey), Age]
- Software
  - A B+ Tree w/ 4M records is used
  - Search queries are created using normal_distribution() (a C++11 feature)
  - The queries have been sorted
  - CPU implementation from http://www.amittai.com/prose/bplustree.html
  - Driver: AMD Catalyst v12.8
  - Programming model: OpenCL
- Hardware
  - AMD Radeon HD 7660 APU (Trinity)
    - 4 cores w/ 6GB DDR3, 6 CUs w/ 2GB DDR3
  - AMD Phenom II X6 1090T + AMD Radeon HD 7970 (Tahiti)
    - 6 cores w/ 8GB DDR3, 32 CUs w/ 3GB GDDR5
  - "Device memory" results do not include data-copy time
16. RESULTS - QUERIES PER SECOND
[Figure: queries/second (million, log scale) for 16K-128K queries across B+ tree orders 4-128; series: AMD Phenom II X6 1090T (6 threads + SSE), AMD Radeon HD 7970 (device memory), AMD Radeon HD 7970 (pinned memory); peak of ~700M queries/sec]
- dGPU (device memory): ~350M queries/sec (avg.)
- dGPU (pinned memory): ~9M queries/sec (avg.)
- AMD Phenom CPU: ~18M queries/sec (avg.)
17. RESULTS - QUERIES PER SECOND
[Figure: queries/second (million) for 16K-128K queries across B+ tree orders 4-128; series: AMD Phenom II X6 1090T (6 threads + SSE), AMD Radeon HD 7660 (device memory), AMD Radeon HD 7660 (pinned memory)]
- APU (device memory): ~66M queries/sec (avg.)
- APU (pinned memory): ~40M queries/sec (avg.)
- APU (pinned memory) is faster than the CPU implementation
18. RESULTS - SPEEDUP
[Figure: speedup for 16K-128K queries across B+ tree orders 4-128; series: AMD Radeon HD 7660 (pinned memory) and AMD Radeon HD 7660 (device memory); peak of 4.9-fold]
- Baseline: six-threaded, hand-tuned, SSE-optimized CPU implementation
- Average speedup: 4.3-fold (device memory), 2.5-fold (pinned memory)
- Efficacy of IOMMUv2 + HSA on the APU: the discrete GPU (device memory = 3GB) is bound by the size of B+ tree it can hold, whereas the APU (prototype software) also handles the larger tree sizes evaluated (< 1.5GB, 1.5GB-2.7GB, > 2.7GB)
19. ANALYSIS
- The accelerators and the CPU yield their best performance at different orders of the B+ Tree
- CPU: order = 64
  - The ability of CPUs to prefetch data is beneficial at higher orders
[Figure: buffer layout of nodes w/ metadata, keys, values]
- APU and dGPU: order = 16
  - GPUs do not have a prefetcher, so each cache line should be utilized as efficiently as possible
  - GPUs have a cache-line size of 64 bytes
  - Order = 16 is most beneficial (16 keys * 4 bytes = 64 bytes)
20. ANALYSIS
- Minimum batch size to match the CPU performance:

                         | Order = 64  | Order = 16
   dGPU (device memory)  | 4K queries  | 2K queries
   dGPU (pinned memory)  | N.A.        | N.A.
   APU (device memory)   | 10K queries | 4K queries
   APU (pinned memory)   | 20K queries | 16K queries

- reuse_factor: amortizing the cost of data copies to the GPU

         | 90% Queries | 100% Queries
   dGPU  | 15          | 54
   APU   | 100         | N.A.
21. PROGRAMMABILITY

GPU (OpenCL kernel body; offsets are relative to the tree buffer base):

    typedef global unsigned int g_uint;
    typedef global mynode g_mynode;

    int tid = get_global_id(0);
    int i = 0;
    g_mynode *c = (g_mynode *)root;

    /* find the leaf node */
    while (!c->is_leaf) {
        i = 0; /* restart at the first key of each node */
        while (i < c->num_keys) {
            if (keys[tid] >= ((g_uint *)((intptr_t)root + c->keys))[i])
                i++;
            else
                break;
        }
        c = (g_mynode *)((intptr_t)root + c->ptr + i*sizeof(mynode));
    }

    /* match the key in the leaf node */
    for (i = 0; i < c->num_keys; i++)
        if (((g_uint *)((intptr_t)root + c->keys))[i] == keys[tid])
            break;

    /* retrieve the record */
    if (i != c->num_keys)
        records[tid] = ((g_uint *)((intptr_t)root + c->is_leaf + i*sizeof(g_uint)))[0];

CPU-SSE (per-query search with vectorized key comparison):

    int i = 0, j;
    node *c = root;
    __m128i vkey = _mm_set1_epi32(key);
    __m128i vnodekey, *vptr;
    short int mask;

    /* find the leaf node */
    while (!c->is_leaf) {
        for (i = 0; i < (c->num_keys - 3); i += 4) {
            vptr = (__m128i *)&(c->keys[i]);
            vnodekey = _mm_load_si128(vptr);
            mask = _mm_movemask_ps(_mm_cvtepi32_ps(_mm_cmplt_epi32(vkey, vnodekey)));
            if (mask & 8)
                break;
        }
        for (j = i; j < c->num_keys; j++)
            if (key < c->keys[j])
                break;
        c = (node *)c->pointers[j];
    }

    /* match the key in the leaf node */
    for (i = 0; i < c->num_keys; i++)
        if (c->keys[i] == key)
            break;

    /* retrieve the record */
    if (i != c->num_keys)
        return (record *)c->pointers[i];
22. RELATED WORK
- J. Fix, A. Wilkes, and K. Skadron, "Accelerating Braided B+ Tree Searches on a GPU with CUDA," in Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors: Analysis, Implementation, and Performance, in conjunction with ISCA, 2011
  - Authors report ~10-fold speedup over a single-threaded, non-SSE CPU implementation using a discrete NVIDIA GTX 480 GPU (data copies not taken into account)
- C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey, "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs," SIGMOD Conference, 2010
  - Authors report ~100M queries per second using a discrete NVIDIA GTX 280 GPU (data copies not taken into account)
- J. Sewall, J. Chhugani, C. Kim, N. Satish, and P. Dubey, "PALM: Parallel, Architecture-Friendly, Latch-Free Modifications to B+ Trees on Multi-Core Processors," Proceedings of the VLDB Endowment (VLDB 2011)
  - Applicable for B+ Tree modifications on the GPU
24. SUMMARY
- The B+ Tree is a fundamental data structure in many RDBMS
  - Accelerating B+ Tree searches is critical
  - Doing so presents significant challenges on discrete GPUs
- We have accelerated B+ Tree searches by exploiting coarse-grained parallelism on an APU
  - 2.5-fold (avg.) speedup over a 6-thread + SSE CPU implementation
- Possible next steps
  - HSA + IOMMUv2 would reduce the need to modify the B+ Tree representation
  - Investigate CPU-GPU co-scheduling
  - Investigate modifications of the B+ Tree