EXPLOITING COARSE-GRAINED PARALLELISM IN
B+ TREE SEARCHES ON APUS
MAYANK DAGA, MARK NUTTER
AMD RESEARCH
RELEVANCE OF B+ TREE SEARCHES
• B+ Trees are a special case of B Trees
‒ Fundamental data structure used in several popular database
management systems
[Figure: B Tree and B+ Tree, shown with the logos of MongoDB, MySQL, CouchDB, and SQLite]

• High-throughput, read-only index searches are gaining traction in
‒ Video-copy detection
‒ Audio search
‒ Online Transaction Processing (OLTP) benchmarks

• Increase in memory capacity allows many database tables to
reside in memory
‒ Brings computational performance to the forefront

2 | APU ’13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public
A REAL-WORLD USE-CASE OF READ-ONLY SEARCHES

Mobile (app on a smartphone):
Step 1. Record audio
Step 2. Generate audio fingerprint
Step 3. Send search request to server

Server:
Step 1. Receive search requests
Step 2. Query database
Step 3. Return search results to client

[Figure: thousands of clients send requests to a server; the database is a music library of millions of songs, drawn as a B+ Tree with data records d1 to d8]
DATABASE PRIMITIVES ON ACCELERATORS
• Discrete graphics processing units (dGPUs) provide a compelling
mix of
‒ Performance per Watt
‒ Performance per Dollar

• dGPUs have been used to accelerate critical database primitives
‒ scan
‒ sort
‒ join
‒ aggregation
‒ B+ Tree searches?
B+ TREE SEARCHES ON ACCELERATORS
• B+ Tree searches present significant challenges
‒ Irregular representation in memory
  ‒ An artifact of malloc() and new

‒ Today’s dGPUs do not have a direct mapping to the CPU virtual address
space
  ‒ Indirect links need to be converted to relative offsets

‒ Requirement to copy the tree to the dGPU
  ‒ The tree size is always bound by the amount of GPU device memory
OUR SOLUTION
• Accelerated B+ Tree searches on a fused CPU+GPU processor (or
APU [1])
‒ Eliminates data copies by combining x86 CPU
and vector GPU cores on the same silicon die

• Developed a memory allocator to form a regular representation
of the tree in memory
‒ Fundamental data structure is not altered
‒ Only parts of its layout are changed

[1] www.hsafoundation.com
OUTLINE
• Motivation and Contribution
• Background
‒ AMD APU Architecture
‒ B+ Trees

• Approach
‒ Transforming the Memory Layout
‒ Eliminating the Divergence

• Results
‒ Performance
‒ Analysis

• Summary and Next Steps
AMD APU ARCHITECTURE
[Figure: block diagram of the AMD 2nd Gen. A-series APU. Labels: System Memory (Host Memory, GPU Frame-Buffer), x86 Cores, GPU Vector Cores, DRAM Controllers, RMB, System Request Interface (SRI), xBar, Link Controller, MCT, UNB, FCL, Platform Interfaces]
UNB - Unified Northbridge, MCT - Memory Controller

• The APU includes dedicated IOMMUv2 hardware
‒ Provides direct mapping between GPU and CPU virtual address (VA)
space
‒ Enables GPUs to access the system memory
‒ Enables GPUs to track whether pages are resident in memory

• Today, GPU cores can access VA space at a granularity of contiguous
chunks
B+ TREES

[Figure: example B+ Tree whose leaves hold keys 1 to 8 and point to data records d1 to d8]

• A B+ Tree
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
‒ has a high fan-out to reduce disk I/O operations

• Order (b) of a B+ Tree measures the capacity of its nodes
• Number of children (m) in an internal node is
‒ ⌈b/2⌉ <= m <= b
‒ Root node can have as few as two children

• Number of keys in an internal node = (m – 1)
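
The node-capacity bounds above can be checked with a few lines of C. This is a minimal sketch, not code from the talk: the function names are hypothetical, and the lower bound is read as the ceiling of b/2.

```c
#include <assert.h>

/* Minimum number of children of a non-root internal node: ceil(b / 2). */
static unsigned min_children(unsigned order) {
    assert(order >= 2);
    return (order + 1) / 2;
}

/* Maximum number of children of an internal node: the order b itself. */
static unsigned max_children(unsigned order) {
    assert(order >= 2);
    return order;
}

/* An internal node with m children stores m - 1 keys. */
static unsigned keys_for_children(unsigned m) {
    assert(m >= 1);
    return m - 1;
}
```

For example, with order b = 16 an internal node has between 8 and 16 children, and therefore between 7 and 15 keys.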
APPROACH FOR PARALLELIZATION
• Fine-grained (accelerate a single query)
‒ Replace binary search in each node with k-ary search
‒ Maximum performance improvement = log(k)/log(2)
‒ Results in poor occupancy of the GPU cores

• Coarse-grained (perform many queries in parallel)
‒ Enables data-parallelism
‒ Increases memory bandwidth with parallel reads
‒ Increases throughput (transactions per second for OLTP)
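
The log(k)/log(2) bound for the fine-grained approach can be illustrated by counting search rounds. The sketch below is an illustration under simple assumptions (each round of a k-ary search keeps 1/k of the remaining range); the function name is hypothetical.

```c
#include <assert.h>

/* Number of rounds a k-ary search needs to narrow n candidates down to one.
   Each round keeps 1/k of the range, so the result is ceil(log_k(n)).
   Assumes n is small enough that reach does not overflow. */
static unsigned search_rounds(unsigned long n, unsigned k) {
    assert(n >= 1 && k >= 2);
    unsigned rounds = 0;
    unsigned long reach = 1;   /* candidates distinguishable after `rounds` rounds */
    while (reach < n) {
        reach *= k;
        rounds++;
    }
    return rounds;
}
```

With n = 4096 keys, binary search (k = 2) needs 12 rounds and 16-ary search needs 3, a ratio of 4 = log(16)/log(2). The gain per query is bounded by this ratio, which is why the coarse-grained approach is pursued instead.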
TRANSFORMING THE MEMORY LAYOUT
[Figure: one contiguous memory buffer laid out as: nodes w/ metadata .. | keys .. | values ..]

• Metadata
‒ Number of keys in a node
‒ Offset to keys/values in the buffer
‒ Offset to the first child node
‒ Whether a node is a leaf

• Pass a pointer to this memory buffer to the accelerator
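
The layout above can be sketched in C as a single array in which every link is an offset from the start of the buffer rather than a pointer. This is a minimal sketch under stated assumptions, not the allocator from the talk: the field order, the use of 32-bit word offsets, the rule that the children of a node are stored contiguously, and the names flat_search and example_tree are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed node header: four 32-bit words, all offsets counted in words
   from the start of the buffer. */
enum { N_KEYS = 0, KEYS_OFF = 1, CHILD_OFF = 2, IS_LEAF = 3, NODE_WORDS = 4 };

/* Tiny example tree: a root with separator key 5 and two leaves. */
static const uint32_t example_tree[21] = {
    /* nodes: {num_keys, keys_off, first_child_or_values_off, is_leaf} */
    1, 12,  4, 0,   /* word 0: root, children start at word 4 */
    2, 13, 17, 1,   /* word 4: leaf 0 */
    2, 15, 19, 1,   /* word 8: leaf 1 */
    /* keys */
    5,              /* word 12: root separator */
    1, 3,           /* word 13: leaf 0 keys */
    5, 7,           /* word 15: leaf 1 keys */
    /* values */
    10, 30,         /* word 17: leaf 0 values */
    50, 70          /* word 19: leaf 1 values */
};

/* Returns 1 and stores the value if key is found, 0 otherwise. */
static int flat_search(const uint32_t *buf, uint32_t key, uint32_t *value) {
    assert(buf && value);
    uint32_t node = 0;                       /* the root sits at word 0 */
    while (!buf[node + IS_LEAF]) {
        const uint32_t *keys = buf + buf[node + KEYS_OFF];
        uint32_t i = 0;
        while (i < buf[node + N_KEYS] && key >= keys[i])
            i++;
        /* children are contiguous, so child i is a fixed stride away */
        node = buf[node + CHILD_OFF] + i * NODE_WORDS;
    }
    const uint32_t *keys = buf + buf[node + KEYS_OFF];
    const uint32_t *vals = buf + buf[node + CHILD_OFF];
    for (uint32_t i = 0; i < buf[node + N_KEYS]; i++) {
        if (keys[i] == key) {
            *value = vals[i];
            return 1;
        }
    }
    return 0;
}
```

Because the buffer contains no absolute addresses, the same bytes can be handed to an accelerator with a different address space, and only the base pointer changes.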
ELIMINATING THE DIVERGENCE
• Each work-item/thread executes a single query
• May increase divergence within a wave-front
‒ Every query may follow a different path in the B+ Tree

[Figure: work-items (WI) following different root-to-leaf paths in the example B+ Tree]

• Sort the keys to be searched
‒ Increases the chances of work-items within a wave-front following similar paths in the B+ Tree
‒ We use Radix Sort [1] to sort the keys on the GPU

[1] D. G. Merrill and A. S. Grimshaw, “Revisiting sorting for GPGPU stream architectures,” in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’10. New York, NY, USA: ACM, 2010.
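
The sorting step can be illustrated on the CPU with a least-significant-digit radix sort. This is a sequential stand-in written for illustration, not the GPU radix sort of [1]; the function name and the choice of 8-bit digits are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* LSD radix sort on 32-bit keys: four passes of 8 bits each.
   tmp must have room for n elements. */
static void radix_sort_u32(uint32_t *keys, uint32_t *tmp, size_t n) {
    assert(keys && tmp);
    for (unsigned shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        /* histogram of the current digit, shifted up by one slot */
        for (size_t i = 0; i < n; i++)
            count[((keys[i] >> shift) & 0xFFu) + 1]++;
        /* prefix sums: count[d] becomes the start index of bucket d */
        for (size_t d = 0; d < 256; d++)
            count[d + 1] += count[d];
        /* stable scatter into tmp */
        for (size_t i = 0; i < n; i++)
            tmp[count[(keys[i] >> shift) & 0xFFu]++] = keys[i];
        memcpy(keys, tmp, n * sizeof *keys);
    }
}
```

After sorting, neighbouring work-items receive neighbouring keys, so the work-items of one wave-front tend to visit the same nodes on their way to the leaves.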
IMPACT OF DIVERGENCE IN B+ TREE SEARCHES

[Figure: Impact of Divergence (y-axis 0 to 5) vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (8 to 128); series: AMD Radeon™ HD 7660, AMD Phenom™ II X6 1090T]

Impact of Divergence on GPU – 3.7-fold (average)
Impact of Divergence on CPU – 1.8-fold (average)
EXPERIMENTAL SETUP
• Software
‒ A B+ Tree w/ 4M records is used
‒ Search queries are created using normal_distribution() (C++11 feature)
‒ The queries have been sorted
‒ CPU implementation from http://www.amittai.com/prose/bplustree.html
‒ Driver: AMD Catalyst™ v12.8
‒ Programming model: OpenCL™

• Hardware
‒ AMD Radeon™ HD 7660 APU (Trinity): 4 cores w/ 6GB DDR3, 6 CUs w/ 2GB DDR3
‒ AMD Phenom™ II X6 1090T + AMD Radeon™ HD 7970 (Tahiti): 6 cores w/ 8GB DDR3, 32 CUs w/ 3GB GDDR5
‒ Device-memory results do not include data-copy time

[Figure: histogram of the generated search queries; x-axis: Bin (100000 to 4100000), y-axis: Frequency (0 to 60000)]
[Figure: database table with columns E ID (PKey) and Age; 4 million (4194304) entries]
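
The query generation can be approximated in plain C. The talk used the C++11 normal_distribution; the sketch below swaps in a different technique, the sum of twelve uniform samples (an Irwin-Hall approximation of a normal distribution), driven by a small linear congruential generator so that it is deterministic. The function names, the mean, the standard deviation, and the seed are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator (Numerical Recipes constants). */
static uint32_t lcg_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Approximate standard-normal sample: sum of 12 uniforms in [0,1) minus 6. */
static double approx_std_normal(uint32_t *state) {
    double sum = 0.0;
    for (int i = 0; i < 12; i++)
        sum += (lcg_next(state) >> 8) / 16777216.0;   /* top 24 bits -> [0,1) */
    return sum - 6.0;
}

/* Fill queries[] with keys spread around `mean`, clamped to [0, num_records). */
static void make_queries(uint32_t *queries, size_t n, uint32_t num_records,
                         double mean, double stddev, uint32_t seed) {
    assert(queries && num_records > 0);
    uint32_t state = seed;
    for (size_t i = 0; i < n; i++) {
        double x = mean + stddev * approx_std_normal(&state);
        if (x < 0.0)
            x = 0.0;
        if (x > (double)(num_records - 1))
            x = (double)(num_records - 1);
        queries[i] = (uint32_t)x;
    }
}
```

A call such as make_queries(q, 1000, 4194304u, 2097152.0, 500000.0, 1u) produces 1000 keys centred on the middle of the 4194304-entry table; the keys would then be sorted before being searched.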
RESULTS – QUERIES PER SECOND

[Figure: Queries/Second (Million), log scale from 1 to 1000, vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (4 to 128); series: AMD Phenom™ II X6 1090T (6-Threads+SSE), AMD Radeon™ HD 7970 (Device Memory), AMD Radeon™ HD 7970 (Pinned Memory); annotation: 700M Queries/Sec]

dGPU (device memory): ~350M Queries/Sec. (avg.)
dGPU (pinned memory): ~9M Queries/Sec. (avg.)
AMD Phenom™ CPU: ~18M Queries/Sec. (avg.)
RESULTS – QUERIES PER SECOND

[Figure: Queries/Second (Million), 0 to 140, vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (4 to 128); series: AMD Phenom™ II X6 1090T (6-Threads+SSE), AMD Radeon™ HD 7660 (Device Memory), AMD Radeon™ HD 7660 (Pinned Memory)]

APU (device memory): ~66M Queries/Sec. (avg.)
APU (pinned memory): ~40M Queries/Sec. (avg.)

APU (pinned memory) is faster than the CPU implementation
RESULTS - SPEEDUP

[Figure: Speedup, 0 to 12, vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (4 to 128); series: AMD Radeon™ HD 7660 (Pinned Memory), AMD Radeon™ HD 7660 (Device Memory); annotation: 4.9-fold speedup]

Baseline: six-threaded, hand-tuned, SSE-optimized CPU implementation.

Average Speedup – 4.3-fold (Device Memory), 2.5-fold (Pinned Memory)

• Efficacy of IOMMUv2 + HSA on the APU

Size of the B+ Tree
Platform                            < 1.5GB    1.5GB – 2.7GB    > 2.7GB
Discrete GPU (memory size = 3GB)    ✓          ✓                ✗
APU (prototype software)            ✓          ✓                ✓
ANALYSIS
• The accelerators and the CPU yield best performance for different
orders of the B+ Tree
‒ CPU → order = 64
  ‒ Ability of CPUs to prefetch data is beneficial for higher orders
‒ APU and dGPU → order = 16
  ‒ GPUs do not have a prefetcher → each cache line should be utilized as fully as possible
  ‒ GPUs have a cache-line size of 64 bytes
  ‒ Order = 16 is most beneficial (16 * 4 bytes)
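
The cache-line arithmetic behind the order = 16 result can be written out as a small C sketch. The function names are hypothetical, and the sketch assumes 4-byte keys that are stored contiguously and start on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of keys of key_size bytes that fit in one cache line. */
static size_t keys_per_cache_line(size_t cache_line_bytes, size_t key_size) {
    assert(key_size > 0);
    return cache_line_bytes / key_size;
}

/* Cache lines touched when scanning all keys of one node (rounded up). */
static size_t lines_per_node(size_t keys_in_node, size_t key_size,
                             size_t cache_line_bytes) {
    assert(cache_line_bytes > 0);
    return (keys_in_node * key_size + cache_line_bytes - 1) / cache_line_bytes;
}
```

With a 64-byte line and 4-byte keys, 16 keys fill exactly one line, so a node of 16 keys costs a single memory transaction, while a node of 64 keys spans four lines. A CPU prefetcher can hide those extra lines; a GPU without one cannot.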
ANALYSIS
• Minimum batch size to match the CPU performance

                        Order = 64     Order = 16
dGPU (device memory)    4K queries     2K queries
dGPU (pinned memory)    N.A.           N.A.
APU (device memory)     10K queries    4K queries
APU (pinned memory)     20K queries    16K queries

• reuse_factor - amortizing the cost of data-copies to the GPU

        90% Queries    100% Queries
dGPU    15             54
APU     100            N.A.
PROGRAMMABILITY

GPU (OpenCL kernel body)

    typedef global unsigned int g_uint;
    typedef global mynode g_mynode;
    int tid = get_global_id(0);   /* one query per work-item */
    int i = 0;
    g_mynode *c = (g_mynode *)root;
    /* find the leaf node */
    while(!c->is_leaf){
        i = 0;   /* restart the scan in each node (reset missing on the slide) */
        while (i < c->num_keys){
            if(keys[tid] >= ((g_uint *)((intptr_t)root + c->keys))[i])
                i++;
            else break;
        }
        /* children are contiguous: child i is at offset ptr + i*sizeof(mynode) */
        c = (g_mynode *)((intptr_t)root + c->ptr + i*sizeof(mynode));
    }
    /* match the key in the leaf node */
    for(i=0; i<c->num_keys; i++){
        if((((g_uint *)((intptr_t)root + c->keys))[i]) == keys[tid]) break;
    }
    /* retrieve the record; as on the slide, is_leaf is used as the offset to the leaf's values */
    if(i != c->num_keys)
        records[tid] = ((g_uint *)((intptr_t)root + c->is_leaf + i*sizeof(g_uint)))[0];

CPU-SSE

    int i = 0, j;
    node * c = root;
    __m128i vkey = _mm_set1_epi32(key);   /* broadcast the search key */
    __m128i vnodekey, *vptr;
    short int mask;
    /* find the leaf node */
    while( !c->is_leaf ){
        /* compare four node keys per iteration */
        for(i = 0; i < (c->num_keys-3); i+=4){
            vptr = (__m128i *)&(c->keys[i]);
            vnodekey = _mm_load_si128(vptr);
            mask = _mm_movemask_ps(_mm_cvtepi32_ps( _mm_cmplt_epi32(vkey, vnodekey)));
            if((mask) & 8) break;
        }
        /* scalar scan within the group of four, or over the remaining keys */
        for(j = i; j < c->num_keys; j++){
            if(key < c->keys[j]) break;
        }
        c = (node *)c->pointers[j];
    }
    /* match the key in the leaf node */
    for (i = 0; i < c->num_keys; i++)
        if (c->keys[i] == key) break;
    /* retrieve the record */
    if (i != c->num_keys)
        return (record *)c->pointers[i];
RELATED WORK
• J. Fix, A. Wilkes, and K. Skadron, “Accelerating Braided B+ Tree Searches on a GPU with CUDA,” in
Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors: Analysis,
Implementation, and Performance, in conjunction with ISCA, 2011
‒ Authors report ~10-fold speedup over a single-threaded, non-SSE CPU implementation, using a discrete NVIDIA GTX
480 GPU (do not take data-copies into account)

• C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, P. Dubey, “FAST:
fast architecture sensitive tree search on modern CPUs and GPUs,” SIGMOD Conference, 2010
‒ Authors report ~100M queries per second using a discrete NVIDIA GTX 280 GPU (do not take data-copies into
account)

• J. Sewall, J. Chhugani, C. Kim, N. Satish, P. Dubey, “PALM: Parallel, Architecture-Friendly, Latch-Free
Modifications to B+ Trees on Multi-Core Processors,” Proceedings of the VLDB Endowment (VLDB 2011)
‒ Applicable for B+ Tree modifications on the GPU
POSSIBILITIES WITH HSA
[Figure: the example B+ Tree held in memory shared by both processors; GPU vector cores perform parallel reads for searches, x86 cores perform serial modifications]

Graphics Core Next, User-mode Queuing, Shared Virtual Memory
SUMMARY
• B+ Tree is the fundamental data structure in many RDBMS
‒ Accelerating B+ Tree searches is critical
‒ Presents significant challenges on discrete GPUs

• We have accelerated B+ Tree searches by exploiting coarse-grained parallelism on an APU
‒ 2.5-fold (avg.) speedup over 6-threads+SSE CPU implementation

• Possible Next Steps
‒ HSA + IOMMUv2 would reduce the need to modify the B+ Tree
representation
‒ Investigate CPU-GPU co-scheduling
‒ Investigate modifications on the B+ Tree
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Phenom, Radeon, Catalyst and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of
their respective owners.


DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 
Leverage the Speed of OpenCLℱ with AMD Math Libraries
Leverage the Speed of OpenCLℱ with AMD Math LibrariesLeverage the Speed of OpenCLℱ with AMD Math Libraries
Leverage the Speed of OpenCLℱ with AMD Math LibrariesAMD Developer Central
 
An Introduction to OpenCLℱ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCLℱ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCLℱ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCLℱ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
DirectGMA on AMD’S FireProℱ GPUS
DirectGMA on AMD’S  FireProℱ GPUSDirectGMA on AMD’S  FireProℱ GPUS
DirectGMA on AMD’S FireProℱ GPUSAMD Developer Central
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...AMD Developer Central
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerAMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellAMD Developer Central
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesAMD Developer Central
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerAMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14AMD Developer Central
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...AMD Developer Central
 

Mehr von AMD Developer Central (20)

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 
Leverage the Speed of OpenCLℱ with AMD Math Libraries
Leverage the Speed of OpenCLℱ with AMD Math LibrariesLeverage the Speed of OpenCLℱ with AMD Math Libraries
Leverage the Speed of OpenCLℱ with AMD Math Libraries
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
An Introduction to OpenCLℱ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCLℱ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCLℱ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCLℱ Programming with AMD GPUs - AMD & Acceleware Webinar
 
DirectGMA on AMD’S FireProℱ GPUS
DirectGMA on AMD’S  FireProℱ GPUSDirectGMA on AMD’S  FireProℱ GPUS
DirectGMA on AMD’S FireProℱ GPUS
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop Intelligence
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin Fuller
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin Fuller
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
 

KĂŒrzlich hochgeladen

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

KĂŒrzlich hochgeladen (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU," by Mayank Daga and Mark Nutter

  • 1. EXPLOITING COARSE-GRAINED PARALLELISM IN B+ TREE SEARCHES ON APUS
      Mayank Daga, Mark Nutter (AMD Research)
  • 2. RELEVANCE OF B+ TREE SEARCHES
      • B+ Trees are a special case of B Trees
          ‒ Fundamental data structure used in several popular database management systems
          [Figure labels: B Tree, B+ Tree; mongoDB, MySQL, CouchDB, SQLite]
      • High-throughput, read-only index searches are gaining traction in
          ‒ Video-copy detection
          ‒ Audio search
          ‒ Online Transaction Processing (OLTP) benchmarks
      • Increase in memory capacity allows many database tables to reside in memory
          ‒ Brings computational performance to the forefront
      2 | APU ’13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public
  • 3. A REAL-WORLD USE-CASE OF READ-ONLY SEARCHES
      • Mobile (app on a smartphone):
          Step 1. Record audio
          Step 2. Generate audio fingerprint
          Step 3. Send search request to server
      • Server:
          Step 1. Receive search requests
          Step 2. Query database
          Step 3. Return search results to client
      • Thousands of clients send requests
      • Database: music library with millions of songs
      [Figure: B+ Tree index over the database, with leaf keys 1 to 8 pointing to records d1 to d8]
  • 4. DATABASE PRIMITIVES ON ACCELERATORS
      • Discrete graphics processing units (dGPUs) provide a compelling mix of
          ‒ Performance per Watt
          ‒ Performance per Dollar
      • dGPUs have been used to accelerate critical database primitives
          ‒ scan
          ‒ sort
          ‒ join
          ‒ aggregation
          ‒ B+ Tree searches?
  • 5. B+ TREE SEARCHES ON ACCELERATORS
      • B+ Tree searches present significant challenges
          ‒ Irregular representation in memory
              ‒ An artifact of malloc() and new()
          ‒ Today’s dGPUs do not have a direct mapping to the CPU virtual address space
              ‒ Indirect links need to be converted to relative offsets
          ‒ Requirement to copy the tree to the dGPU, which entails
              ‒ One is always bound by the amount of GPU device memory
  • 6. OUR SOLUTION
      • Accelerated B+ Tree searches on a fused CPU+GPU processor (or APU [1])
          ‒ Eliminates data copies by combining x86 CPU and vector GPU cores on the same silicon die
      • Developed a memory allocator to form a regular representation of the tree in memory
          ‒ Fundamental data structure is not altered
          ‒ Only parts of its layout are changed
      [1] www.hsafoundation.com
  • 7. OUTLINE
      • Motivation and Contribution
      • Background
          ‒ AMD APU Architecture
          ‒ B+ Trees
      • Approach
          ‒ Transforming the Memory Layout
          ‒ Eliminating the Divergence
      • Results
          ‒ Performance
          ‒ Analysis
      • Summary and Next Steps
  • 8. AMD APU ARCHITECTURE
      [Figure: block diagram of the AMD 2nd Gen. A-series APU showing x86 cores, GPU vector cores, System Request Interface (SRI), xBar, link controller, MCT, DRAM controllers, RMB, FCL, and platform interfaces, with system memory divided into host memory and GPU frame-buffer. UNB = Unified Northbridge, MCT = Memory Controller]
      • The APU includes dedicated IOMMUv2 hardware
          ‒ Provides direct mapping between GPU and CPU virtual address (VA) space
          ‒ Enables GPUs to access the system memory
          ‒ Enables GPUs to track whether pages are resident in memory
      • Today, GPU cores can access VA space at a granularity of contiguous chunks
  • 9. B+ TREES
      [Figure: example B+ Tree with leaf keys 1 to 8 pointing to records d1 to d8]
      • A B+ Tree
          ‒ is a dynamic, multi-level index
          ‒ is efficient for retrieval of data stored in a block-oriented context
          ‒ has a high fan-out to reduce disk I/O operations
      • Order (b) of a B+ Tree measures the capacity of its nodes
      • Number of children (m) in an internal node is
          ‒ ⌈b/2⌉ <= m <= b
          ‒ Root node can have as few as two children
      • Number of keys in an internal node = (m – 1)
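The node-capacity rules on this slide can be checked mechanically. The following is a minimal sketch in C, assuming the bracket in the bound denotes the ceiling of b/2 (the usual B+ Tree convention); the function names are illustrative and do not come from the slides.

```c
#include <stdbool.h>

/* Smallest child count allowed in a non-root internal node: ceil(order / 2). */
static int min_children(int order) {
    return (order + 1) / 2;
}

/* True if an internal node with `children` children respects the bounds.
 * The root is allowed to have as few as two children. */
static bool valid_internal_node(int order, int children, bool is_root) {
    int lowest = is_root ? 2 : min_children(order);
    return children >= lowest && children <= order;
}

/* An internal node with m children stores m - 1 keys. */
static int keys_in_internal_node(int children) {
    return children - 1;
}
```

For an order of 16, a non-root internal node would hold between 8 and 16 children, and therefore between 7 and 15 keys.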
  • 10. APPROACH FOR PARALLELIZATION
      • Fine-grained (accelerate a single query)
          ‒ Replace binary search in each node with K-ary search
          ‒ Maximum performance improvement = log(k)/log(2)
          ‒ Results in poor occupancy of the GPU cores
      • Coarse-grained (perform many queries in parallel)
          ‒ Enables data-parallelism
          ‒ Increases memory bandwidth with parallel reads
          ‒ Increases throughput (transactions per second for OLTP)
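The log(k)/log(2) bound follows from counting comparison rounds: binary search needs about log2(n) rounds over n keys, a k-ary search needs about logk(n), and the ratio of the two is log2(k). A small sketch makes the count concrete; search_rounds is a hypothetical helper, not part of the presented implementation.

```c
/* Rounds needed to narrow n candidates down to one when every round
 * splits the remaining range into k parts (k = 2 is binary search).
 * Returns -1 for k < 2, where the range would never shrink. */
static int search_rounds(unsigned long n, unsigned long k) {
    if (k < 2) {
        return -1;
    }
    int rounds = 0;
    while (n > 1) {
        n = (n + k - 1) / k;   /* size of the largest remaining part */
        rounds++;
    }
    return rounds;
}
```

With n = 4096 keys this gives 12 rounds for k = 2 and 3 rounds for k = 16, a ratio of 4 = log(16)/log(2).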
  • 11. TRANSFORMING THE MEMORY LAYOUT
      [Figure: one contiguous memory buffer laid out as: nodes w/ metadata .. | keys .. | values ..]
      • Metadata
          ‒ Number of keys in a node
          ‒ Offset to keys/values in the buffer
          ‒ Offset to the first child node
          ‒ Whether a node is a leaf
      • Pass a pointer to this memory buffer to the accelerator
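The layout described on this slide can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the authors' allocator: the flat_node struct, its field names, and the tiny three-node demo tree are all hypothetical. The one property taken from the slide is that every link is an offset from the start of a single buffer, so the same buffer pointer is meaningful to any device that can see it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical node header; all links are byte offsets from the buffer start. */
typedef struct {
    uint32_t num_keys;     /* number of keys in this node */
    uint32_t keys;         /* offset to this node's keys */
    uint32_t values;       /* offset to this node's values (leaves only) */
    uint32_t first_child;  /* offset to the first child header (internal only) */
    uint32_t is_leaf;      /* non-zero for leaf nodes */
} flat_node;

static const uint32_t *at_u32(const void *base, uint32_t off) {
    return (const uint32_t *)((const unsigned char *)base + off);
}

/* Demo tree: root with key {30}, leaves {10,20} and {30,40}.
 * buf must hold at least 24 words: 3 headers, then keys, then values. */
static void build_demo(uint32_t *buf) {
    const uint32_t W = (uint32_t)sizeof(uint32_t);
    flat_node *n = (flat_node *)buf;
    n[0] = (flat_node){1, 15 * W, 0,      5 * W, 0};  /* root */
    n[1] = (flat_node){2, 16 * W, 20 * W, 0,     1};  /* left leaf */
    n[2] = (flat_node){2, 18 * W, 22 * W, 0,     1};  /* right leaf */
    static const uint32_t tail[] = {30, 10, 20, 30, 40, 100, 200, 300, 400};
    memcpy(buf + 15, tail, sizeof tail);
}

/* Returns 1 and stores the value if key is present, otherwise returns 0. */
static int flat_search(const void *base, uint32_t key, uint32_t *value_out) {
    const flat_node *c = (const flat_node *)base;     /* root header at offset 0 */
    while (!c->is_leaf) {
        const uint32_t *k = at_u32(base, c->keys);
        uint32_t i = 0;
        while (i < c->num_keys && key >= k[i]) {
            i++;
        }
        /* Children of a node are stored as consecutive headers. */
        c = (const flat_node *)((const unsigned char *)base + c->first_child) + i;
    }
    const uint32_t *k = at_u32(base, c->keys);
    const uint32_t *v = at_u32(base, c->values);
    for (uint32_t i = 0; i < c->num_keys; i++) {
        if (k[i] == key) {
            *value_out = v[i];
            return 1;
        }
    }
    return 0;
}
```

Because no absolute pointers are stored, the buffer could be copied or mapped at a different address without rewriting any links.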
  • 12. ELIMINATING THE DIVERGENCE
      • Each work-item/thread executes a single query
      • May increase divergence within a wave-front
          ‒ Every query may follow a different path in the B+ Tree
      [Figure: work-items WI-1 and WI-2 following different paths through the example B+ Tree (leaf keys 1 to 8, records d1 to d8)]
      • Sort the keys to be searched
          ‒ Increases the chances of work-items within a wave-front following similar paths in the B+ Tree
          ‒ We use Radix Sort [1] to sort the keys on the GPU
      [1] D. G. Merrill and A. S. Grimshaw, “Revisiting sorting for GPGPU stream architectures,” in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT ’10). New York, NY, USA: ACM, 2010.
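The pre-sorting step can be illustrated on the host. The sketch below is an assumption-laden stand-in: it uses the C library's qsort rather than the GPU radix sort cited on the slide, and the query struct with its slot field is hypothetical. It records each query's original position so that results can be written back in the order the clients submitted them.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical query record: the search key plus its original position. */
typedef struct {
    uint32_t key;
    uint32_t slot;
} query;

static int by_key(const void *a, const void *b) {
    uint32_t x = ((const query *)a)->key;
    uint32_t y = ((const query *)b)->key;
    return (x > y) - (x < y);
}

/* Tags every query with its submission index, then sorts the batch by key
 * so that neighbouring work-items are likely to walk similar tree paths. */
static void sort_queries(query *q, size_t n) {
    for (size_t i = 0; i < n; i++) {
        q[i].slot = (uint32_t)i;
    }
    qsort(q, n, sizeof *q, by_key);
}
```

After the search, the result for sorted position i would be stored at index q[i].slot of the output array. Note that qsort is not guaranteed to be stable, so equal keys may appear in either order.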
  • 13. IMPACT OF DIVERGENCE IN B+ TREE SEARCHES
      [Figure: impact of divergence (y-axis, 0 to 5) versus number of queries (16K, 32K, 64K, 128K) with order of B+ Tree (8 to 128), for AMD Radeonℱ HD 7660 and AMD Phenomℱ II X6 1090T]
      • Impact of divergence on GPU: 3.7-fold (average)
      • Impact of divergence on CPU: 1.8-fold (average)
  • 14. OUTLINE
      • Motivation and Contribution
      • Background
          ‒ AMD APU Architecture
          ‒ B+ Trees
      • Approach
          ‒ Transforming the Memory Layout
          ‒ Eliminating the Divergence
      • Results
          ‒ Performance
          ‒ Analysis
      • Summary and Next Steps
  • 15. EXPERIMENTAL SETUP
      • Software
          ‒ A B+ Tree w/ 4M records is used
          ‒ Search queries are created using normal_distribution() (C++11 feature)
              ‒ The queries have been sorted
          ‒ CPU implementation from http://www.amittai.com/prose/bplustree.html
          ‒ Driver: AMD Catalystℱ v12.8
          ‒ Programming model: OpenCLℱ
      • Hardware
          ‒ AMD Radeonℱ HD 7660 APU (Trinity)
              ‒ 4 cores w/ 6GB DDR3, 6 CUs w/ 2GB DDR3
          ‒ AMD Phenomℱ II X6 1090T + AMD Radeonℱ HD 7970 (Tahiti)
              ‒ 6 cores w/ 8GB DDR3, 32 CUs w/ 3GB GDDR5
          ‒ Device Memory does not include data-copy time
      [Figure: histogram of the generated query keys, frequency (0 to 60000) per bin (100000 to 4100000)]
      [Figure: table with columns ID (PKey) and Age, 4 million (4194304) entries, e.g. ID 0000001 with Age 34]
  • 16. RESULTS – QUERIES PER SECOND
      [Figure: queries/second in millions (axis ticks 1, 10, 100, 1000) versus number of queries (16K, 32K, 64K, 128K) with order of B+ Tree (4 to 128); series: AMD Phenomℱ II X6 1090T (6-Threads+SSE), AMD Radeonℱ HD 7970 (Device Memory), AMD Radeonℱ HD 7970 (Pinned Memory); annotation: 700M Queries/Sec]
      • dGPU (device memory): ~350M queries/sec (avg.)
      • dGPU (pinned memory): ~9M queries/sec (avg.)
      • AMD Phenomℱ CPU: ~18M queries/sec (avg.)
  • 17. RESULTS – QUERIES PER SECOND
      [Figure: queries/second in millions (0 to 140) versus number of queries (16K, 32K, 64K, 128K) with order of B+ Tree (4 to 128); series: AMD Phenomℱ II X6 1090T (6-Threads+SSE), AMD Radeonℱ HD 7660 (Device Memory), AMD Radeonℱ HD 7660 (Pinned Memory)]
      • APU (device memory): ~66M queries/sec (avg.)
      • APU (pinned memory): ~40M queries/sec (avg.)
      • APU (pinned memory) is faster than the CPU implementation
  • 18. RESULTS - SPEEDUP
      [Figure: speedup (0 to 12) versus number of queries (16K, 32K, 64K, 128K) with order of B+ Tree (4 to 128); series: AMD Radeonℱ HD 7660 (Pinned Memory), AMD Radeonℱ HD 7660 (Device Memory); annotation: 4.9-fold speedup]
      • Baseline: six-threaded, hand-tuned, SSE-optimized CPU implementation
      • Average speedup: 4.3-fold (Device Memory), 2.5-fold (Pinned Memory)
      • Efficacy of IOMMUv2 + HSA on the APU platform

          Size of the B+ Tree    Discrete GPU (memory size = 3GB)    APU (prototype software)
          < 1.5GB                ✓                                   ✓
          1.5GB – 2.7GB          ✓                                   ✓
          > 2.7GB                ✗                                   ✓
  • 19. ANALYSIS
      • The accelerators and the CPU yield best performance for different orders of the B+ Tree
          ‒ CPU → order = 64
              ‒ Ability of CPUs to prefetch data is beneficial for higher orders
          ‒ APU and dGPU → order = 16
              ‒ GPUs do not have a prefetcher → cache line should be most efficiently utilized
              ‒ GPUs have a cache-line size of 64 bytes
              ‒ Order = 16 is most beneficial (16 * 4 bytes)
      [Figure: memory buffer laid out as: nodes w/ metadata .. | keys .. | values ..]
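The cache-line arithmetic on this slide can be written out directly. The helper below is a hypothetical illustration that follows the slide's own accounting (order multiplied by a 4-byte key size) and counts how many cache lines one node's keys would span.

```c
#include <stddef.h>

/* Cache lines spanned by one node's keys, using the slide's accounting of
 * order * key_bytes bytes per node. Returns 0 if line_bytes is 0. */
static size_t cache_lines_per_node(size_t order, size_t key_bytes,
                                   size_t line_bytes) {
    if (line_bytes == 0) {
        return 0;
    }
    size_t node_bytes = order * key_bytes;
    return (node_bytes + line_bytes - 1) / line_bytes;  /* round up */
}
```

With 4-byte keys and 64-byte lines, order 16 fits exactly one line per node, whereas order 64 spans four lines, which is consistent with the slide's observation that a device without a prefetcher favors the smaller order.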
  • 20. ANALYSIS
      • Minimum batch size to match the CPU performance

                                   Order = 64     Order = 16
          dGPU (device memory)     4K queries     2K queries
          dGPU (pinned memory)     N.A.           N.A.
          APU (device memory)      10K queries    4K queries
          APU (pinned memory)      20K queries    16K queries

      • reuse_factor: amortizing the cost of data copies to the GPU

                   90% Queries    100% Queries
          dGPU     15             54
          APU      100            N.A.
  • 21. PROGRAMMABILITY
      GPU:
          typedef global unsigned int g_uint;
          typedef global mynode g_mynode;
          int tid = get_global_id(0);
          int i = 0;
          g_mynode *c = (g_mynode *)root;
          /* find the leaf node */
          while(!c->is_leaf){
              i = 0;  /* restart the scan in each node; this reset is not on the slide */
              while (i < c->num_keys){
                  if(keys[tid] >= ((g_uint *)((intptr_t)root + c->keys))[i])
                      i++;
                  else break;
              }
              /* child links are offsets relative to the start of the buffer */
              c = (g_mynode *)((intptr_t)root + c->ptr + i*sizeof(mynode));
          }
          /* match the key in the leaf node */
          for(i=0; i<c->num_keys; i++){
              if((((g_uint *)((intptr_t)root + c->keys))[i]) == keys[tid]) break;
          }
          /* retrieve the record */
          if(i != c->num_keys)
              records[tid] = ((g_uint *)((intptr_t)root + c->is_leaf + i*sizeof(g_uint)))[0];

      CPU-SSE:
          int i = 0, j;
          node * c = root;
          __m128i vkey = _mm_set1_epi32(key);
          __m128i vnodekey, *vptr;
          short int mask;
          /* find the leaf node */
          while( !c->is_leaf ){
              /* compare the search key against four node keys at a time */
              for(i = 0; i < (c->num_keys-3); i+=4){
                  vptr = (__m128i *)&(c->keys[i]);
                  vnodekey = _mm_load_si128(vptr);
                  mask = _mm_movemask_ps(_mm_cvtepi32_ps(
                             _mm_cmplt_epi32(vkey, vnodekey)));
                  if((mask) & 8) break;
              }
              /* scalar scan from the start of the selected group of four */
              for(j = i; j < c->num_keys; j++){
                  if(key < c->keys[j]) break;
              }
              c = (node *)c->pointers[j];
          }
          /* match the key in the leaf node */
          for (i = 0; i < c->num_keys; i++)
              if (c->keys[i] == key) break;
          /* retrieve the record */
          if (i != c->num_keys)
              return (record *)c->pointers[i];
  • 22. RELATED WORK
      ‒ J. Fix, A. Wilkes, and K. Skadron, "Accelerating Braided B+ Tree Searches on a GPU with CUDA", in Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors: Analysis, Implementation, and Performance, in conjunction with ISCA, 2011
          ‒ The authors report a ~10-fold speedup over a single-threaded, non-SSE CPU implementation, using a discrete NVIDIA GTX 480 GPU (data-copies are not taken into account)
      ‒ C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey, "FAST: fast architecture sensitive tree search on modern CPUs and GPUs", SIGMOD Conference, 2010
          ‒ The authors report ~100M queries per second using a discrete NVIDIA GTX 280 GPU (data-copies are not taken into account)
      ‒ J. Sewall, J. Chhugani, C. Kim, N. Satish, and P. Dubey, "PALM: Parallel, Architecture-Friendly, Latch-Free Modifications to B+ Trees on Multi-Core Processors", Proceedings of the VLDB Endowment (VLDB 2011)
          ‒ Applicable for B+ Tree modifications on the GPU
  • 23. POSSIBILITIES WITH HSA
      ‒ [Figure: a B+ Tree with leaf records d1 to d8 held in memory that is shared by the GPU vector cores and the x86 cores]
      ‒ GPU vector cores: parallel reads for searches
      ‒ x86 cores: serial modifications
      ‒ Graphics Core Next, User-mode Queuing, Shared Virtual Memory
  • 24. SUMMARY
      ‒ The B+ Tree is the fundamental data structure in many RDBMS
          ‒ Accelerating B+ Tree searches is critical
          ‒ Presents significant challenges on discrete GPUs
      ‒ We have accelerated B+ Tree searches by exploiting coarse-grained parallelism on an APU
          ‒ 2.5-fold (avg.) speedup over a 6-thread + SSE CPU implementation
      ‒ Possible next steps
          ‒ HSA + IOMMUv2 would reduce the need to modify the B+ Tree representation
          ‒ Investigate CPU-GPU co-scheduling
          ‒ Investigate modifications on the B+ Tree
  • 25. DISCLAIMER & ATTRIBUTION
      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
      AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
      ATTRIBUTION: © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Phenom, Radeon, Catalyst and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.