Relational Joins on Graphics Processors
              Suman Karumuri
                 Jamie Jablin
Background
Introduction
• Utilizing hardware features of the GPU
  – Massive thread parallelism
  – Fast inter-processor communication
  – High memory bandwidth
  – Coalesced access
Relational Joins
•   Non-indexed nested-loop join (NINLJ)
•   Indexed nested-loop join (INLJ)
•   Sort-merge join (SMJ)
•   Hash join (HJ)
Block Non-indexed nested-loop join (NINLJ)

foreach r in R:
    foreach s in S:
        if condition(r, s):
            output += <r, s>
Block Indexed nested-loop join (INLJ)

foreach r in R:
    foreach s in S.index.lookup(r.f1):
        if condition(r, s):
            output += <r, s>
Hash join (HJ)

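Hr = Hashtable()
foreach r in R:
    Hr.add(r)
    # probe when the table is full, and once more for the final partial table
    if Hr.size() == MAX_MEMORY or r is the last tuple of R:
        foreach s in S:
            if Hr.has(s):
                output += <Hr[s], s>
        Hr.clear()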
Sort-merge join (SMJ)
Sort(R); Sort(S); i, j = 0, 0
# assumes join keys are unique within R and within S
while i < |R| and j < |S|:
    if R[i] == S[j]:
        output += <R[i], S[j]>
        i++; j++
    elif R[i] < S[j]:
        i++
    else:
        j++
Algorithms on GPU
• Tips for algorithm design
  – Use the inherent concurrency.
  – Keep SIMD nature in mind.
  – Algorithms should be side-effect free.
  – Memory properties:
     •   High memory bandwidth.
     •   Coalesced access (for spatial locality)
     •   Cache in local memory (for temporal locality)
     •   Access memory via indices and offsets.
GPU Primitives
Design and Implementation
• A complete set of parallel primitives:
  – Map, scatter, gather, prefix scan, split, and
    sort
     • Low synchronization overhead.
     • Scalable to hundreds of processors.
     • Applicable to joins as well as other relational query
       operators.
Map
Scatter
• Indexed writes to a relation.
Gather
• Performs indexed reads from a relation.
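A minimal sequential sketch of the scatter and gather primitives described above. The function names and the `loc` index array are illustrative assumptions; on the GPU each loop iteration would be one thread's work.

```python
def scatter(r_in, loc):
    """Indexed write: r_out[loc[i]] = r_in[i].

    Assumes loc is a permutation of range(len(r_in)).
    """
    r_out = [None] * len(r_in)
    for i, value in enumerate(r_in):
        r_out[loc[i]] = value
    return r_out


def gather(r_in, loc):
    """Indexed read: r_out[i] = r_in[loc[i]]."""
    return [r_in[loc[i]] for i in range(len(loc))]


r = ["a", "b", "c", "d"]
loc = [2, 0, 3, 1]
scattered = scatter(r, loc)         # ['b', 'd', 'a', 'c']
gathered = gather(scattered, loc)   # reads back the original order
```

Gathering with the same index array undoes the scatter, which is why the two primitives are usually presented as a pair.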
Prefix Scan
• A prefix scan applies a binary operator on the input
  of size n and produces an output of size n.
• Ex: Prefix sum: cumulative sum of all elements to
  the left of the current element.
   – Exclusive (used in paper)
   – Inclusive
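The difference between the exclusive and inclusive variants can be sketched as follows. This is a sequential illustration only; the function names are assumptions, and a GPU implementation would compute the same result in parallel.

```python
from itertools import accumulate
import operator


def inclusive_scan(values, op=operator.add):
    """out[i] combines values[0..i], including the current element."""
    return list(accumulate(values, op))


def exclusive_scan(values, op=operator.add, identity=0):
    """out[i] combines values[0..i-1], i.e. everything to the left of i."""
    out = []
    running = identity
    for v in values:
        out.append(running)
        running = op(running, v)
    return out


data = [3, 1, 7, 0, 4]
print(exclusive_scan(data))  # [0, 3, 4, 11, 11]
print(inclusive_scan(data))  # [3, 4, 11, 11, 15]
```

With counts as input, the exclusive form directly yields start offsets, which is how the split primitive and the result-output scheme use it.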
Split
• Each thread constructs tHist from Rin.
• L[(p-1)*#thread+t] = tHist[t][p]
• Prefix sum: L(i) = sum(L[0…i-1]). Gives the start location of partitions.
• tOffset[t][p] = L[(p-1)*#thread+t]
• Scatter tuples to Rout based on offset.
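The split steps can be sketched sequentially as below, simulating the threads with a loop. The helper name `part_of`, the contiguous assignment of tuples to threads, and the 0-based partition index (so the flat index is `p * num_threads + t` rather than `(p-1)*#thread+t`) are assumptions made for this illustration.

```python
def split(r_in, num_partitions, num_threads, part_of):
    """Stable partitioning via per-thread histograms, prefix sum, and scatter.

    Assumes num_threads >= 1 and part_of(x) in range(num_partitions).
    """
    n = len(r_in)
    chunk = (n + num_threads - 1) // num_threads
    # Each simulated thread t owns a contiguous slice of the input.
    slices = [range(t * chunk, min(n, (t + 1) * chunk))
              for t in range(num_threads)]

    # Step 1: each thread builds its own histogram tHist[t][p].
    t_hist = [[0] * num_partitions for _ in range(num_threads)]
    for t, idxs in enumerate(slices):
        for i in idxs:
            t_hist[t][part_of(r_in[i])] += 1

    # Step 2: flatten partition-major, L[p * num_threads + t] = tHist[t][p].
    flat = [t_hist[t][p]
            for p in range(num_partitions) for t in range(num_threads)]

    # Step 3: exclusive prefix sum gives each (thread, partition) write offset.
    offsets, running = [], 0
    for count in flat:
        offsets.append(running)
        running += count

    # Step 4: each thread scatters its tuples, with no write conflicts.
    r_out = [None] * n
    for t, idxs in enumerate(slices):
        cursor = [offsets[p * num_threads + t] for p in range(num_partitions)]
        for i in idxs:
            p = part_of(r_in[i])
            r_out[cursor[p]] = r_in[i]
            cursor[p] += 1
    return r_out


print(split([5, 2, 7, 4, 1, 6, 3, 0], 2, 3, lambda x: x % 2))
# [2, 4, 6, 0, 5, 7, 1, 3]
```

Because every thread writes to a disjoint range of the output, no locks or atomics are needed.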
Sort
• Bitonic sort
   – Uses sorting networks, O(N log² N).
• Quick sort
   – partition using a random pivot until partition fits in
     local memory
   – Sort each partition using bitonic sort.
   – Partitioning can be parallelized using split.
   – Complexity is O(N logN).
   – 30% faster than bitonic sort in experiments
   – Use Quick sort for sorting
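For reference, a bitonic sorting network can be sketched as below. This is the textbook iterative formulation, run sequentially, and it assumes the input length is a power of two; on a GPU the inner `for i` loop is what runs in parallel, one thread per element.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length list with a bitonic sorting network."""
    a = list(values)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:         # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([7, 3, 0, 5, 6, 1, 4, 2]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The compare-and-swap pattern is fixed in advance and does not depend on the data, which suits SIMD execution and is why it is a natural choice for sorting chunks that already fit in local memory.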
Spatial and Temporal locality
Memory Optimizations
• Coalesced memory access improves memory
  bandwidth utilization (spatial locality)
Local Memory Optimization
• Quick sort
  – Temporal locality
  – Use the bitonic sort to sort each chunk after
    the partitioning step.
Joins on GPGPU
NINLJ on GPU
• Block nested
• Uses Map primitive on both relations
  – Partition R into R’ and S into S’ blocks
    respectively.
  – Create R’ x S’ thread groups
  – A thread in a thread group processes one
    tuple from R’ and matches all tuples from S’.
  – All tuples in S’ are in local cache.
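The blocking scheme above can be sketched sequentially as follows. The block sizes, the helper `blocks`, and the `cached` list standing in for local memory are assumptions of this illustration; each iteration over a block pair corresponds to one thread group, and each `r` to one thread.

```python
def blocks(rel, size):
    """Cut a relation into consecutive blocks of at most `size` tuples."""
    return [rel[i:i + size] for i in range(0, len(rel), size)]


def block_nested_loop_join(R, S, r_block, s_block, condition):
    out = []
    for r_blk in blocks(R, r_block):
        for s_blk in blocks(S, s_block):   # one thread group per (R', S') pair
            cached = list(s_blk)           # stand-in for S' held in local memory
            for r in r_blk:                # one thread per tuple of R'
                for s in cached:           # each thread scans all of S'
                    if condition(r, s):
                        out.append((r, s))
    return out


R = [1, 2, 3, 4, 5]
S = [2, 4, 4, 6]
print(block_nested_loop_join(R, S, 2, 3, lambda r, s: r == s))
# [(2, 2), (4, 4), (4, 4)]
```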
B+ Tree vs CSS Tree
• B+ tree imposes
  – Memory stalls when traversed (no spatial locality)
  – Can’t perform multiple searches (loses temporal
    locality).
• CSS-Tree (Cache optimized search tree)
  – One dimensional array where nodes are indexed.
  – Replaces traversal with computation.
  – Can also perform parallel key lookups.
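The idea of replacing pointer traversal with index computation can be illustrated with a simplified implicit tree. Note the assumption: this sketch stores one key per node in a binary layout, whereas a CSS-tree packs several keys per node to match the cache line; the function names are hypothetical.

```python
def build_implicit_tree(sorted_keys):
    """Lay out sorted keys as a complete binary search tree in a flat array.

    The children of node i live at 2*i + 1 and 2*i + 2, so no pointers are stored.
    """
    n = len(sorted_keys)
    tree = [None] * n
    keys = iter(sorted_keys)

    def fill(node):
        # In-order fill assigns the sorted keys to the tree positions.
        if node < n:
            fill(2 * node + 1)
            tree[node] = next(keys)
            fill(2 * node + 2)

    fill(0)
    return tree


def contains(tree, key):
    """Search by computing child indices instead of following pointers."""
    node = 0
    while node < len(tree):
        if key == tree[node]:
            return True
        node = 2 * node + 1 if key < tree[node] else 2 * node + 2
    return False


tree = build_implicit_tree([1, 3, 5, 7, 9, 11, 13])
print(tree)               # [7, 3, 11, 1, 5, 9, 13]
print(contains(tree, 9))  # True
print(contains(tree, 4))  # False
```

Since each lookup only reads the array, many keys can be searched in parallel without any coordination.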
Indexed Nested Loop Join (INLJ)
• Uses Map primitive on outer relation
• Uses CSS tree for index.
• For each block in outer relation R
  – Start with a root node to find the next level
     • Binary search is shown to be better than sequential search.
  – Go down until you find the data node.
• Upper level nodes are cached in local memory
  since they are frequently accessed.
Sort Merge Join
• Sort the relations R, S using the sort primitive
• Merge phase
   – Break S into chunks (s’) of size M.
   – Find the first and last key values of each chunk s’ and
     partition R into the same number of chunks by those key ranges.
   – Merge all chunks in parallel using map
      • Each thread group handles a pair
      • Each thread compares 1 tuple in R with s’ using binary
        search.
• Chunk size is chosen to fit in local memory.
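A sequential sketch of this merge phase, using plain integer keys as tuples. The function name and `chunk_size` parameter (standing in for M) are assumptions; each chunk pair would be one thread group and each `r` one thread.

```python
from bisect import bisect_left, bisect_right


def sort_merge_join(R, S, chunk_size):
    """Equi-join of integer keys: sort both inputs, then merge chunk by chunk."""
    R, S = sorted(R), sorted(S)
    out = []
    for start in range(0, len(S), chunk_size):
        s_chunk = S[start:start + chunk_size]
        # The matching chunk of R is bounded by the first and last key of s'.
        lo = bisect_left(R, s_chunk[0])
        hi = bisect_right(R, s_chunk[-1])
        for r in R[lo:hi]:                      # one thread per tuple of R
            first = bisect_left(s_chunk, r)     # binary search inside s'
            last = bisect_right(s_chunk, r)
            out.extend((r, s) for s in s_chunk[first:last])
    return out


print(sort_merge_join([4, 1, 3, 3], [3, 5, 1, 3, 2], chunk_size=2))
# [(1, 1), (3, 3), (3, 3), (3, 3), (3, 3)]
```

Unlike the two-pointer merge, each R tuple is handled independently, so duplicate keys on either side are joined correctly without any shared cursor.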
Hash Join
• Uses split primitive on both relations
• Developed a parallel version of radix hash join
  – Partitioning
     • Split R and S into the same number of partitions, so S
       partitions fit into the local memory
  – Matching
     • Choose smaller one of R and S partitions as inner partition to
       be loaded into local memory
     • Larger relation will be used as the outer relation
     • Each tuple from outer relation uses a search on the inner
       relation for matching.
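The partition-then-match structure can be sketched as follows. Assumptions of this illustration: tuples are `(key, payload)` pairs, partitions are formed from the low `bits` bits of the key in a single pass, and a Python dict stands in for the search over the inner partition.

```python
def radix_partition(rel, bits):
    """Group (key, payload) tuples by the low `bits` bits of the key."""
    parts = [[] for _ in range(1 << bits)]
    mask = (1 << bits) - 1
    for tup in rel:
        parts[tup[0] & mask].append(tup)
    return parts


def radix_hash_join(R, S, bits):
    out = []
    for r_part, s_part in zip(radix_partition(R, bits),
                              radix_partition(S, bits)):
        # The smaller partition is the inner one (loaded into local memory).
        r_is_inner = len(r_part) <= len(s_part)
        inner, outer = (r_part, s_part) if r_is_inner else (s_part, r_part)
        table = {}
        for tup in inner:
            table.setdefault(tup[0], []).append(tup)
        for tup in outer:                      # each outer tuple probes the inner
            for match in table.get(tup[0], []):
                out.append((match, tup) if r_is_inner else (tup, match))
    return out


R = [(1, "r1"), (2, "r2"), (5, "r5"), (6, "r6")]
S = [(2, "s2"), (6, "s6a"), (6, "s6b"), (7, "s7")]
print(radix_hash_join(R, S, bits=2))
```

Tuples with equal keys always land in the same partition number, so only partitions with matching indices need to be compared.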
Lock-Free Scheme for Result Output
• Problems
  – Join result size is unknown in advance. The maximum
    possible result size doesn’t fit in memory.
  – Concurrent writes are not atomic.
Lock-Free Scheme for Result Output
• Solution: Three-phase scheme
  – Phase 1: each thread counts the number of join results it generates.
  – Phase 2: compute a prefix sum on the counts to get an
    array of write locations and the total number
    of results generated by the join; host code then
    allocates the output memory on the device.
  – Phase 3: run the join again, writing the outputs.
• The join runs twice. That’s OK, GPUs are fast.
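The count, prefix-sum, and write scheme can be sketched sequentially as below. The function name, the contiguous assignment of R tuples to simulated threads, and the assumption `num_threads >= 1` belong to this illustration, not to the original implementation.

```python
def join_with_prefix_sum_output(R, S, num_threads, condition):
    chunk = (len(R) + num_threads - 1) // num_threads
    slices = [R[t * chunk:(t + 1) * chunk] for t in range(num_threads)]

    # Count: each thread counts its own join results without writing any.
    counts = [sum(1 for r in part for s in S if condition(r, s))
              for part in slices]

    # Prefix sum: exclusive scan yields each thread's write offset and the total.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    out = [None] * total          # output buffer allocated at its exact size

    # Write: rerun the join; each thread fills its own disjoint range.
    for t, part in enumerate(slices):
        pos = offsets[t]
        for r in part:
            for s in S:
                if condition(r, s):
                    out[pos] = (r, s)
                    pos += 1
    return out


print(join_with_prefix_sum_output([1, 2, 3, 4, 5], [2, 4, 4, 6], 2,
                                  lambda r, s: r == s))
# [(2, 2), (4, 4), (4, 4)]
```

No two threads ever write to the same slot, so the scheme needs neither locks nor atomic operations.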
Experimental Results
Hardware Configuration




•   Theoretical Memory bandwidth
     – GPU: 86.4 GB/s
     – CPU: 10.4 GB/s
•   Practical Memory bandwidth
     – GPU: 69.2 GB/s
     – CPU: 5.6 GB/s
Workload
• R and S tables with 2 integer columns.
• SELECT R.rid, S.rid FROM R, S WHERE <predicate>
• SELECT R.rid, S.rid FROM R, S WHERE R.rid=S.rid
• SELECT R.rid, S.rid FROM R, S WHERE
  R.rid<=S.rid<=R.rid + k
• Tested on all combinations:
   – Fix R, Vary S. All values uniform distribution. |R| = 1M
   – Performance impact varying join selectivity. |R| = |S| = 16M
   – Non-uniform distribution of data sizes and also varying join
     selectivity. |R| = |S| = 16M
• Also tested with columns as strings.
Implementation Details on CPU
• Highly optimized primitives and join
  algorithms matching hardware architecture
• Tuned for cache performance.
• Compiled programs using MSVC 8.0 with
  full optimizations.
• Used OpenMP for threading mechanisms.
• 2-6X faster than their sequential
  counterparts.
Implementation Details on GPU
• CUDA parameters
  – Number of thread groups (128)
  – Number of threads for each thread group (64)
  – Block size is 4MB (main memory to device
    memory)
Memory Optimizations Work
Works when join selectivity is varied
Better than in-memory database
CUDA vs. DirectX10
• DirectX10 is difficult to program, because
  the data is stored as textures.
• NINLJ and INLJ have similar performance.
• HJ and SMJ are slower because of texture
  decoding.
• Summary: low level primitives on GPGPU
  are better than graphics primitives on
  GPU.
Criticisms
• Applications of skew handling are unclear.
• Primitives are sufficient to implement the
  given joins, but they do not prove the set
  of primitives to be minimal.
Limitations and future research directions
• Lack of synchronization mechanisms for
  handling read/write conflicts on GPU.
• More primitives.
• More open GPGPU hardware spec for
  optimizations.
• Power consumption on GPU.
• Lack of support for complex data types.
• In-memory database on the GPU.
• Automatic detection of thread groups and
  number of threads using program analysis
  techniques.
Conclusion
• GPU-based primitives and join algorithms
  achieve a speedup of 2-27X over
  optimized CPU-based counterparts.
• NINLJ, 7.0X; INLJ, 6.1X; SMJ, 2.4X; HJ,
  1.9X
References
• Scan Primitives for GPU Computing,
  Sengupta et al.
• wikipedia.org
• monetdb.cwi.nl
Thank You.
Scan Primitives for GPU Computing, Sengupta et al
Skew Handling
• Skew in data results in an imbalanced
  partition size in partitioned-based
  algorithms (SMJ and HJ)
• Solution
  – Identify partitions that do not fit into the local
    memory
  – Decompose partitions into multiple chunks the
    size of local memory
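The decomposition step can be sketched as follows. The function name and the `local_capacity` parameter (the number of tuples that fit in local memory) are assumptions for illustration.

```python
def decompose(partitions, local_capacity):
    """Cut every partition that exceeds local memory into fixed-size chunks."""
    chunks = []
    for part in partitions:
        if len(part) <= local_capacity:
            chunks.append(part)              # already fits, keep as is
        else:
            for i in range(0, len(part), local_capacity):
                chunks.append(part[i:i + local_capacity])
    return chunks


print(decompose([[1, 2], [3, 4, 5, 6, 7], []], local_capacity=2))
# [[1, 2], [3, 4], [5, 6], [7], []]
```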
Implementation Details on GPU
• CUDA parameters
  – Number of threads for each thread group
  – Number of thread groups
• DirectX10
  – Join algorithms implemented using
    programmable pipeline
    • Vertex shader, geometry shader, and pixel shader
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

GPU Join Presentation

  • 1. Relational Joins on Graphics Processors Suman Karumuri Jamie Jablin
  • 3. Introduction • Utilizing hardware features of the GPU – Massive thread parallelism – Fast inter-processor communication – High memory bandwidth – Coalesced access
  • 4. Relational Joins • Non-indexed nested-loop join (NINLJ) • Indexed nested-loop join (INLJ) • Sort-merge join (SMJ) • Hash join (HJ)
  • 5. Block Non-indexed nested-loop join (NINLJ) foreach r in R: foreach s in S: if condition(r,s) output = <r,s>
  • 6. Block Indexed nested-loop join (INLJ) foreach r in R: if S[].has(r.f1) if condition(r,s) output = <r,s>
  • 7. Hash join (HJ) Hr = Hashtable() foreach r in R: Hr.add(r) if Hr.size() == MAX_MEMORY: for s in S: if Hr(s): output s Hr.clear()
  • 8. Sort-merge join (SMJ) Sort(R), Sort(S) ; i, j = 0 while i < |R| && j < |S|: if (R[i] == S[j]) output += R[i] i++ ; j++ elif (R[i] < S[j]) i++ else j++
  • 9. Algorithms on GPU • Tips for algorithm design – Use the inherent concurrency. – Keep SIMD nature in mind. – Algorithms should be side-effect free. – Memory properties: • High memory bandwidth. • Coalesced access (for spatial locality) • Cache in local memory (for temporal locality) • Access memory via indices and offsets.
  • 11. Design and Implementation • A complete set of parallel primitives: – Map, scatter, gather, prefix scan, split, and sort • Low synchronization overhead. • Scalable to hundreds of processors. • Applicable to joins as well as other relational query operators.
  • 12. Map
  • 13. Scatter • Indexed writes to a relation.
  • 14. Gather • Performs indexed reads from a relation.
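The scatter and gather primitives can be illustrated with a minimal sequential sketch. The function names `scatter` and `gather` and the location array `loc` are assumptions made for this illustration; on a GPU each loop iteration would be handled by a separate thread.

```python
def scatter(rin, loc):
    """Indexed write: rout[loc[i]] = rin[i] (loc is assumed to be a permutation)."""
    rout = [None] * len(rin)
    for i, value in enumerate(rin):
        rout[loc[i]] = value
    return rout


def gather(rin, loc):
    """Indexed read: rout[i] = rin[loc[i]]."""
    return [rin[loc[i]] for i in range(len(loc))]


rin = ["a", "b", "c", "d"]
loc = [2, 0, 3, 1]
scattered = scatter(rin, loc)
gathered = gather(rin, loc)
```

Gathering with the same `loc` undoes the scatter, which is why the two primitives are usually presented as a pair.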
  • 15. Prefix Scan • A prefix scan applies a binary operator on the input of size n and produces an output of size n. • Ex: Prefix sum: cumulative sum of all elements to the left of the current element. – Exclusive (used in paper) – Inclusive
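The difference between the exclusive and inclusive variants is easiest to see on a small input. The sketch below is sequential and uses hypothetical helper names (`inclusive_scan`, `exclusive_scan`); it shows the result a parallel scan would produce, not how the GPU computes it.

```python
from itertools import accumulate
import operator


def inclusive_scan(values, op=operator.add):
    """Element i of the result combines values[0..i]."""
    return list(accumulate(values, op))


def exclusive_scan(values, op=operator.add, identity=0):
    """Element i of the result combines values[0..i-1]; element 0 is the identity."""
    if not values:
        return []
    return [identity] + list(accumulate(values, op))[:-1]


data = [3, 1, 4, 1, 5]
inc = inclusive_scan(data)   # [3, 4, 8, 9, 14]
exc = exclusive_scan(data)   # [0, 3, 4, 8, 9]
```

The exclusive form is convenient for computing write positions: if `data` holds per-thread output counts, `exc` holds the first free slot for each thread.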
  • 16. Split
  • 17. Each thread constructs tHist from Rin
  • 18. L[(p-1)*#thread+t] = tHist[t][p]
  • 19. Prefix sum L(i) = sum(L[0…i-1]) Gives the start location of partitions
  • 20. tOffset[t][p] = L[(p-1)*#thread+t]
  • 21. Scatter tuples to Rout based on offset.
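The five split steps on slides 17 to 21 can be sketched sequentially as follows. This is an illustration under stated assumptions: indices are 0-based (so the slide's `(p-1)*#thread+t` becomes `p*num_threads + t`), threads are simulated by loops, tuples are assigned to threads in an interleaved pattern, and `part_of` is a hypothetical function mapping a tuple to its partition number.

```python
from itertools import accumulate


def split(rin, num_partitions, num_threads, part_of):
    n = len(rin)
    # Assumption: thread t handles tuples t, t + num_threads, t + 2*num_threads, ...
    work = [range(t, n, num_threads) for t in range(num_threads)]

    # Step 1: each thread builds its own histogram tHist[t][p].
    t_hist = [[0] * num_partitions for _ in range(num_threads)]
    for t, idxs in enumerate(work):
        for i in idxs:
            t_hist[t][part_of(rin[i])] += 1

    # Step 2: lay the histograms out partition-major: L[p*T + t] = tHist[t][p].
    L = [t_hist[t][p] for p in range(num_partitions) for t in range(num_threads)]

    # Step 3: exclusive prefix sum gives the start location of every (p, t) slot.
    starts = [0] + list(accumulate(L))[:-1]

    # Step 4: tOffset[t][p] = L[p*T + t] after the scan.
    t_offset = [[starts[p * num_threads + t] for p in range(num_partitions)]
                for t in range(num_threads)]

    # Step 5: each thread scatters its tuples; no two threads share a slot.
    rout = [None] * n
    for t, idxs in enumerate(work):
        for i in idxs:
            p = part_of(rin[i])
            rout[t_offset[t][p]] = rin[i]
            t_offset[t][p] += 1
    return rout


out = split([5, 2, 7, 0, 3, 6, 1, 4], 2, 3, lambda x: x // 4)
```

Because every thread writes only inside the range reserved for it by the prefix sum, the scatter needs no locks or atomic operations.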
  • 22. Sort • Bitonic sort – Uses sorting networks, O(N log² N). • Quick sort – Partition using a random pivot until the partition fits in local memory. – Sort each partition using bitonic sort. – Partitioning can be parallelized using split. – Complexity is O(N log N). – 30% faster than bitonic sort in experiments. – Use quick sort for sorting.
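A bitonic sorting network can be sketched as below. This is a generic textbook formulation written sequentially, not the presenters' implementation; it assumes the input length is a power of two, and the function name `bitonic_sort` is chosen for this illustration. Within one `(k, j)` stage every compare-and-swap is independent, which is what makes the network map well onto SIMD hardware.

```python
def bitonic_sort(values):
    a = list(values)
    n = len(a)
    # Assumption: n is a power of two (pad the input otherwise).
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):  # on a GPU: one thread per i
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[partner] if ascending else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


sorted_values = bitonic_sort([7, 3, 6, 0, 5, 1, 4, 2])
```

There are O(log² N) stages with N/2 comparisons each, which gives the O(N log² N) total work quoted on the slide.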
  • 24. Memory Optimizations • Coalesced memory improves memory bandwidth utilization (spatial locality)
  • 25. Local Memory Optimization • Quick sort – Temporal locality – Use the bitonic sort to sort each chunk after the partitioning step.
  • 27. NINLJ on GPU • Block nested • Uses Map primitive on both relations – Partition R into R’ and S into S’ blocks respectively. – Create R’ x S’ thread groups – A thread in a thread group processes one tuple from R’ and matches all tuples from S’. – All tuples in S’ are in local cache.
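The blocking scheme can be sketched sequentially as follows. The names `block_nested_loop_join`, `block_r`, `block_s`, and `predicate` are assumptions for this illustration; each (R', S') pair corresponds to one thread group, and the loop over `r_block` corresponds to the threads inside that group.

```python
def block_nested_loop_join(R, S, block_r, block_s, predicate):
    results = []
    for r0 in range(0, len(R), block_r):
        r_block = R[r0:r0 + block_r]           # R' block
        for s0 in range(0, len(S), block_s):
            s_block = S[s0:s0 + block_s]       # S' block, held in local memory on a GPU
            # One thread group per (R', S') pair.
            for r in r_block:                  # one thread per tuple of R'
                for s in s_block:              # that thread scans all of S'
                    if predicate(r, s):
                        results.append((r, s))
    return results


pairs = block_nested_loop_join([1, 2, 3, 4, 5], [4, 5, 6, 2], 2, 3,
                               lambda r, s: r == s)
```

Any predicate works here, which is the main reason to use NINLJ: it does not require an equality condition or an index.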
  • 28. B+ Tree vs CSS Tree • B+ tree imposes – Memory stalls when traversed (no spatial locality) – Can’t perform multiple searches (loses temporal locality). • CSS-Tree (Cache optimized search tree) – One dimensional array where nodes are indexed. – Replaces traversal with computation. – Can also perform parallel key lookups.
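The idea of replacing pointer traversal with index computation can be shown with a deliberately simplified stand-in: a binary search tree stored in a flat array, where the children of node `i` sit at `2*i + 1` and `2*i + 2`. This is not a full CSS-tree (which packs several keys into each node); the layout and the names `build_implicit_tree` and `contains` are assumptions made for this sketch.

```python
def build_implicit_tree(sorted_keys):
    """Place sorted keys into an array so that an in-order walk yields them in order."""
    n = len(sorted_keys)
    tree = [None] * n
    keys = iter(sorted_keys)

    def fill(node):
        if node < n:
            fill(2 * node + 1)        # left subtree gets the smaller keys
            tree[node] = next(keys)
            fill(2 * node + 2)        # right subtree gets the larger keys

    fill(0)
    return tree


def contains(tree, key):
    node = 0
    while node < len(tree):
        if key == tree[node]:
            return True
        # The next node is computed arithmetically; no child pointer is loaded.
        node = 2 * node + 1 if key < tree[node] else 2 * node + 2
    return False


tree = build_implicit_tree([1, 3, 5, 7, 9, 11])
```

Because lookups only read the array, many keys can be searched in parallel without any coordination between threads.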
  • 29. Indexed Nested Loop Join (INLJ) • Uses Map primitive on outer relation • Uses CSS tree for index. • For each block in outer relation R – Start with a root node to find the next level • Binary search is shown to be better than sequential search. – Go down until you find the data node. • Upper level nodes are cached in local memory since they are frequently accessed.
  • 30. Sort Merge Join • Sort the relations R, S using the sort primitive • Merge phase – Break S into chunks (s’) of size M. – Find first and last key values of each chunk in s’ and partition R into those many chunks. – Merge all chunks in parallel using map • Each thread group handles a pair • Each thread compares 1 tuple in R with s’ using binary search. • Chunk size is chosen to fit in local memory.
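The chunked merge phase can be sketched as below. This is a sequential illustration that assumes relations are plain lists of integer keys and uses Python's `bisect` module for the binary searches; `sort_merge_join` and `chunk_size` are names chosen for the example, with `chunk_size` playing the role of M.

```python
from bisect import bisect_left, bisect_right


def sort_merge_join(R, S, chunk_size):
    R, S = sorted(R), sorted(S)                 # sort phase
    results = []
    for s0 in range(0, len(S), chunk_size):
        s_chunk = S[s0:s0 + chunk_size]         # s', sized to fit in local memory
        # Matching part of R: keys between the first and last key of the chunk.
        lo = bisect_left(R, s_chunk[0])
        hi = bisect_right(R, s_chunk[-1])
        # One thread group per (R range, s') pair; one thread per tuple of R.
        for r in R[lo:hi]:
            first = bisect_left(s_chunk, r)     # binary search inside s'
            last = bisect_right(s_chunk, r)
            results.extend((r, s) for s in s_chunk[first:last])
    return results


R = [5, 1, 3, 3, 8]
S = [3, 3, 3, 5, 9, 1]
joined = sort_merge_join(R, S, chunk_size=2)
```

Duplicate keys that straddle a chunk boundary are still handled: the matching range of R overlaps both chunks, and each S tuple lives in exactly one chunk, so every pair is produced once.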
  • 31. Hash Join • Uses split primitive on both relations • Developed a parallel version of radix hash join – Partitioning • Split R and S into the same number of partitions, so S partitions fit into the local memory – Matching • Choose smaller one of R and S partitions as inner partition to be loaded into local memory • Larger relation will be used as the outer relation • Each tuple from outer relation uses a search on the inner relation for matching.
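The partition-then-match structure can be sketched as follows. Several details are assumptions for this illustration: tuples are `(key, payload)` pairs, partitioning uses the low `bits` bits of the key in a single pass, and the search on the inner partition is done with a Python dictionary rather than whatever search structure the original implementation uses.

```python
def radix_partition(rel, bits):
    """Split a relation on the low `bits` bits of the key."""
    mask = (1 << bits) - 1
    parts = [[] for _ in range(1 << bits)]
    for key, payload in rel:
        parts[key & mask].append((key, payload))
    return parts


def radix_hash_join(R, S, bits=2):
    results = []
    # Both relations get the same number of partitions, so partition i of R
    # can only match partition i of S.
    for r_part, s_part in zip(radix_partition(R, bits), radix_partition(S, bits)):
        r_is_inner = len(r_part) <= len(s_part)
        inner, outer = (r_part, s_part) if r_is_inner else (s_part, r_part)
        table = {}                               # inner partition, "in local memory"
        for key, payload in inner:
            table.setdefault(key, []).append(payload)
        for key, payload in outer:               # one thread per outer tuple
            for match in table.get(key, []):
                results.append((match, payload) if r_is_inner else (payload, match))
    return results


R = [(1, "r1"), (2, "r2"), (5, "r5"), (6, "r6")]
S = [(2, "s2"), (5, "s5a"), (5, "s5b"), (7, "s7")]
matches = radix_hash_join(R, S)
```

Choosing the smaller partition as the inner side keeps the structure that must fit in local memory as small as possible.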
  • 32. Lock-Free Scheme for Result Output • Problems – Unknown join result size. Max size of joins doesn’t fit in memory. – Concurrent writes are not atomic.
  • 33. Lock-Free Scheme for Result Output • Solution: Three-phase scheme – Each thread counts the number of join results. – Compute a prefix sum on the counts to get an array of write locations and the total number of results generated by the join. – Host code allocates memory on device. – Run join again with outputs. • Run joins twice. That’s OK, GPUs are fast.
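The count, scan, and write phases can be sketched sequentially as follows. Threads are simulated by loops, and the function name, the interleaved assignment of R tuples to threads, and the `predicate` argument are assumptions made for this illustration.

```python
from itertools import accumulate


def join_with_preallocated_output(R, S, predicate, num_threads):
    # Assumption: thread t handles R[t], R[t + num_threads], ...
    work = [R[t::num_threads] for t in range(num_threads)]

    # Phase 1: every thread only counts its results; nothing is written.
    counts = [sum(1 for r in rs for s in S if predicate(r, s)) for rs in work]

    # Phase 2: an exclusive prefix sum turns counts into write offsets,
    # and the grand total gives the exact output size.
    offsets = [0] + list(accumulate(counts))[:-1]
    total = sum(counts)

    # Host side: allocate exactly `total` slots, then run the join again.
    out = [None] * total
    for t, rs in enumerate(work):
        pos = offsets[t]               # private write cursor of thread t
        for r in rs:
            for s in S:
                if predicate(r, s):
                    out[pos] = (r, s)
                    pos += 1
    return out


result = join_with_preallocated_output([1, 2, 3, 4], [2, 3, 3, 5],
                                       lambda r, s: r == s, num_threads=3)
```

Each thread writes into a disjoint range of `out`, so no locks or atomic writes are needed; the price is evaluating the join predicate twice.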
  • 35. Hardware Configuration • Theoretical Memory bandwidth – GPU: 86.4 GB/s – CPU: 10.4 GB/s • Practical Memory bandwidth – GPU: 69.2 GB/s – CPU: 5.6 GB/s
  • 36. Workload • R and S tables with 2 integer columns. • SELECT R.rid, S.rid FROM R, S WHERE <predicate> • SELECT R.rid, S.rid FROM R, S WHERE R.rid=S.rid • SELECT R.rid, S.rid FROM R, S WHERE R.rid<=S.rid<=R.rid + k • Tested on all combinations: – Fix R, vary S. All values uniform distribution. |R| = 1M – Performance impact of varying join selectivity. |R| = |S| = 16M – Non-uniform distribution of data sizes and also varying join selectivity. |R| = |S| = 16M • Also tested with columns as strings.
  • 37. Implementation Details on CPU • Highly optimized primitives and join algorithms matching hardware architecture • Tuned for cache performance. • Compiled programs using MSVC 8.0 with full optimizations. • Used OpenMP for threading mechanisms. • 2-6X faster than their sequential counterparts.
  • 38. Implementation Details on GPU • CUDA parameters – Number of thread groups (128) – Number of threads for each thread group (64) – Block size is 4MB (main memory to device memory)
  • 42. Works when join selectivity is varied
  • 44. CUDA vs. DirectX10 • DirectX10 is difficult to program, because the data is stored as textures. • NINLJ and INLJ have similar performance. • HJ and SMJ are slower because of texture decoding. • Summary: low-level primitives on GPGPU are better than graphics primitives on GPU.
  • 45. Criticisms • Applications of skew handling are unclear. • Primitives are sufficient to implement the given joins, but they do not prove the set of primitives to be minimal.
  • 46. Limitations and future research directions • Lack of synchronization mechanisms for handling read/write conflicts on GPU. • More primitives. • More open GPGPU hardware spec for optimizations. • Power consumption on GPU. • Lack of support for complex data types. • On GPU in-memory database. • Automatic detection of thread groups and number of threads using program analysis techniques.
  • 47. Conclusion • GPU-based primitives and join algorithms achieve a speedup of 2-27X over optimized CPU-based counterparts. • NINLJ, 7.0X; INLJ, 6.1X; SMJ, 2.4X; HJ, 1.9X
  • 48. References • Scan Primitives for GPU Computing, Sengupta et al • wikipedia.org • monetdb.cwi.nl
  • 50. Scan Primitives for GPU Computing, Sengupta et al
  • 51. Skew Handling • Skew in data results in an imbalanced partition size in partitioned-based algorithms (SMJ and HJ) • Solution – Identify partitions that do not fit into the local memory – Decompose partitions into multiple chunks the size of local memory
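The decomposition step can be sketched in a few lines. The function name `decompose_partitions` and the `capacity` parameter (the number of tuples that fit in local memory) are assumptions for this illustration.

```python
def decompose_partitions(partitions, capacity):
    """Break any partition larger than `capacity` into capacity-sized chunks."""
    chunks = []
    for part in partitions:
        if len(part) <= capacity:
            chunks.append(part)                  # already fits in local memory
        else:
            chunks.extend(part[i:i + capacity]
                          for i in range(0, len(part), capacity))
    return chunks


chunks = decompose_partitions([[1, 2], [3, 4, 5, 6, 7], []], capacity=2)
```

After decomposition every unit of work fits in local memory, so a skewed key distribution costs extra chunks rather than spilling a single oversized partition.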
  • 52. Implementation Details on GPU • CUDA parameters – Number of threads for each thread group – Number of thread groups • DirectX10 – Join algorithms implemented using programmable pipeline • Vertex shader, geometry shader, and pixel shader