SlideShare a Scribd company logo
1 of 34
Accelerating Foreign-Key Joins using
  Asymmetric Memory Channels

  Holger Pirk     Stefan Manegold     Martin Kersten
  holger@cwi.nl     manegold@cwi.nl      mk@cwi.nl
Why?
• Trivia: Joins   are important

• But: Many    Joins are (Indexed) Foreign Key Joins

• Important       example: Data Warehouse Star Joins

  • Large   Fact Table (many Gigabytes) with Foreign Keys on

  • Small   Dimension Tables (hundreds of Megabytes)

  • Joins   are Performed for Projection or Condition Evaluation
A History of In Memory Join Processing
       (and incidentally of this talk)
• Foreign-Key   Joins on CPUs

 • Naïve   (Indexed Hash) Foreign-Key Joins

 • Clustered    Foreign-Key Joins
                                              t
• Foreign-Key   Joins on GPGPUs

 • Clustered    Foreign-Key Joins

 • Naïve   (Asymmetric) Foreign-Key Joins
Premises for this talk

• Focus   is on Indexed Foreign Key Joins

• We   assume an unclustered Foreign Key Index Column

• i.e., Pointers   to Matching Tuples of the Target Table
Foreign-Key Joins on CPUs
The Problem: Bandwidth Bound CPUs
Core 1      Core 2    ...    Core 4




         Memory Controller            • Bottleneck for almost all
                                       relational (Memory
                    8.5 GB/s
                                       Resident) DBMS operations
             8.5 GB/s
                                      • Valuable Target   for
Memory       Memory          Memory    Optimization
Naïve Foreign-Key Joins on CPUs
Naïve Foreign-Key Joins on CPUs

•Foreign-Key Index is scanned sequentially

•Target Table is randomly accessed
Naïve Foreign-Key Joins on CPUs



• Optimization Targets:

  • Foreign-Key   Index is scanned sequentially

  • Index   Cache lines accessed once, i.e., not much to optimize
Naïve Foreign-Key Joins on CPUs

• Large Target Tables   cause heavy Cache Thrashing

• Assume    a (theoretical) 1-Slot Cache with 2-Value Cache Lines

• How   many Cache Misses do you get?




                                           }
                                           }
    • 80%   of are Capacitive Misses         6        6
• Target Table   Lookups are rewarding optimization Targets
Clustered Foreign-Key Joins on CPUs
Clustered Foreign-Key Joins on CPUs

• Require   a Value-partitioned Foreign-Key Index

• Foreign-Key    Index Partitions are scanned



• Target Table   Random Access Locality is improved
Clustered Foreign-Key Joins on CPUs

    • Require   a Value-partitioned Foreign-Key Index

    • Foreign-Key    Index Partitions are scanned

[                                   ][                    ]
    • Target Table   Random Access Locality is improved

            [        ][     ]
Clustered Foreign-Key Joins on CPUs

• What    is the number of Target Table Cache Misses?

• So   Cache behavior is greatly improved
                             [       ][       ]
• Also: (disjoint)   Partitions can be processed in parallel
                             }
  • Parallelism              }   1        1
                  is generally good for GPUs

• But: The   Index has to be Value (e.g., Radix)-partitioned first
Clustered Foreign-Key Joins on CPUs
                                 6.0
                                                        Unpartitioned Join (CPU)
    Processing Time in seconds

                                                        Partition (CPU)
                                                        Join (CPU)
                                 4.5
                                                        Radix Join (CPU)


                                 3.0



                                                                                 about 30 %
                                                                                 (consistent with [Blanas11])
                                 1.5




                                  0
                                       1        10           100          1000              10000

                                           Number of Partitions in Clustering Step
                                               (Target Table Size: 64M)
Clustered Foreign-Key Joins on CPUs
                                25.00

                                          CPU          GPU
    Partition time in Seconds



                                                                        20.512
                                18.75




                                12.50

                                                   11.040


                                 6.25
                                                             6.492860



                                        1.695271
                                   0

                                         256 Clusters        65536 Clusters
Foreign-Key Joins on GPUs
GPU Architecture Overview
    Processing Unit 1                                                  Processing Unit 16




                                  Instruction Dispatcher




                                                                                                        Instruction Dispatcher
     Core   Core   Core    Core                                            Core   Core   Core    Core

     Core   Core   Core    Core                                            Core   Core   Core    Core

     Core   Core   Core    Core
                                                            .....          Core   Core   Core    Core

     Core   Core   Core    Core                                            Core   Core   Core    Core

            Local Memory                                                          Local Memory



Expensive Atomic Operations
                                                           Global Memory
The Problem: Bandwidth Bound GPUs

  Core 1      Core 2    ...    Core 4                     GPU        GPU
                                             PCI-E x 16     177.4
                                               (8 GB/s)      GB/s

                                                 I/O
           Memory Controller                     Hub
                                       QPI
                                   (25.6 GB/s)
• PCI-EBottleneck is a big
                 8.5 GB/s
 Problem for GPUs
          8.5 GB/s

                                                       Graphics     Graphics
  Memory       Memory          Memory                  Memory       Memory
Clustered Foreign-Key Joins on GPUs
Clustered Foreign-Key Joins on GPUs



• Port   of the Radix-Join from CPU to GPU [He08]

• Improved    Cache Performance is Important on GPUs too

• Allows   efficient parallel processing of Partitions
Clustered Foreign-Key Joins on GPUs
                               6.0
                                         Join (GPU)
                                         Partition (CPU)
  Processing Time in seconds




                               4.5
                                         Radix Join (CPU/GPU)



                               3.0



                                                                                           about 11 %
                               1.5




                                                                                           about 70 %
                                0
                                     1        10           100           1000      10000

                                         Number of Partitions in Clustering Step
Naïve Foreign-Key Joins on GPUs
Naïve Foreign-Key Joins on GPUs


• Primary   Bottleneck is PCI-E, not GPU Memory

• Idea: Accelerate   FK Joins by exploiting the Strength of GPUs:

  • Random Access      to target table

• Limit   the PCI-E traffic to a minimum:

  • Stream    only the Index
Asymmetric Foreign-Key Joins on GPUs


• Primary   Bottleneck is PCI-E, not GPU Memory

• Idea: Accelerate   FK Joins by exploiting the Strength of GPUs:

  • Random Access      to target table

• Limit   the PCI-E traffic to a minimum:

  • Stream    only the Index
Asymmetric Foreign-Key Joins on GPUs
                                6.0



                                          Join (GPU)
   Processing Time in seconds




                                4.5
                                          Partition (CPU)
                                          Radix Join (CPU/GPU)

                                3.0




                                1.5




                                 0
                                      1        10           100           1000      10000

                                          Number of Partitions in Clustering Step
Asymmetric Foreign-Key Joins on GPUs
                                6.0
                                          Unpartitioned Join (GPU)
                                          Join (GPU)
   Processing Time in seconds




                                4.5
                                          Partition (CPU)
                                          Radix Join (CPU/GPU)

                                3.0




                                1.5
                                                                                 about factor 3.5


                                 0
                                      1        10           100           1000              10000

                                          Number of Partitions in Clustering Step
Asymmetric Foreign-Key Joins on GPUs



• Message: Clustering   improves performance on GPUs

• But: Cluster   Costs don’t pay off in Cache Friendliness

• But   what about parallel processing of Partitions?
But what about Parallelization
                             1.00

                                                                              Total Costs
                                                                              Data transfer Costs
Processing time in Seconds




                             0.75


                                                                                       about 41 %
                                                                                       (not enough)
                             0.50




                             0.25




                               0
                                    1                 10                100                   1000

                                    Number of Utilized Cores per Processor (Work Group Size)
Foreign-Key Joins on GPUs vs CPUs
                               1.100
                                            Intel i7 860
  Processing Time in Seconds


                                            ATI Radeon HD 5850
                               0.825
                                            PCI-E Data Transfer Costs



                               0.550




                               0.275




                                  0
                                       1K    10K      100K     1000K    10000K   100000K

                                                      Dimension Table Size
Foreign-Key Joins on GPUs vs CPUs
                               1.100
                                            Intel i7 860
  Processing Time in Seconds


                                            ATI Radeon HD 5850 (processing only)
                               0.825
                                            PCI-E Data Transfer Costs



                               0.550




                               0.275




                                  0
                                       1K    10K      100K    1000K    10000K   100000K

                                                      Dimension Table Size
Conclusion



• Unclustered   FK-Joins Cause Heavy Cache Thrashing

• Thrashing   can be offloaded to fast GPU memory

• PCI-E   Bandwidth can be saturated without Clustering

• Naïve   FK-Joins on GPUs outperform Clustered Joins on CPUs
Thank you for Listening
Questions?   Feedback?

More Related Content

More from PlanetData Network of Excellence

On Leveraging Crowdsourcing Techniques for Schema Matching Networks
On Leveraging Crowdsourcing Techniques for Schema Matching NetworksOn Leveraging Crowdsourcing Techniques for Schema Matching Networks
On Leveraging Crowdsourcing Techniques for Schema Matching NetworksPlanetData Network of Excellence
 
Towards Enabling Probabilistic Databases for Participatory Sensing
Towards Enabling Probabilistic Databases for Participatory SensingTowards Enabling Probabilistic Databases for Participatory Sensing
Towards Enabling Probabilistic Databases for Participatory SensingPlanetData Network of Excellence
 
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstreamDemo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstreamPlanetData Network of Excellence
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingPlanetData Network of Excellence
 
Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...
Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...
Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...PlanetData Network of Excellence
 
Linking Smart Cities Datasets with Human Computation: the case of UrbanMatch
Linking Smart Cities Datasets with Human Computation: the case of UrbanMatchLinking Smart Cities Datasets with Human Computation: the case of UrbanMatch
Linking Smart Cities Datasets with Human Computation: the case of UrbanMatchPlanetData Network of Excellence
 
SciQL, Bridging the Gap between Science and Relational DBMS
SciQL, Bridging the Gap between Science and Relational DBMSSciQL, Bridging the Gap between Science and Relational DBMS
SciQL, Bridging the Gap between Science and Relational DBMSPlanetData Network of Excellence
 
Scalable Nonmonotonic Reasoning over RDF Data Using MapReduce
Scalable Nonmonotonic Reasoning over RDF Data Using MapReduceScalable Nonmonotonic Reasoning over RDF Data Using MapReduce
Scalable Nonmonotonic Reasoning over RDF Data Using MapReducePlanetData Network of Excellence
 
Evolution of Workflow Provenance Information in the Presence of Custom Infere...
Evolution of Workflow Provenance Information in the Presence of Custom Infere...Evolution of Workflow Provenance Information in the Presence of Custom Infere...
Evolution of Workflow Provenance Information in the Presence of Custom Infere...PlanetData Network of Excellence
 
Towards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of FactsTowards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of FactsPlanetData Network of Excellence
 
Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...
Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...
Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...PlanetData Network of Excellence
 
Adaptive Semantic Data Management Techniques for Federations of Endpoints
Adaptive Semantic Data Management Techniques for Federations of EndpointsAdaptive Semantic Data Management Techniques for Federations of Endpoints
Adaptive Semantic Data Management Techniques for Federations of EndpointsPlanetData Network of Excellence
 

More from PlanetData Network of Excellence (20)

On Leveraging Crowdsourcing Techniques for Schema Matching Networks
On Leveraging Crowdsourcing Techniques for Schema Matching NetworksOn Leveraging Crowdsourcing Techniques for Schema Matching Networks
On Leveraging Crowdsourcing Techniques for Schema Matching Networks
 
Towards Enabling Probabilistic Databases for Participatory Sensing
Towards Enabling Probabilistic Databases for Participatory SensingTowards Enabling Probabilistic Databases for Participatory Sensing
Towards Enabling Probabilistic Databases for Participatory Sensing
 
Privacy-Preserving Schema Reuse
Privacy-Preserving Schema ReusePrivacy-Preserving Schema Reuse
Privacy-Preserving Schema Reuse
 
Pay-as-you-go Reconciliation in Schema Matching Networks
Pay-as-you-go Reconciliation in Schema Matching NetworksPay-as-you-go Reconciliation in Schema Matching Networks
Pay-as-you-go Reconciliation in Schema Matching Networks
 
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstreamDemo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstream
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...
Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...
Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...
 
Linking Smart Cities Datasets with Human Computation: the case of UrbanMatch
Linking Smart Cities Datasets with Human Computation: the case of UrbanMatchLinking Smart Cities Datasets with Human Computation: the case of UrbanMatch
Linking Smart Cities Datasets with Human Computation: the case of UrbanMatch
 
SciQL, Bridging the Gap between Science and Relational DBMS
SciQL, Bridging the Gap between Science and Relational DBMSSciQL, Bridging the Gap between Science and Relational DBMS
SciQL, Bridging the Gap between Science and Relational DBMS
 
CLODA: A Crowdsourced Linked Open Data Architecture
CLODA: A Crowdsourced Linked Open Data ArchitectureCLODA: A Crowdsourced Linked Open Data Architecture
CLODA: A Crowdsourced Linked Open Data Architecture
 
Scalable Nonmonotonic Reasoning over RDF Data Using MapReduce
Scalable Nonmonotonic Reasoning over RDF Data Using MapReduceScalable Nonmonotonic Reasoning over RDF Data Using MapReduce
Scalable Nonmonotonic Reasoning over RDF Data Using MapReduce
 
Data and Knowledge Evolution
Data and Knowledge Evolution  Data and Knowledge Evolution
Data and Knowledge Evolution
 
Evolution of Workflow Provenance Information in the Presence of Custom Infere...
Evolution of Workflow Provenance Information in the Presence of Custom Infere...Evolution of Workflow Provenance Information in the Presence of Custom Infere...
Evolution of Workflow Provenance Information in the Presence of Custom Infere...
 
Access Control for RDF graphs using Abstract Models
Access Control for RDF graphs using Abstract ModelsAccess Control for RDF graphs using Abstract Models
Access Control for RDF graphs using Abstract Models
 
Arrays in Databases, the next frontier?
Arrays in Databases, the next frontier?Arrays in Databases, the next frontier?
Arrays in Databases, the next frontier?
 
Abstract Access Control Model for Dynamic RDF Datasets
Abstract Access Control Model for Dynamic RDF DatasetsAbstract Access Control Model for Dynamic RDF Datasets
Abstract Access Control Model for Dynamic RDF Datasets
 
Towards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of FactsTowards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of Facts
 
Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...
Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...
Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...
 
Heuristic based Query Optimisation for SPARQL
Heuristic based Query Optimisation for SPARQLHeuristic based Query Optimisation for SPARQL
Heuristic based Query Optimisation for SPARQL
 
Adaptive Semantic Data Management Techniques for Federations of Endpoints
Adaptive Semantic Data Management Techniques for Federations of EndpointsAdaptive Semantic Data Management Techniques for Federations of Endpoints
Adaptive Semantic Data Management Techniques for Federations of Endpoints
 

Recently uploaded

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Recently uploaded (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

Accelerating Foreign-Key Joins using Asymmetric Memory Channels

  • 1. Accelerating Foreign-Key Joins using Asymmetric Memory Channels Holger Pirk Stefan Manegold Martin Kersten holger@cwi.nl manegold@cwi.nl mk@cwi.nl
  • 2. Why? • Trivia: Joins are important • But: Many Joins are (Indexed) Foreign Key Joins • Important example: Data Warehouse Star Joins • Large Fact Table (many Gigabytes) with Foreign Keys on • Small Dimension Tables (hundreds of Megabytes) • Joins are Performed for Projection or Condition Evaluation
  • 3. A History of In Memory Join Processing (and incidentally of this talk) • Foreign-Key Joins on CPUs • Naïve (Indexed Hash) Foreign-Key Joins • Clustered Foreign-Key Joins t • Foreign-Key Joins on GPGPUs • Clustered Foreign-Key Joins • Naïve (Asymmetric) Foreign-Key Joins
  • 4. Premises for this talk • Focus is on Indexed Foreign Key Joins • We assume an unclustered Foreign Key Index Column • i.e., Pointers to Matching Tuples of the Target Table
  • 6. The Problem: Bandwidth Bound CPUs Core 1 Core 2 ... Core 4 Memory Controller • Bottleneck for almost all relational (Memory 8.5 GB/s Resident) DBMS operations 8.5 GB/s • Valuable Target for Memory Memory Memory Optimization
  • 8. Naïve Foreign-Key Joins on CPUs •Foreign-Key Index is scanned sequentially •Target Table is randomly accessed
  • 9. Naïve Foreign-Key Joins on CPUs • Optimization Targets: • Foreign-Key Index is scanned sequentially • Index Cache lines accessed once, i.e., not much to optimize
  • 10. Naïve Foreign-Key Joins on CPUs • Large Target Tables cause heavy Cache Thrashing • Assume a (theoretical) 1-Slot Cache with 2-Value Cache Lines • How many Cache Misses do you get? } } • 80% of are Capacitive Misses 6 6 • Target Table Lookups are rewarding optimization Targets
  • 12. Clustered Foreign-Key Joins on CPUs • Require a Value-partitioned Foreign-Key Index • Foreign-Key Index Partitions are scanned • Target Table Random Access Locality is improved
  • 13. Clustered Foreign-Key Joins on CPUs • Require a Value-partitioned Foreign-Key Index • Foreign-Key Index Partitions are scanned [ ][ ] • Target Table Random Access Locality is improved [ ][ ]
  • 14. Clustered Foreign-Key Joins on CPUs • What is the number of Target Table Cache Misses? • So Cache behavior is greatly improved [ ][ ] • Also: (disjoint) Partitions can be processed in parallel } • Parallelism } 1 1 is generally good for GPUs • But: The Index has to be Value (e.g., Radix)-partitioned first
  • 15. Clustered Foreign-Key Joins on CPUs 6.0 Unpartitioned Join (CPU) Processing Time in seconds Partition (CPU) Join (CPU) 4.5 Radix Join (CPU) 3.0 about 30 % (consistent with [Blanas11]) 1.5 0 1 10 100 1000 10000 Number of Partitions in Clustering Step (Target Table Size: 64M)
  • 16. Clustered Foreign-Key Joins on CPUs 25.00 CPU GPU Partition time in Seconds 20.512 18.75 12.50 11.040 6.25 6.492860 1.695271 0 256 Clusters 65536 Clusters
  • 18. GPU Architecture Overview Processing Unit 1 Processing Unit 16 Instruction Dispatcher Instruction Dispatcher Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core ..... Core Core Core Core Core Core Core Core Core Core Core Core Local Memory Local Memory Expensive Atomic Operations Global Memory
  • 19. The Problem: Bandwidth Bound GPUs Core 1 Core 2 ... Core 4 GPU GPU PCI-E x 16 177.4 (8 GB/s) GB/s I/O Memory Controller Hub QPI (25.6 GB/s) • PCI-EBottleneck is a big 8.5 GB/s Problem for GPUs 8.5 GB/s Graphics Graphics Memory Memory Memory Memory Memory
  • 21. Clustered Foreign-Key Joins on GPUs • Port of the Radix-Join from CPU to GPU [He08] • Improved Cache Performance is Important on GPUs too • Allows efficient parallel processing of Partitions
  • 22. Clustered Foreign-Key Joins on GPUs 6.0 Join (GPU) Partition (CPU) Processing Time in seconds 4.5 Radix Join (CPU/GPU) 3.0 about 11 % 1.5 about 70 % 0 1 10 100 1000 10000 Number of Partitions in Clustering Step
  • 24. Naïve Foreign-Key Joins on GPUs • Primary Bottleneck is PCI-E, not GPU Memory • Idea: Accelerate FK Joins by exploiting the Strength of GPUs: • Random Access to target table • Limit the PCI-E traffic to a minimum: • Stream only the Index
  • 25. Asymmetric Foreign-Key Joins on GPUs • Primary Bottleneck is PCI-E, not GPU Memory • Idea: Accelerate FK Joins by exploiting the Strength of GPUs: • Random Access to target table • Limit the PCI-E traffic to a minimum: • Stream only the Index
  • 26. Asymmetric Foreign-Key Joins on GPUs 6.0 Join (GPU) Processing Time in seconds 4.5 Partition (CPU) Radix Join (CPU/GPU) 3.0 1.5 0 1 10 100 1000 10000 Number of Partitions in Clustering Step
  • 27. Asymmetric Foreign-Key Joins on GPUs 6.0 Unpartitioned Join (GPU) Join (GPU) Processing Time in seconds 4.5 Partition (CPU) Radix Join (CPU/GPU) 3.0 1.5 about factor 3.5 0 1 10 100 1000 10000 Number of Partitions in Clustering Step
  • 28. Asymmetric Foreign-Key Joins on GPUs • Message: Clustering improves performance on GPUs • But: Cluster Costs don’t pay off in Cache Friendliness • But what about parallel processing of Partitions?
  • 29. But what about Parallelization 1.00 Total Costs Data transfer Costs Processing time in Seconds 0.75 about 41 % (not enough) 0.50 0.25 0 1 10 100 1000 Number of Utilized Cores per Processor (Work Group Size)
  • 30. Foreign-Key Joins on GPUs vs CPUs 1.100 Intel i7 860 Processing Time in Seconds ATI Radeon HD 5850 0.825 PCI-E Data Transfer Costs 0.550 0.275 0 1K 10K 100K 1000K 10000K 100000K Dimension Table Size
  • 31. Foreign-Key Joins on GPUs vs CPUs 1.100 Intel i7 860 Processing Time in Seconds ATI Radeon HD 5850 (processing only) 0.825 PCI-E Data Transfer Costs 0.550 0.275 0 1K 10K 100K 1000K 10000K 100000K Dimension Table Size
  • 32. Conclusion • Unclustered FK-Joins Cause Heavy Cache Thrashing • Thrashing can be offloaded to fast GPU memory • PCI-E Bandwidth can be saturated without Clustering • Naïve FK-Joins on GPUs outperform Clustered Joins on CPUs
  • 33. Thank you for Listening
  • 34. Questions? Feedback?