Presented by Holger Pirk at the Second International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS 2011), held in conjunction with VLDB 2011 in Seattle, WA, on September 2, 2011.
Publication: http://bit.ly/za1DDF
Abstract:
Indexed Foreign-Key Joins expose a very asymmetric access pattern: the Foreign-Key Index is sequentially scanned whilst the Primary-Key table is the target of many quasi-random lookups, which is the dominant cost factor. To reduce the costs of the random lookups, the fact table can be (re-)partitioned at runtime to increase access locality on the dimension table, and thus limit the random memory access to inside the CPU's cache. However, this is very hard to optimize, and the performance impact on recent architectures is limited because the partitioning costs consume most of the achievable join improvement.
GPGPUs, on the other hand, have an architecture that is well suited for this operation: a relatively slow connection to the large system memory and a very fast connection to the smaller internal device memory. We show how to accelerate Foreign-Key Joins by executing the random table lookups on the GPU's VRAM while sequentially streaming the Foreign-Key Index through the PCI-E Bus. We also experimentally study the memory access costs on GPU and CPU to provide estimations of the benefit of this technique.
Accelerating Foreign-Key Joins using Asymmetric Memory Channels
1. Accelerating Foreign-Key Joins using Asymmetric Memory Channels
Holger Pirk Stefan Manegold Martin Kersten
holger@cwi.nl manegold@cwi.nl mk@cwi.nl
2. Why?
• Trivia: Joins are important
• But: Many Joins are (Indexed) Foreign Key Joins
• Important example: Data Warehouse Star Joins
• Large Fact Table (many Gigabytes) with Foreign Keys on
• Small Dimension Tables (hundreds of Megabytes)
• Joins are Performed for Projection or Condition Evaluation
3. A History of In Memory Join Processing
(and incidentally of this talk)
• Foreign-Key Joins on CPUs
• Naïve (Indexed Hash) Foreign-Key Joins
• Clustered Foreign-Key Joins
• Foreign-Key Joins on GPGPUs
• Clustered Foreign-Key Joins
• Naïve (Asymmetric) Foreign-Key Joins
4. Premises for this talk
• Focus is on Indexed Foreign Key Joins
• We assume an unclustered Foreign Key Index Column
• i.e., Pointers to Matching Tuples of the Target Table
8. Naïve Foreign-Key Joins on CPUs
• Foreign-Key Index is scanned sequentially
• Target Table is randomly accessed
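The access pattern above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `naive_fk_join` and the sample data are hypothetical, and the index is assumed to hold positions into the target table, as stated in the premises.

```python
from typing import List, Sequence


def naive_fk_join(fk_index: Sequence[int], target_column: Sequence[str]) -> List[str]:
    """Project one target-table column through an unclustered foreign-key index."""
    result = []
    for position in fk_index:                    # sequential scan of the index
        result.append(target_column[position])   # random lookup in the target table
    return result


# Hypothetical data: four dimension rows, six fact rows pointing into them.
dimension_names = ["red", "green", "blue", "black"]
fact_fk_index = [3, 0, 2, 0, 1, 3]
joined = naive_fk_join(fact_fk_index, dimension_names)
```

Each index entry is read exactly once and in order, while the positions it contains jump around the target table.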
9. Naïve Foreign-Key Joins on CPUs
• Optimization Targets:
• Foreign-Key Index is scanned sequentially
• Index Cache lines accessed once, i.e., not much to optimize
10. Naïve Foreign-Key Joins on CPUs
• Large Target Tables cause heavy Cache Thrashing
• Assume a (theoretical) 1-Slot Cache with 2-Value Cache Lines
• How many Cache Misses do you get?
• 80% of the Cache Misses are Capacity Misses
• Target Table Lookups are rewarding optimization Targets
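The thought experiment on this slide (a 1-slot cache with 2-value cache lines) can be replayed with a small simulation. The slide's own example data did not survive in the transcript, so the index below is hypothetical, as is the function name `count_cache_misses`.

```python
from typing import Optional, Sequence


def count_cache_misses(fk_index: Sequence[int], values_per_line: int = 2) -> int:
    """Count misses of a 1-slot cache when looking up each index entry in turn."""
    cached_line: Optional[int] = None       # the single slot holds at most one line
    misses = 0
    for position in fk_index:
        line = position // values_per_line  # cache line holding this target value
        if line != cached_line:             # slot holds another line (or nothing)
            misses += 1
            cached_line = line
    return misses


# Hypothetical index over an 8-value target table (not the slide's data).
example_index = [0, 4, 1, 5, 2, 6, 3, 7]
unclustered_misses = count_cache_misses(example_index)        # every lookup misses
sorted_misses = count_cache_misses(sorted(example_index))     # one miss per line
```

With this index, the unclustered order misses on all 8 lookups, while a sorted order only pays the 4 unavoidable first-touch misses, one per cache line.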
12. Clustered Foreign-Key Joins on CPUs
• Require a Value-partitioned Foreign-Key Index
• Foreign-Key Index Partitions are scanned
• Target Table Random Access Locality is improved
13. Clustered Foreign-Key Joins on CPUs
• Require a Value-partitioned Foreign-Key Index
• Foreign-Key Index Partitions are scanned
• Target Table Random Access Locality is improved
14. Clustered Foreign-Key Joins on CPUs
• What is the number of Target Table Cache Misses?
• So Cache behavior is greatly improved
• Also: (disjoint) Partitions can be processed in parallel
• Parallelism is generally good for GPUs
• But: The Index has to be Value (e.g., Radix)-partitioned first
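A value-partitioning step of this kind might look as follows. This is a sketch under assumptions: partitioning on the most significant bits of each position (so that every partition covers one contiguous range of the target table) is a choice made for this illustration, and the names `radix_partition` and `one_slot_misses` are hypothetical.

```python
from typing import List, Optional, Sequence


def radix_partition(fk_index: Sequence[int], target_size: int,
                    radix_bits: int) -> List[List[int]]:
    """Split the index into 2**radix_bits partitions by the top bits of each position."""
    position_bits = max(1, (target_size - 1).bit_length())
    shift = max(0, position_bits - radix_bits)
    partitions: List[List[int]] = [[] for _ in range(1 << radix_bits)]
    for position in fk_index:
        partitions[position >> shift].append(position)  # same top bits, same range
    return partitions


def one_slot_misses(positions: Sequence[int], values_per_line: int = 2) -> int:
    """Misses of a 1-slot cache with values_per_line values per cache line."""
    cached_line: Optional[int] = None
    misses = 0
    for position in positions:
        line = position // values_per_line
        if line != cached_line:
            misses += 1
            cached_line = line
    return misses


# Hypothetical index over an 8-value target table.
example_index = [0, 4, 1, 5, 2, 6, 3, 7]
partitions = radix_partition(example_index, target_size=8, radix_bits=2)
clustered_order = [p for part in partitions for p in part]
unclustered_misses = one_slot_misses(example_index)
clustered_misses = one_slot_misses(clustered_order)
```

Here each of the four partitions covers exactly one cache line of the target table, so processing the partitions one after another halves the misses in this toy setting. The partitions are disjoint, so they could also be handed to separate workers; the extra pass over the index is the cost the last bullet refers to.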
15. Clustered Foreign-Key Joins on CPUs
[Figure: Processing Time in seconds (0 to 6.0) over Number of Partitions in Clustering Step (1 to 10000); Target Table Size: 64M. Legend: Unpartitioned Join (CPU), Partition (CPU), Join (CPU), Radix Join (CPU). Annotation: about 30 % (consistent with [Blanas11]).]
16. Clustered Foreign-Key Joins on CPUs
[Figure: Partition time in Seconds (0 to 25.00), CPU vs. GPU, for 256 Clusters and 65536 Clusters. Bar labels as extracted: 20.512, 11.040, 6.492860, 1.695271.]
21. Clustered Foreign-Key Joins on GPUs
• Port of the Radix-Join from CPU to GPU [He08]
• Improved Cache Performance is Important on GPUs too
• Allows efficient parallel processing of Partitions
22. Clustered Foreign-Key Joins on GPUs
[Figure: Processing Time in seconds (0 to 6.0) over Number of Partitions in Clustering Step (1 to 10000). Legend: Join (GPU), Partition (CPU), Radix Join (CPU/GPU). Annotations: about 11 %, about 70 %.]
24. Naïve Foreign-Key Joins on GPUs
• Primary Bottleneck is PCI-E, not GPU Memory
• Idea: Accelerate FK Joins by exploiting the Strength of GPUs:
• Random Access to target table
• Limit the PCI-E traffic to a minimum:
• Stream only the Index
25. Asymmetric Foreign-Key Joins on GPUs
• Primary Bottleneck is PCI-E, not GPU Memory
• Idea: Accelerate FK Joins by exploiting the Strength of GPUs:
• Random Access to target table
• Limit the PCI-E traffic to a minimum:
• Stream only the Index
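The idea can be sketched as a chunked streaming loop. This is plain Python standing in for the host-to-device pipeline, not GPU code: the list `device_table` plays the role of the target table resident in GPU memory, each chunk stands in for one PCI-E transfer, and all names (`stream_chunks`, `asymmetric_fk_join`, `transfer_seconds`) and parameter values are hypothetical.

```python
from typing import Iterator, List, Sequence


def stream_chunks(fk_index: Sequence[int], chunk_size: int) -> Iterator[Sequence[int]]:
    """Yield the index in consecutive chunks, standing in for PCI-E transfers."""
    for start in range(0, len(fk_index), chunk_size):
        yield fk_index[start:start + chunk_size]


def asymmetric_fk_join(fk_index: Sequence[int], device_table: Sequence[str],
                       chunk_size: int = 4) -> List[str]:
    """Join with the target table held 'on the device' and the index streamed in."""
    result: List[str] = []
    for chunk in stream_chunks(fk_index, chunk_size):   # sequential, slow channel
        result.extend(device_table[p] for p in chunk)   # random, fast device memory
    return result


def transfer_seconds(num_entries: int, bytes_per_entry: int,
                     bandwidth_bytes_per_s: float) -> float:
    """Time to move the index alone across the bus at the given bandwidth."""
    return num_entries * bytes_per_entry / bandwidth_bytes_per_s


# Hypothetical data: the dimension column is loaded once, the index is streamed.
device_table = ["red", "green", "blue", "black"]
joined = asymmetric_fk_join([3, 0, 2, 0, 1, 3], device_table, chunk_size=4)
```

Only the index crosses the slow channel, and it does so sequentially; `transfer_seconds` gives the corresponding transfer time for a given index size and bus bandwidth, which bounds the join time from below when PCI-E is the bottleneck.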
26. Asymmetric Foreign-Key Joins on GPUs
[Figure: Processing Time in seconds (0 to 6.0) over Number of Partitions in Clustering Step (1 to 10000). Legend: Join (GPU), Partition (CPU), Radix Join (CPU/GPU).]
27. Asymmetric Foreign-Key Joins on GPUs
[Figure: Processing Time in seconds (0 to 6.0) over Number of Partitions in Clustering Step (1 to 10000). Legend: Unpartitioned Join (GPU), Join (GPU), Partition (CPU), Radix Join (CPU/GPU). Annotation: about factor 3.5.]
28. Asymmetric Foreign-Key Joins on GPUs
• Message: Clustering improves performance on GPUs
• But: Cluster Costs don’t pay off in Cache Friendliness
• But what about parallel processing of Partitions?
29. But what about Parallelization?
[Figure: Processing time in Seconds (0 to 1.00) over Number of Utilized Cores per Processor (Work Group Size) (1 to 1000). Legend: Total Costs, Data transfer Costs. Annotation: about 41 % (not enough).]
30. Foreign-Key Joins on GPUs vs CPUs
[Figure: Processing Time in Seconds (0 to 1.100) over Dimension Table Size (1K to 100000K). Legend: Intel i7 860, ATI Radeon HD 5850, PCI-E Data Transfer Costs.]
31. Foreign-Key Joins on GPUs vs CPUs
[Figure: Processing Time in Seconds (0 to 1.100) over Dimension Table Size (1K to 100000K). Legend: Intel i7 860, ATI Radeon HD 5850 (processing only), PCI-E Data Transfer Costs.]
32. Conclusion
• Unclustered FK-Joins Cause Heavy Cache Thrashing
• Thrashing can be offloaded to fast GPU memory
• PCI-E Bandwidth can be saturated without Clustering
• Naïve FK-Joins on GPUs outperform Clustered Joins on CPUs