This document discusses leveraging intra-node parallelization in HPCC Systems to improve the performance of set similarity joins (SSJ). It describes a naïve approach to computing SSJ that suffers from memory exhaustion and straggling executors. The presented approach replicates and groups independent data using hashing to address these issues while enabling efficient use of multiple CPU cores through multithreading. Experiments show the approach scales to larger datasets and achieves better performance by increasing the number of threads per executor. Lessons learned include that less complex optimizations are more robust in distributed environments.
3. Parallelize Set Similarity Join
• Many applications need to identify similar pairs of documents:
• Plagiarism detection
• Community mining in social networks
• Near-duplicate web page detection
• Document clustering
• ...
• Operation: Set similarity join (SSJ)
• Find all pairs of records (r, s) where sim(r, s) ≥ t (r ∈ R, s ∈ S)
• Nice to have in a distributed system
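The SSJ predicate above can be sketched in a few lines of C++; this is a minimal illustration using Jaccard similarity on sorted token-id sets (the function names `jaccard` and `ssj_match` are illustrative, not from HPCC Systems):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Jaccard similarity of two sorted token-id sets: |r ∩ s| / |r ∪ s|
double jaccard(const std::vector<int>& r, const std::vector<int>& s) {
    std::vector<int> inter;
    std::set_intersection(r.begin(), r.end(), s.begin(), s.end(),
                          std::back_inserter(inter));
    double i = static_cast<double>(inter.size());
    double u = static_cast<double>(r.size() + s.size()) - i;
    return u == 0.0 ? 1.0 : i / u;
}

// SSJ predicate: keep the pair (r, s) iff sim(r, s) >= t
bool ssj_match(const std::vector<int>& r, const std::vector<int>& s, double t) {
    return jaccard(r, s) >= t;
}
```

An SSJ then enumerates candidate pairs from R × S and keeps those for which `ssj_match` holds; the rest of the talk is about doing that enumeration efficiently in a distributed system.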
Leveraging Intra-Node Parallelization in HPCC Systems 3
5. Naïve Approach to Compute SSJ
6. Naïve Approach to Compute SSJ
Issue a: memory exhaustion due to excessive replication
7. Parallelize Filter-and-Verification Approaches
• Use data characteristics to replicate and group independent data (inverted index)
Records:
r1: a b e
r2: a d e
r3: b c d e f g

Inverted index (token -> record ids):
a -> r1, r2
b -> r1, r3
c -> r3
d -> r2, r3
e -> r1, r2, r3
f -> r3
g -> r3
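Building such an inverted index is straightforward; a minimal C++ sketch matching the r1/r2/r3 example above (records are given as id/token-string pairs, a hypothetical representation for illustration):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Build an inverted index token -> list of record ids, as in the
// r1/r2/r3 example (e.g. token 'e' maps to r1, r2, r3).
std::map<char, std::vector<std::string>>
build_inverted_index(
    const std::vector<std::pair<std::string, std::string>>& records) {
    std::map<char, std::vector<std::string>> index;
    for (const auto& rec : records)       // rec.first = id, rec.second = tokens
        for (char tok : rec.second)
            index[tok].push_back(rec.first);
    return index;
}
```

Each index entry groups exactly the records that share a token, so only records appearing in a common posting list can reach the similarity threshold.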
8. Parallelize Filter-and-Verification Approaches
Issue b: straggling executors
Issue c: not scalable; only suitable for small datasets
Cf. Fier et al.: "Set Similarity Joins on MapReduce: An Experimental Survey"
10. Basic Ideas
1. Global replication and grouping -> addresses issues a, c
• Without data dependencies
• Respecting system restrictions (RAM)
2. Use local parallelization more efficiently (> 1 core per executor) -> addresses issue b
• Use the existing approaches' local data structures, accessible by multiple cores
Wish list:
a) Stay in RAM
b) Efficient use of CPUs
c) Scalability to Big Data
Potential!
11. Idea 1: Global Replication and Grouping
• Apply hash: no data dependency
• Choose hash such that
• #groups < #executors
• Groups fit into RAM of executor
Example: self-join with four partitions p1–p4. Each cell pi ⋈ pj is one independent group; only the upper triangle is needed, since pi ⋈ pj = pj ⋈ pi:

p1⋈p1  p1⋈p2  p1⋈p3  p1⋈p4
       p2⋈p2  p2⋈p3  p2⋈p4
              p3⋈p3  p3⋈p4
                     p4⋈p4
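The hash-based grouping for the self-join case can be sketched as follows; `partition_of` and `self_join_groups` are illustrative names, and `std::hash` stands in for whatever hash the system actually uses:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hash each record into one of k partitions -- no data dependency,
// unlike grouping by data characteristics.
std::size_t partition_of(const std::string& record, std::size_t k) {
    return std::hash<std::string>{}(record) % k;
}

// Enumerate the k*(k+1)/2 partition pairs a self-join must compare;
// each pair (i, j) with i <= j becomes one independent group for an
// executor, and k is chosen so every group fits into executor RAM.
std::vector<std::pair<std::size_t, std::size_t>>
self_join_groups(std::size_t k) {
    std::vector<std::pair<std::size_t, std::size_t>> groups;
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = i; j < k; ++j)
            groups.emplace_back(i, j);  // group p_i ⋈ p_j
    return groups;
}
```

With k = 4 this yields the ten groups of the upper-triangular grid shown above.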
12. Idea 2: Leverage Local Parallelization
• HPCC Systems allows multiple executors per node
• However, executors cannot share data without copying
• Use multithreading in each executor with access to a global inverted index
• C++ std::thread within one executor
• Allows fine-grained control over threads, especially pinning them to
avoid CPU migrations (NUMA effects)
• Multithreaded user-defined functions are not officially supported… ;-)
• Necessary to write a plugin; embedded code doesn't work
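A minimal sketch of the threading pattern inside one executor: worker threads share a read-only structure (standing in for the global inverted index) without copying, and each thread is pinned best-effort to a core via `pthread_setaffinity_np`. The names `pin_to_core` and `parallel_count` are illustrative, not the actual plugin API:

```cpp
#include <pthread.h>
#include <sched.h>
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Best-effort pinning of a thread to one core, to avoid CPU
// migrations (NUMA effects); failures are ignored, e.g. in
// restricted environments.
void pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

// numThreads workers read disjoint slices of one shared, read-only
// structure -- here just summing it, as a stand-in for probing a
// shared inverted index without copying it per thread.
long parallel_count(const std::vector<int>& shared_index, int numThreads) {
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    std::size_t chunk =
        (shared_index.size() + numThreads - 1) / numThreads;
    for (int w = 0; w < numThreads; ++w) {
        workers.emplace_back([&, w] {
            std::size_t begin = w * chunk;
            std::size_t end = std::min(begin + chunk, shared_index.size());
            long local = 0;
            for (std::size_t i = begin; i < end; ++i)
                local += shared_index[i];  // read-only, no copy
            total += local;
        });
        pin_to_core(workers.back(), w);
    }
    for (auto& t : workers) t.join();
    return total.load();
}
```

Keeping per-thread results local and merging once at the end avoids contention on the shared counter.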
16. Compile and Install Plugin
• Download the HPCC source code (same version as on the cluster)
• Make it compile ;-)
• Refer to plugins/exampleplugin
• C++ mappings in the ECL documentation -> "undefined symbol"
• Add the new plugin to the CMake config files
• Compile and deploy the .so file to each cluster node
• Cluster in "blocked" state: pkill the executors on all slave nodes
• Use DBGLOG() to write to the ECL logs
19. Experiments: Data Scalability
• DBLP dataset 1x-25x
• threshold(Jaccard)=0.7
• numThreads=2
[Chart: runtime in seconds (y-axis, 0–120) vs. dataset scale (x-axis, 0–30)]
Wish list:
a) Stay in RAM
b) Efficient use of CPUs
c) Scalability to Big Data
21. Current Work
• Utilize local parallelization better
• Optimize the approach for NUMA effects by pinning threads that share datasets to cores of one CPU
22. Lessons Learned
• Less (complexity) is more
• Hash-based replication and grouping is more robust than relying on data characteristics
• Fine-grained optimizations of filters (filter-and-verification approach) do not have a big effect on the overall runtime in a distributed environment; in fact, we didn't use any sophisticated filter here
23. Thank you!
Special thanks to LexisNexis for providing a research grant
24. Leveraging Intra-Node Parallelization in HPCC Systems
View this presentation on YouTube:
https://www.youtube.com/watch?v=nTWpfa0wdDk&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=7&t=0s (3:13)
Editor's note:
5 executors/node, 24 cores (4 CPUs)
It is not the synchronization; even without it, it does not get faster.