This document discusses leveraging intra-node parallelization in HPCC Systems to improve the performance of set similarity joins (SSJ). It describes a naïve approach to computing SSJ that suffers from memory exhaustion and straggling executors. The presented approach replicates and groups independent data using hashing to address these issues while enabling efficient use of multiple CPU cores through multithreading. Experiments show the approach scales to larger datasets and achieves better performance by increasing the number of threads per executor. Lessons learned include that less complex optimizations are more robust in distributed environments.
3. Parallelize Set Similarity Join
• Many applications need to identify similar pairs of documents:
• Plagiarism detection
• Community mining in social networks
• Near-duplicate web page detection
• Document clustering
• ...
• Operation: Set similarity join (SSJ)
• Find all pairs of records (r, s) where sim(r, s) ≥ t (r ∈ R, s ∈ S)
• Nice to have in a distributed system
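The SSJ predicate above can be sketched in a few lines of C++; this is a minimal illustration using Jaccard similarity on sorted token-id sets (the function names `jaccard` and `ssj_match` are illustrative, not from HPCC Systems):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Jaccard similarity of two sorted token-id sets: |r ∩ s| / |r ∪ s|
double jaccard(const std::vector<int>& r, const std::vector<int>& s) {
    std::vector<int> inter;
    std::set_intersection(r.begin(), r.end(), s.begin(), s.end(),
                          std::back_inserter(inter));
    double i = static_cast<double>(inter.size());
    double u = static_cast<double>(r.size() + s.size()) - i;
    return u == 0.0 ? 1.0 : i / u;
}

// SSJ predicate: keep the pair (r, s) iff sim(r, s) >= t
bool ssj_match(const std::vector<int>& r, const std::vector<int>& s, double t) {
    return jaccard(r, s) >= t;
}
```

An SSJ then enumerates candidate pairs from R × S and keeps those for which `ssj_match` holds; the rest of the talk is about doing that enumeration efficiently in a distributed system.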
Leveraging Intra-Node Parallelization in HPCC Systems 3
5. Naïve Approach to Compute SSJ
6. Naïve Approach to Compute SSJ
Issue a: memory exhaustion due to excessive replication
7. Parallelize Filter-and-Verification Approaches
• Use data characteristics to replicate and group independent data (inverted index)
Records:
r1: a b e
r2: a d e
r3: b c d e f g

Inverted index (token -> record ids):
a -> r1, r2
b -> r1, r3
c -> r3
d -> r2, r3
e -> r1, r2, r3
f -> r3
g -> r3
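Building such an inverted index is straightforward; a minimal C++ sketch matching the r1/r2/r3 example above (records are given as id/token-string pairs, a hypothetical representation for illustration):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Build an inverted index token -> list of record ids, as in the
// r1/r2/r3 example (e.g. token 'e' maps to r1, r2, r3).
std::map<char, std::vector<std::string>>
build_inverted_index(
    const std::vector<std::pair<std::string, std::string>>& records) {
    std::map<char, std::vector<std::string>> index;
    for (const auto& rec : records)       // rec.first = id, rec.second = tokens
        for (char tok : rec.second)
            index[tok].push_back(rec.first);
    return index;
}
```

Each index entry groups exactly the records that share a token, so only records appearing in a common posting list can reach the similarity threshold.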
8. Parallelize Filter-and-Verification Approaches
Issue b: straggling executors
Issue c: not scalable; only suitable for small datasets
Cf. Fier et al.: "Set Similarity Joins on MapReduce: An Experimental Survey"
10. Basic Ideas
1. Global replication and grouping -> addresses issues a, c
• Without data dependencies
• Respecting system restrictions (RAM)
2. Use local parallelization more efficiently (> 1 core per executor) -> addresses issue b
• Use the existing approaches' local data structures, accessible by multiple cores
Wish list:
a) Stay in RAM
b) Efficient use of CPUs
c) Scalability to Big Data
Potential!
11. Idea 1: Global Replication and Grouping
• Apply hash: no data dependency
• Choose hash such that
• #groups < #executors
• Groups fit into RAM of executor
Example: self-join with four partitions p1–p4. Each cell pi ⋈ pj is one independent group; only the upper triangle is needed, since pi ⋈ pj = pj ⋈ pi:

p1⋈p1  p1⋈p2  p1⋈p3  p1⋈p4
       p2⋈p2  p2⋈p3  p2⋈p4
              p3⋈p3  p3⋈p4
                     p4⋈p4
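The hash-based grouping for the self-join case can be sketched as follows; `partition_of` and `self_join_groups` are illustrative names, and `std::hash` stands in for whatever hash the system actually uses:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hash each record into one of k partitions -- no data dependency,
// unlike grouping by data characteristics.
std::size_t partition_of(const std::string& record, std::size_t k) {
    return std::hash<std::string>{}(record) % k;
}

// Enumerate the k*(k+1)/2 partition pairs a self-join must compare;
// each pair (i, j) with i <= j becomes one independent group for an
// executor, and k is chosen so every group fits into executor RAM.
std::vector<std::pair<std::size_t, std::size_t>>
self_join_groups(std::size_t k) {
    std::vector<std::pair<std::size_t, std::size_t>> groups;
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = i; j < k; ++j)
            groups.emplace_back(i, j);  // group p_i ⋈ p_j
    return groups;
}
```

With k = 4 this yields the ten groups of the upper-triangular grid shown above.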
12. Idea 2: Leverage Local Parallelization
• HPCC Systems allows multiple executors per node
• However, executors cannot share data without copying
• Use multithreading in each executor with access to a global inverted index
• C++ std::thread within one executor
• Allows fine-grained control over threads, especially pinning them to
avoid CPU migrations (NUMA effects)
• Multithreaded user-defined functions are not officially supported… ;-)
• Necessary to write a plugin; embedded code doesn't work
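A minimal sketch of the threading pattern inside one executor: worker threads share a read-only structure (standing in for the global inverted index) without copying, and each thread is pinned best-effort to a core via `pthread_setaffinity_np`. The names `pin_to_core` and `parallel_count` are illustrative, not the actual plugin API:

```cpp
#include <pthread.h>
#include <sched.h>
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Best-effort pinning of a thread to one core, to avoid CPU
// migrations (NUMA effects); failures are ignored, e.g. in
// restricted environments.
void pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

// numThreads workers read disjoint slices of one shared, read-only
// structure -- here just summing it, as a stand-in for probing a
// shared inverted index without copying it per thread.
long parallel_count(const std::vector<int>& shared_index, int numThreads) {
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    std::size_t chunk =
        (shared_index.size() + numThreads - 1) / numThreads;
    for (int w = 0; w < numThreads; ++w) {
        workers.emplace_back([&, w] {
            std::size_t begin = w * chunk;
            std::size_t end = std::min(begin + chunk, shared_index.size());
            long local = 0;
            for (std::size_t i = begin; i < end; ++i)
                local += shared_index[i];  // read-only, no copy
            total += local;
        });
        pin_to_core(workers.back(), w);
    }
    for (auto& t : workers) t.join();
    return total.load();
}
```

Keeping per-thread results local and merging once at the end avoids contention on the shared counter.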
16. Compile and Install Plugin
• Download the HPCC source code (same version as on the cluster)
• Make it compile ;-)
• Refer to plugins/exampleplugin
• C++ mappings in the ECL documentation -> "undefined symbol"
• Add the new plugin to the CMake config files
• Compile and deploy the .so file to each cluster node
• Cluster in "blocked" state: pkill the executors on all slave nodes
• Use DBGLOG() to write to the ECL logs
19. Experiments: Data Scalability
• DBLP dataset 1x-25x
• threshold(Jaccard)=0.7
• numThreads=2
[Chart: runtime in seconds (y-axis, 0–120) vs. dataset scale (x-axis, 0–30)]
Wish list:
a) Stay in RAM
b) Efficient use of CPUs
c) Scalability to Big Data
21. Current Work
• Utilize local parallelization better
• Optimize the approach for NUMA effects by pinning threads that share datasets to cores of one CPU
22. Lessons Learned
• Less (complexity) is more
• Hash-based replication and grouping is more robust than relying on data characteristics
• Fine-grained optimizations of filters (filter-and-verification approach) do not have a big effect on the overall runtime in a distributed environment; in fact, we didn't use any sophisticated filter here
23. Thank you!
Special thanks to LexisNexis for providing a research grant
24. Leveraging Intra-Node Parallelization in HPCC Systems
View this presentation on YouTube:
https://www.youtube.com/watch?v=nTWpfa0wdDk&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=7&t=0s (3:13)
Editor's note:
5 executors/node, 24 cores (4 CPUs)
It is not the synchronization; even without it, it does not get faster.