SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
2019 HPCC
Systems®
Community Day
Challenge Yourself –
Challenge the Status Quo
Leveraging Intra-Node
Parallelization in HPCC Systems
Fabian Fier
Motivation
Parallelize Set Similarity Join
• Many applications need to identify similar
pairs of documents:
• Plagiarism detection
• Community mining in social networks
• Near-duplicate web page detection
• Document clustering
• ...
• Operation: Set similarity join (SSJ)
• Find all pairs of records (r, s) where
sim(r, s) ≥ t (r ∈ R, s ∈ S)
• Nice to have in a distributed system
Leveraging Intra-Node Parallelization in HPCC Systems 3
sr
Naïve Approach to Compute SSJ
• …
• L_R computeSimilarity(L_R r, L_R s) := TRANSFORM
• SELF.RecordId1 := r.RecordId;
• SELF.RecordId2 := s.RecordId;
• SELF.Sim := (compute similarity);
• END;
• …
• resToFilter := JOIN(R, S, TRUE, computeSimilarity(LEFT, RIGHT), ALL);
• result := resToFilter(Sim > 90);
Leveraging Intra-Node Parallelization in HPCC Systems 4
Naïve Approach to Compute SSJ
Leveraging Intra-Node Parallelization in HPCC Systems 5
Naïve Approach to Compute SSJ
Leveraging Intra-Node Parallelization in HPCC Systems 6
Issue a:
Memory
exhaustion due
to too high
replication
Parallelize Filter-and-Verification Approaches
• Use data characteristics to replicate and group independent data (inverted index)
Leveraging Intra-Node Parallelization in HPCC Systems 7
r1 a b e
r2 a d e
r3 b c d e f g
a r1, r2
b r1, r3
c r3
d r2, r3
e r1, r2, r3
f r3
g r3
Parallelize Filter-and-Verification Approaches
Leveraging Intra-Node Parallelization in HPCC Systems 8
Issue b:
Straggling
executors
Issue c:
Not scalable:
only suitable for
small datasets
Cf. Fier et al.:
Set Similarity
Joins on
MapReduce: An
experimental
Survey
Approach
Basic Ideas
Leveraging Intra-Node Parallelization in HPCC Systems 10
1. Global replication and grouping: -> a, c
• Without data dependencies
• Regarding system restrictions (RAM)
2. Use local parallelization more efficiently (> 1 Core per executor) -> b
• Use existing approaches local data structures, accessible my multiple cores
Wish list:
a) Stay in RAM
b) Efficient use of CPUs
c) Scalability to Big
Data
Potential!
Idea 1: Global Replication and Grouping
Leveraging Intra-Node Parallelization in HPCC Systems 11
• Apply hash: no data dependency
• Choose hash such that
• #groups < #executors
• Groups fit into RAM of executor
1 2 3 4
1
2
3
4
p p1 1⋈ p p1 2⋈
p p2 2⋈ p p2 3⋈ p p2 4⋈
p p1 3⋈ p p1 4⋈
p p3 3
⋈ p p3 4⋈
p p4 4⋈
example: self-join
Idea 2: Leverage Local Parallelization
Leveraging Intra-Node Parallelization in HPCC Systems 12
• HPCC Systems allows to have multiple executors per node
• However, executors cannot share data without copying
• Use multithreading in each executor with access to global inverted index
• C++ Std Threads within one executor
• allows fine-granular control over threads, especially regarding pinning to
avoid CPU migrations (NUMA effects)
• Multithreaded user-defined functions are not officially supported… ;-)
• Necessary to write a plugin; embedded code doesn‘t work
Details
Leveraging Intra-Node Parallelization in HPCC Systems 13
Details
Leveraging Intra-Node Parallelization in HPCC Systems 14
• main thread (void ppj2())
• copies input into InputDS: array of struct + pointers to token arrays (necessary
for random access)
• creates inverted index
• spawns threads
• copies threadResults into resultDS dataset
• worker thread
• process batchSize records
• write results back to a shared vector threadResults -> synchronization
necessary
1 2 3 4
1
2
3
4
executors
- InputDS
- Inverted Index
- threadResults
- ResultDS
main thread
worker threads
...
...
Hands On!
Compile and Install Plugin
Leveraging Intra-Node Parallelization in HPCC Systems 16
• Download HPCC source code (same version like on
cluster)
• Make it compile ;-)
• Refer to plugins/exampleplugin
• C++ Mappings in ECL documentation -> „undefined
symbol“
• Add the new plugin to cmake config files
• Compile and deploy .so file to each cluster node
• Cluster in „blocked“ state: pkill on all executors on all
slave nodes
• use DBGLOG() to write to ECL Logs
Monitoring: netdata
Leveraging Intra-Node Parallelization in HPCC Systems 17
Installation: bash <(curl -Ss https://my-netdata.io/kickstart.sh)
Monitoring: netdata
Leveraging Intra-Node Parallelization in HPCC Systems 18
Browser generates graphs. Custom Dashboards showing multiple nodes:
Experiments: Data Scalability
• DBLP dataset 1x-25x
• threshold(Jaccard)=0.7
• numThreads=2
Leveraging Intra-Node Parallelization in HPCC Systems 19
0
20
40
60
80
100
120
0 5 10 15 20 25 30
Runtime(S)
Dataset Scale
Wish list:
a) Stay in RAM
b) Efficient use of CPUs
c) Scalability to Big
Data
Experiments: Thread Scalability
• DBLP dataset 25x
• threshold(Jaccard)=0.7
• numThreads=2-32
Leveraging Intra-Node Parallelization in HPCC Systems 20
90
91
92
93
94
95
96
97
98
99
100
101
0 5 10 15 20 25 30 35
Runtime(S)
Number of Threads / executor
Current Work
Leveraging Intra-Node Parallelization in HPCC Systems 21
• Utilize local parallelization better
• Optimize approach to NUMA effects by pinning threads on cores in one CPU that
share datasets
Lessons Learned
Leveraging Intra-Node Parallelization in HPCC Systems 22
• Less (complexity) is more
• Hash-based replication and grouping is more robust than relying on data
characteristics
• Fine-granular optimizations of filters (filter-and-verification approach) do not
have a big effect on the overall runtime in a distributed environment. In fact,
we didn‘t use any sophisticated filter here.
Thank you!
Special thanks to LexisNexis for providing a research grant
Leveraging Intra-Node Parallelization in HPCC Systems 23
Leveraging Intra-Node Parallelization in HPCC Systems 24
View this presentation on YouTube:
https://www.youtube.com/watch?v=nTWpfa0wdDk&list=PL-
8MJMUpp8IKH5-d56az56t52YccleX5h&index=7&t=0s (3:13)

Weitere ähnliche Inhalte

Was ist angesagt?

Beautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDBBeautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDBleesjensen
 
Clickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaClickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaValery Tkachenko
 
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...InfluxData
 
Devoxx france 2015 influxdb
Devoxx france 2015 influxdbDevoxx france 2015 influxdb
Devoxx france 2015 influxdbNicolas Muller
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce APITom Croucher
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...Altinity Ltd
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015Andra Lungu
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learningSamir Bessalah
 
Getting Ready to Move to InfluxDB 2.0 | Tim Hall | InfluxData
Getting Ready to Move to InfluxDB 2.0 | Tim Hall | InfluxData Getting Ready to Move to InfluxDB 2.0 | Tim Hall | InfluxData
Getting Ready to Move to InfluxDB 2.0 | Tim Hall | InfluxData InfluxData
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0Sigmoid
 
Wayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics DeliveryWayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics DeliveryInfluxData
 
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
 Adventures in Observability: How in-house ClickHouse deployment enabled Inst... Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...Altinity Ltd
 
Introduction to HBase | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to HBase | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to HBase | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to HBase | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Amy W. Tang
 

Was ist angesagt? (20)

Beautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDBBeautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDB
 
Graphite cluster setup blueprint
Graphite cluster setup blueprintGraphite cluster setup blueprint
Graphite cluster setup blueprint
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
 
Clickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaClickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek Vavrusa
 
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
 
Devoxx france 2015 influxdb
Devoxx france 2015 influxdbDevoxx france 2015 influxdb
Devoxx france 2015 influxdb
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
 
Distcp gobblin
Distcp gobblinDistcp gobblin
Distcp gobblin
 
Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learning
 
Getting Ready to Move to InfluxDB 2.0 | Tim Hall | InfluxData
Getting Ready to Move to InfluxDB 2.0 | Tim Hall | InfluxData Getting Ready to Move to InfluxDB 2.0 | Tim Hall | InfluxData
Getting Ready to Move to InfluxDB 2.0 | Tim Hall | InfluxData
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
Wayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics DeliveryWayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics Delivery
 
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
 Adventures in Observability: How in-house ClickHouse deployment enabled Inst... Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
 
Introduction to HBase | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to HBase | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to HBase | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to HBase | Big Data Hadoop Spark Tutorial | CloudxLab
 
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
 
Scala+data
Scala+dataScala+data
Scala+data
 

Ähnlich wie 2019 HPCC Systems Community Day Challenge

2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging振东 刘
 
Cloudstone - Sharpening Your Weapons Through Big Data
Cloudstone - Sharpening Your Weapons Through Big DataCloudstone - Sharpening Your Weapons Through Big Data
Cloudstone - Sharpening Your Weapons Through Big DataChristopher Grayson
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScyllaDB
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsHPCC Systems
 
HPCC Systems 6.0.0 Highlights
HPCC Systems 6.0.0 HighlightsHPCC Systems 6.0.0 Highlights
HPCC Systems 6.0.0 HighlightsHPCC Systems
 
Cassandra
CassandraCassandra
Cassandraexsuns
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialRoger Rafanell Mas
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchDirk Petersen
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In DepthFabio Fumarola
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoNathaniel Braun
 
Cassandra - A decentralized storage system
Cassandra - A decentralized storage systemCassandra - A decentralized storage system
Cassandra - A decentralized storage systemArunit Gupta
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
The Convergence of HPC and Deep Learning
The Convergence of HPC and Deep LearningThe Convergence of HPC and Deep Learning
The Convergence of HPC and Deep Learninginside-BigData.com
 

Ähnlich wie 2019 HPCC Systems Community Day Challenge (20)

2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Intro to HPC
Intro to HPCIntro to HPC
Intro to HPC
 
3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging
 
Cloudstone - Sharpening Your Weapons Through Big Data
Cloudstone - Sharpening Your Weapons Through Big DataCloudstone - Sharpening Your Weapons Through Big Data
Cloudstone - Sharpening Your Weapons Through Big Data
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
HPCC Systems 6.0.0 Highlights
HPCC Systems 6.0.0 HighlightsHPCC Systems 6.0.0 Highlights
HPCC Systems 6.0.0 Highlights
 
Cassandra
CassandraCassandra
Cassandra
 
c-quilibrium R forecasting integration
c-quilibrium R forecasting integrationc-quilibrium R forecasting integration
c-quilibrium R forecasting integration
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorial
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
 
Spark 1.0
Spark 1.0Spark 1.0
Spark 1.0
 
Cassandra - A decentralized storage system
Cassandra - A decentralized storage systemCassandra - A decentralized storage system
Cassandra - A decentralized storage system
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The Convergence of HPC and Deep Learning
The Convergence of HPC and Deep LearningThe Convergence of HPC and Deep Learning
The Convergence of HPC and Deep Learning
 
Presentation
PresentationPresentation
Presentation
 

Mehr von HPCC Systems

Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...HPCC Systems
 
Towards Trustable AI for Complex Systems
Towards Trustable AI for Complex SystemsTowards Trustable AI for Complex Systems
Towards Trustable AI for Complex SystemsHPCC Systems
 
Closing / Adjourn
Closing / Adjourn Closing / Adjourn
Closing / Adjourn HPCC Systems
 
Community Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon CuttingCommunity Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon CuttingHPCC Systems
 
Release Cycle Changes
Release Cycle ChangesRelease Cycle Changes
Release Cycle ChangesHPCC Systems
 
Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index HPCC Systems
 
Advancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningAdvancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningHPCC Systems
 
Expanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesExpanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesHPCC Systems
 
DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch HPCC Systems
 
Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem HPCC Systems
 
Work Unit Analysis Tool
Work Unit Analysis ToolWork Unit Analysis Tool
Work Unit Analysis ToolHPCC Systems
 
Community Award Ceremony
Community Award Ceremony Community Award Ceremony
Community Award Ceremony HPCC Systems
 
Dapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL NeaterDapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL NeaterHPCC Systems
 
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...HPCC Systems
 
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...HPCC Systems
 
Using High Dimensional Representation of Words (CBOW) to Find Domain Based Co...
Using High Dimensional Representation of Words (CBOW) to Find Domain Based Co...Using High Dimensional Representation of Words (CBOW) to Find Domain Based Co...
Using High Dimensional Representation of Words (CBOW) to Find Domain Based Co...HPCC Systems
 
Leveraging HPCC Systems as Part of an Information Security, Privacy, and Comp...
Leveraging HPCC Systems as Part of an Information Security, Privacy, and Comp...Leveraging HPCC Systems as Part of an Information Security, Privacy, and Comp...
Leveraging HPCC Systems as Part of an Information Security, Privacy, and Comp...HPCC Systems
 

Mehr von HPCC Systems (20)

Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...
 
Towards Trustable AI for Complex Systems
Towards Trustable AI for Complex SystemsTowards Trustable AI for Complex Systems
Towards Trustable AI for Complex Systems
 
Welcome
WelcomeWelcome
Welcome
 
Closing / Adjourn
Closing / Adjourn Closing / Adjourn
Closing / Adjourn
 
Community Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon CuttingCommunity Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon Cutting
 
Path to 8.0
Path to 8.0 Path to 8.0
Path to 8.0
 
Release Cycle Changes
Release Cycle ChangesRelease Cycle Changes
Release Cycle Changes
 
Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index
 
Advancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningAdvancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine Learning
 
Docker Support
Docker Support Docker Support
Docker Support
 
Expanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesExpanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network Capabilities
 
DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch
 
Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem
 
Work Unit Analysis Tool
Work Unit Analysis ToolWork Unit Analysis Tool
Work Unit Analysis Tool
 
Community Award Ceremony
Community Award Ceremony Community Award Ceremony
Community Award Ceremony
 
Dapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL NeaterDapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL Neater
 
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
 
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
 
Using High Dimensional Representation of Words (CBOW) to Find Domain Based Co...
Using High Dimensional Representation of Words (CBOW) to Find Domain Based Co...Using High Dimensional Representation of Words (CBOW) to Find Domain Based Co...
Using High Dimensional Representation of Words (CBOW) to Find Domain Based Co...
 
Leveraging HPCC Systems as Part of an Information Security, Privacy, and Comp...
Leveraging HPCC Systems as Part of an Information Security, Privacy, and Comp...Leveraging HPCC Systems as Part of an Information Security, Privacy, and Comp...
Leveraging HPCC Systems as Part of an Information Security, Privacy, and Comp...
 

Kürzlich hochgeladen

Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 

Kürzlich hochgeladen (17)

Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 

2019 HPCC Systems Community Day Challenge

  • 1. 2019 HPCC Systems® Community Day Challenge Yourself – Challenge the Status Quo Leveraging Intra-Node Parallelization in HPCC Systems Fabian Fier
  • 3. Parallelize Set Similarity Join • Many applications need to identify similar pairs of documents: • Plagiarism detection • Community mining in social networks • Near-duplicate web page detection • Document clustering • ... • Operation: Set similarity join (SSJ) • Find all pairs of records (r, s) where sim(r, s) ≥ t (r ∈ R, s ∈ S) • Nice to have in a distributed system Leveraging Intra-Node Parallelization in HPCC Systems 3 sr
  • 4. Naïve Approach to Compute SSJ • … • L_R computeSimilarity(L_R r, L_R s) := TRANSFORM • SELF.RecordId1 := r.RecordId; • SELF.RecordId2 := s.RecordId; • SELF.Sim := (compute similarity); • END; • … • resToFilter := JOIN(R, S, TRUE, computeSimilarity(LEFT, RIGHT), ALL); • result := resToFilter(Sim > 90); Leveraging Intra-Node Parallelization in HPCC Systems 4
  • 5. Naïve Approach to Compute SSJ Leveraging Intra-Node Parallelization in HPCC Systems 5
  • 6. Naïve Approach to Compute SSJ Leveraging Intra-Node Parallelization in HPCC Systems 6 Issue a: Memory exhaustion due to too high replication
  • 7. Parallelize Filter-and-Verification Approaches • Use data characteristics to replicate and group independent data (inverted index) Leveraging Intra-Node Parallelization in HPCC Systems 7 r1 a b e r2 a d e r3 b c d e f g a r1, r2 b r1, r3 c r3 d r2, r3 e r1, r2, r3 f r3 g r3
  • 8. Parallelize Filter-and-Verification Approaches Leveraging Intra-Node Parallelization in HPCC Systems 8 Issue b: Straggling executors Issue c: Not scalable: only suitable for small datasets Cf. Fier et al.: Set Similarity Joins on MapReduce: An experimental Survey
  • 10. Basic Ideas Leveraging Intra-Node Parallelization in HPCC Systems 10 1. Global replication and grouping: -> a, c • Without data dependencies • Regarding system restrictions (RAM) 2. Use local parallelization more efficiently (> 1 Core per executor) -> b • Use existing approaches local data structures, accessible my multiple cores Wish list: a) Stay in RAM b) Efficient use of CPUs c) Scalability to Big Data Potential!
  • 11. Idea 1: Global Replication and Grouping Leveraging Intra-Node Parallelization in HPCC Systems 11 • Apply hash: no data dependency • Choose hash such that • #groups < #executors • Groups fit into RAM of executor 1 2 3 4 1 2 3 4 p p1 1⋈ p p1 2⋈ p p2 2⋈ p p2 3⋈ p p2 4⋈ p p1 3⋈ p p1 4⋈ p p3 3 ⋈ p p3 4⋈ p p4 4⋈ example: self-join
  • 12. Idea 2: Leverage Local Parallelization Leveraging Intra-Node Parallelization in HPCC Systems 12 • HPCC Systems allows to have multiple executors per node • However, executors cannot share data without copying • Use multithreading in each executor with access to global inverted index • C++ Std Threads within one executor • allows fine-granular control over threads, especially regarding pinning to avoid CPU migrations (NUMA effects) • Multithreaded user-defined functions are not officially supported… ;-) • Necessary to write a plugin; embedded code doesn‘t work
  • 14. Details Leveraging Intra-Node Parallelization in HPCC Systems 14 • main thread (void ppj2()) • copies input into InputDS: array of struct + pointers to token arrays (necessary for random access) • creates inverted index • spawns threads • copies threadResults into resultDS dataset • worker thread • process batchSize records • write results back to a shared vector threadResults -> synchronization necessary 1 2 3 4 1 2 3 4 executors - InputDS - Inverted Index - threadResults - ResultDS main thread worker threads ... ...
  • 16. Compile and Install Plugin Leveraging Intra-Node Parallelization in HPCC Systems 16 • Download HPCC source code (same version like on cluster) • Make it compile ;-) • Refer to plugins/exampleplugin • C++ Mappings in ECL documentation -> „undefined symbol“ • Add the new plugin to cmake config files • Compile and deploy .so file to each cluster node • Cluster in „blocked“ state: pkill on all executors on all slave nodes • use DBGLOG() to write to ECL Logs
  • 17. Monitoring: netdata Leveraging Intra-Node Parallelization in HPCC Systems 17 Installation: bash <(curl -Ss https://my-netdata.io/kickstart.sh)
  • 18. Monitoring: netdata Leveraging Intra-Node Parallelization in HPCC Systems 18 Browser generates graphs. Custom Dashboards showing multiple nodes:
  • 19. Experiments: Data Scalability • DBLP dataset 1x-25x • threshold(Jaccard)=0.7 • numThreads=2 Leveraging Intra-Node Parallelization in HPCC Systems 19 0 20 40 60 80 100 120 0 5 10 15 20 25 30 Runtime(S) Dataset Scale Wish list: a) Stay in RAM b) Efficient use of CPUs c) Scalability to Big Data
  • 20. Experiments: Thread Scalability • DBLP dataset 25x • threshold(Jaccard)=0.7 • numThreads=2-32 Leveraging Intra-Node Parallelization in HPCC Systems 20 90 91 92 93 94 95 96 97 98 99 100 101 0 5 10 15 20 25 30 35 Runtime(S) Number of Threads / executor
  • 21. Current Work Leveraging Intra-Node Parallelization in HPCC Systems 21 • Utilize local parallelization better • Optimize approach to NUMA effects by pinning threads on cores in one CPU that share datasets
  • 22. Lessons Learned Leveraging Intra-Node Parallelization in HPCC Systems 22 • Less (complexity) is more • Hash-based replication and grouping is more robust than relying on data characteristics • Fine-granular optimizations of filters (filter-and-verification approach) do not have a big effect on the overall runtime in a distributed environment. In fact, we didn‘t use any sophisticated filter here.
  • 23. Thank you! Special thanks to LexisNexis for providing a research grant Leveraging Intra-Node Parallelization in HPCC Systems 23
  • 24. Leveraging Intra-Node Parallelization in HPCC Systems 24 View this presentation on YouTube: https://www.youtube.com/watch?v=nTWpfa0wdDk&list=PL- 8MJMUpp8IKH5-d56az56t52YccleX5h&index=7&t=0s (3:13)

Hinweis der Redaktion

  1. 5 executors/node 24 cores (4 CPUs)
  2. es ist nicht die Synchronisation; selbst ohne geht’s nicht schneller