Probabilistic Join Optimization
Deterministic Output From A Probabilistic Approach
OVERVIEW
• Overview of Probabilistic Algorithms
• Problem Statement – Joining Large Spark DataFrames
• Intuitive Approach
• Bloom Filter Background
• Bloom Filter Application
• Tuning False-Positives
• Performance Results
INTRODUCTION
• SaaS Marketing Analytics platform for
large Advertisers and Ad Agencies
• Machine Learning-driven analytics for
online and offline media
• Lots of data:
• Online Media (Impressions, Clicks)
• Offline Media (TV, Radio, etc)
• Conversions (Online / offline)
• Exogenous Data (Weather, Stock, etc.)
• Presenters:
• Grant Kushida (Head of Engineering)
• Vish Mandapaka (Principal Engineer)
PROBABILISTIC ALGORITHMS
• With Big Data clusters, computing exact answers on huge datasets is
possible
• But:
• Do you really *need* the exact answer?
• Approximations are often “good enough”
• Approximations are often *much cheaper*
• Use hashing, sketches and other math tricks
• In general, the trade-off is among:
• Space (memory)
• Accuracy
• Execution time
• In some scenarios, you can trade off only space, preserving accuracy
while still reducing time
COMMON PROBABILISTIC ALGORITHMS
• Cardinality Estimation
• Counting Uniques (Users, etc.)
• Brute-Force: Store every value
• HyperLogLog: Use hashes to update
a fixed-size buffer
• Top-K Estimation
• Top Posters, Campaigns, etc
• Brute-Force: Aggregate and sort;
need to store each value
• Count-Min: Use hashes to increment
a fixed number of counters
• Set Similarity
• Document similarity, etc.
• Brute-Force: Jaccard Similarity;
need to compute intersection
• Min-Hash: Use hashes to
estimate intersection
• Set Membership
• De-duping, exactly-once, etc.
• Brute-Force: Store every value
for exact-match lookup
• Bloom Filter: Use hashes to
update a fixed-size filter
PROBLEM STATEMENT
• Join large datasets
• Ad Tech: Impressions and Clicks
• Marketing Analytics: Media and
Conversions
• E-Commerce: Visitors and Buyers
• Etc.
[Diagram: Set A and Set B overlapping in A ∩ B]
• Data characteristics:
• Joined by key (e.g. User ID)
• Relatively small overlap
• Need to output additional columns
from both sets
• Un-sorted
• Problems:
• Jobs running out of memory
• Jobs taking too long
• Too much $$$ to run all the nodes
• Causes:
• Partition Skew
• Excess Shuffling
NAIVE APPROACH: Spark DataFrame Join
• Two unsorted DataFrames
• Relatively small overlap
• Spark Optimizer chooses Sort-Merge Join
[Diagram: unsorted Set A and Set B overlapping in A ∩ B]
NAIVE APPROACH: Spark DataFrame Join
• Split into partitions by join key
• Will shuffle data across nodes
• Potentially a lot of data transfer
[Diagram: unsorted Set A and Set B split into unsorted partitions A0, A1, … Ax and B0, B1, … Bx]
NAIVE APPROACH: Spark DataFrame Join
• Sort each partition by join key
• Parallelized, but still time-consuming
[Diagram: partitions A0, A1, … Ax and B0, B1, … Bx, now sorted]
NAIVE APPROACH: Spark DataFrame Join
• Merge partitions from Set A and Set B
• Find common join keys
[Diagram: sorted partition pairs merged into A0 ∩ B0, A1 ∩ B1, … Ax ∩ Bx]
NAIVE APPROACH: Spark DataFrame Join
• Write output to storage
• Parallelized
• Each partition is sorted
[Diagram: merged results A0 ∩ B0, A1 ∩ B1, … Ax ∩ Bx written to storage]
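The partition, sort, merge steps walked through above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the function names (`partition_by_key`, `merge_sorted`, `sort_merge_join`) and the sample rows are hypothetical, and everything runs in one process instead of across executors.

```python
from collections import defaultdict


def partition_by_key(rows, num_partitions):
    """Shuffle step: route each (key, value) row to a partition by key hash."""
    parts = defaultdict(list)
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def merge_sorted(left, right):
    """Merge step: walk two key-sorted lists and emit rows with equal keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Emit every pairing, carrying columns from both sides.
            for _, a in left[i:i_end]:
                for _, b in right[j:j_end]:
                    out.append((key, a, b))
            i, j = i_end, j_end
    return out


def sort_merge_join(set_a, set_b, num_partitions=4):
    parts_a = partition_by_key(set_a, num_partitions)
    parts_b = partition_by_key(set_b, num_partitions)
    joined = []
    for p in range(num_partitions):
        # Every row is sorted, including rows that will never match.
        left = sorted(parts_a.get(p, []), key=lambda row: row[0])
        right = sorted(parts_b.get(p, []), key=lambda row: row[0])
        joined.extend(merge_sorted(left, right))
    return joined


# Hypothetical sample data: (user_id, payload)
set_a = [(7, "imp-7"), (3, "imp-3"), (9, "imp-9"), (1, "imp-1")]
set_b = [(3, "click-3"), (8, "click-8"), (9, "click-9")]
result = sorted(sort_merge_join(set_a, set_b))
```

Note that all rows on both sides are shuffled and sorted even though only keys 3 and 9 end up in the output.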
INTUITIVE OPTIMIZATION
• Lots of unnecessary sorting
• We want to sort less…
• Can we eliminate some data up-front, without compromising the result?
[Diagram: sorting all of Set A and Set B, versus first crossing out (X) the parts of each set outside A ∩ B]
BLOOM FILTER APPROACH
• Approximate Set Membership
• Probabilistically remove data from
either (or both) sides of the join
• Bloom Filters:
• Can approximate set membership
• Err only on the False Positive side
(item is not actually in set)
• We are going to join anyway, so
false-positives are OK
[Diagram: build Filter A from Set A and apply it to Set B to write Set B’; build Filter B from Set B and apply it to Set A to write Set A’. Set A’ and Set B’ each contain A ∩ B plus false positives]
BLOOM FILTER
• Burton Howard Bloom – 1970
• Space-efficient means of testing
elements in a set:
• Hyphenation
• Spell-checking
Filter
• Fixed number of bits (m)
Hashes
• Uniform distribution
• Range of m distinct values
• Not necessarily cryptographic
• Not necessarily different
algorithms
BLOOM FILTER - CONSTRUCTION
Adding a value:
• Allocate m bits
• Compute k hashes
• Set k bits in the filter
Repeat for all values in
the set
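A minimal sketch of the construction steps, assuming a Python integer as the m-bit bitmap and k hashes derived from a single algorithm (SHA-256) by salting with the hash index. The class name `BloomFilter`, the sizes, and the sample user IDs are hypothetical choices for illustration only.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m      # number of bits in the filter
        self.k = k      # number of hashes per value
        self.bits = 0   # Python int used as an m-bit bitmap

    def positions(self, value):
        """Compute k bit positions in the range [0, m) for a value."""
        out = []
        for i in range(self.k):
            # Same algorithm each time; the index i acts as a salt.
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            out.append(int.from_bytes(digest[:8], "big") % self.m)
        return out

    def add(self, value):
        """Set the k bits for this value."""
        for pos in self.positions(value):
            self.bits |= 1 << pos


bf = BloomFilter(m=64, k=3)
for user_id in ["u1", "u2", "u3"]:
    bf.add(user_id)

# At most k bits are set per value; collisions can make it fewer.
set_bits = bin(bf.bits).count("1")
```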
BLOOM FILTER - EVALUATION
Example:
m = 16 bits
k = 3 (hex values)
• Set bits for first item
• Repeat for all items (3 items)
• True Positive: all 3 bits set
• True Negative: 1 of 3 bits set
• False Positive: 3 of 3 bits set, but not present in initial set
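The m = 16, k = 3 example can be reproduced with hard-coded hash outputs. The item names and hex positions below are made up for illustration (the slide images with the actual values are not in this transcript); they are chosen so that one lookup is a true negative with 1 of 3 bits set and another is a false positive.

```python
M = 16  # bits in the filter

# Hypothetical hash outputs: each item maps to k = 3 hex digits (bit positions).
HASHES = {
    "alpha":   (0x1, 0x5, 0xC),
    "bravo":   (0x3, 0x5, 0x9),
    "charlie": (0x0, 0x9, 0xE),
    "delta":   (0x1, 0x7, 0xA),  # never added
    "echo":    (0x3, 0xC, 0xE),  # never added
}


def add(bits, item):
    """Return the filter with the item's 3 bits set."""
    for pos in HASHES[item]:
        bits |= 1 << pos
    return bits


def might_contain(bits, item):
    """True only if all 3 of the item's bits are set."""
    return all((bits >> pos) & 1 for pos in HASHES[item])


bits = 0
for item in ("alpha", "bravo", "charlie"):
    bits = add(bits, item)

true_positive = might_contain(bits, "alpha")   # added, all 3 bits set
true_negative = might_contain(bits, "delta")   # only bit 0x1 is set
false_positive = might_contain(bits, "echo")   # bits set by other items
```

Here `echo` passes the test only because `bravo`, `alpha`, and `charlie` happened to set bits 0x3, 0xC, and 0xE between them.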
BLOOM FILTER – EVALUATION (DISTRIBUTED)
• Evaluation can be
distributed and
executed in parallel
• Filter is:
• Small
• Immutable
• Easy to serialize
BLOOM FILTER – CONSTRUCTION (DISTRIBUTED)
• Construction can be
partially-distributed
• But, filters must be
consolidated
• Consolidate via
bitwise OR
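A small sketch of why consolidation via bitwise OR works: a filter built per partition and then OR-ed together is identical to a filter built over all keys at once. The partition contents, sizes, and helper names (`positions`, `build_filter`) are assumptions for illustration; the partial filters must share the same m, k, and hash functions.

```python
import hashlib
from functools import reduce

M, K = 1024, 3  # every partial filter uses the same size and hashes


def positions(key):
    """K bit positions in [0, M) for a key, from salted SHA-256."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def build_filter(keys):
    """Build a filter (as an int bitmap) from an iterable of keys."""
    bits = 0
    for key in keys:
        for pos in positions(key):
            bits |= 1 << pos
    return bits


# Hypothetical partitions; "u4" appears in two of them, which is harmless.
partitions = [["u1", "u4"], ["u2", "u4"], ["u3"]]

partial = [build_filter(p) for p in partitions]   # built independently
merged = reduce(lambda a, b: a | b, partial, 0)   # consolidated via OR
single = build_filter(k for p in partitions for k in p)
```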
FILTERED JOIN APPROACH
• Build Bloom Filter from Set A
• Evaluate all keys in Set B
• Remove any keys not in Set A
• Keep a few keys not in Set A (false-positives)
• Execute the Join
• Remove the false-positives
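The whole approach can be sketched end to end in plain Python, with dictionaries standing in for the two DataFrames. All names, sizes, and data here are hypothetical. The point it shows is that the join over the pre-filtered Set B’ gives exactly the same result as the join over Set B, because the filter never drops a key that is in Set A.

```python
import hashlib

M, K = 4096, 3


def positions(key):
    """K bit positions in [0, M) for a key, from salted SHA-256."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def build_filter(keys):
    bits = 0
    for key in keys:
        for pos in positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits, key):
    return all((bits >> pos) & 1 for pos in positions(key))


# Hypothetical data: key -> extra column
set_a = {key: f"media-{key}" for key in range(0, 1000, 10)}
set_b = {key: f"conv-{key}" for key in range(0, 5000, 7)}

# 1. Build the filter from Set A's keys.
filter_a = build_filter(set_a)

# 2. Evaluate every key in Set B; keep matches plus a few false positives.
b_prime = {k: v for k, v in set_b.items() if might_contain(filter_a, k)}

# 3. Execute the join; the exact key comparison drops the false positives.
joined = {k: (set_a[k], b_prime[k]) for k in b_prime if k in set_a}

# Reference result: the same join without pre-filtering.
exact = {k: (set_a[k], set_b[k]) for k in set_b if k in set_a}
```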
FILTERED JOIN I: BUILD FILTER
• Can build in parallel
• No need to co-locate keys
• Need enough memory to allocate entire filter in each executor
[Diagram: unsorted Set A split into partitions A0, A1, … Ax]
FILTERED JOIN I: BUILD FILTER
• Compute hashes and set bits for each key
• No impact of setting same key in multiple filters
[Diagram: each partition A0, A1, … Ax of unsorted Set A is hashed into its own filter: Filter A0, Filter A1, … Filter Ax]
FILTERED JOIN I: BUILD FILTER
• Merge all the filters
• Eventually requires merging into one filter
• Can be a bottleneck for large filters
[Diagram: Filter A0, Filter A1, … Filter Ax are combined with OR into a single Filter A]
FILTERED JOIN II: APPLY FILTER
• Apply the filter to each key in Set B
• Need to distribute filter bits to each executor
[Diagram: Filter A is sent to the partitions B0, B1, … Bx of unsorted Set B]
FILTERED JOIN II: APPLY FILTER
• Compute hashes and remove keys
[Diagram: Filter A is applied to partitions B0, B1, … Bx, producing filtered partitions B0’, B1’, … Bx’]
FILTERED JOIN II: APPLY FILTER
• Collect Set B’
[Diagram: filtered partitions B0’, B1’, … Bx’ are written out as Set B’]
FILTERED JOIN III: EXECUTE JOIN
Set B’ is now (significantly) smaller:
• n’ = matches + false-positives (the other n – n’ keys are removed)
• % filtered = 1– (overlap % + false-positive %)
Join will match all of the keys deterministically
No loss of accuracy from false-positives (only a loss of filtering efficiency)
[Diagram: Set A’ and Set B’ are Set A and Set B with the non-matching keys filtered out; each contains A ∩ B plus false positives]
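A quick worked example of the size of Set B’, under the assumption that the false-positive rate applies to the keys of Set B that are not in Set A. The function name `filtered_size` and the input numbers are hypothetical, not results from the talk.

```python
def filtered_size(n, overlap, fp_rate):
    """Estimate rows kept in Set B' and the fraction of Set B filtered out.

    n        -- rows in Set B
    overlap  -- fraction of Set B whose keys are in Set A
    fp_rate  -- filter false-positive rate on the non-matching keys
    """
    matches = n * overlap
    false_positives = (n - matches) * fp_rate
    n_prime = matches + false_positives
    pct_filtered = 1 - n_prime / n
    return n_prime, pct_filtered


# Hypothetical inputs: 1M rows, 5% overlap, 1% false-positive rate.
n_prime, pct_filtered = filtered_size(1_000_000, overlap=0.05, fp_rate=0.01)
```

With these inputs roughly 59,500 rows survive the filter, so about 94% of Set B never reaches the shuffle and sort.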
FALSE POSITIVE TUNING
• Important Numbers:
• n: number of items
• m: number of bits in filter
• k: number of hashes
• r: false-positive rate
• False-Positive Rate:
$r = \left(1 - e^{-kn/m}\right)^{k}$
• Given any two, can compute optimal values for the other two
• Generally n is known (or estimated)
• Many libraries will compute
optimal values automatically
• Online calculators available
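A short sketch of the tuning arithmetic, using the false-positive formula above together with the standard Bloom filter sizing results m = -n ln r / (ln 2)^2 and k = (m / n) ln 2. The function names and the example inputs (one million keys, 1% target rate) are assumptions for illustration.

```python
import math


def false_positive_rate(n, m, k):
    """r = (1 - e^(-k*n/m))^k for n items, m bits, k hashes."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_bits(n, r):
    """Smallest m that achieves rate r for n items (with optimal k)."""
    return math.ceil(-n * math.log(r) / (math.log(2) ** 2))


def optimal_hashes(n, m):
    """Number of hashes k that minimizes r for the given n and m."""
    return max(1, round(m / n * math.log(2)))


# Hypothetical inputs: n is known (or estimated), r is the target rate.
n = 1_000_000
m = optimal_bits(n, 0.01)
k = optimal_hashes(n, m)
r = false_positive_rate(n, m, k)
```

For these inputs the filter comes out at roughly 9.6 million bits (a little over 1 MB) with k = 7, which is small enough to ship to every executor.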
THANKS!
Bloom Filter References:
• https://en.wikipedia.org/wiki/Bloom_filter
• https://www.di-mgt.com.au/bloom-calculator.html
• Other algorithms:
• http://dataconomy.com/2017/04/big-data-101-data-structures/
• https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Data Con LA 2018 - Applying Probabilistic Algorithms by Grant Kushida

  • 1. Probabilistic Join Optimization Deterministic Output From A Probabilistic Approach
  • 2. OVERVIEW • Overview of Probabilistic Algorithms • Problem Statement – Joining Large Spark DataFrames • Intuitive Approach • Bloom Filter Background • Bloom Filter Application • Tuning False-Positives • Performance Results
  • 3. INTRODUCTION • SaaS Marketing Analytics platform for large Advertisers and Ad Agencies • Machine Learning-driven analytics for online and offline media • Lots of data: • Online Media (Impressions, Clicks) • Offline Media (TV, Radio, etc) • Conversions (Online / offline) • Exogenous Data (Weather, Stock, etc.) • Presenters: • Grant Kushida (Head of Engineering) • Vish Mandapaka (Principal Engineer)
  • 4. PROBABILISTIC ALGORITHMS • With Big Data clusters, computing exact answers on huge datasets is possible • But: • Do you really *need* the exact answer? • Approximations are often “good enough” • Approximations are often *much cheaper* • Use hashing, sketches and other math tricks • In general, the trade-off is between • More space (memory) • Lower accuracy • Faster execution time • In some scenarios, can trade-off only space while preserving accuracy and still reducing time
  • 5. COMMON PROBABILISTIC ALGORITHMS • Cardinality Estimation • Counting Uniques (Users, etc.) • Brute-Force: Store every value • HyperLogLog: Use hashes to update a fixed-size buffer • Top-K Estimation • Top Posters, Campaigns, etc • Brute-Force: Aggregate and sort; need to store each value • Count-Min: Use hashes to increment a fixed number of counters • Set Similarity • Document similarity, etc. • Brute-Force: Jaccard Similarity; need to compute intersection • Min-Hash: Use hashes to estimate intersection • Set Membership • De-duping, exactly-once, etc. • Brute-Force: Store every value for exact-match lookup • Bloom Filter: Use hashes to update a fixed-size filter
  • 6. PROBLEM STATEMENT • Join large datasets • Ad Tech: Impressions and Clicks • Marketing Analytics: Media and Conversions • E-Commerce: Visitors and Buyers • Etc. • Data characteristics: • Joined by key (e.g. User ID) • Relatively small overlap • Need to output additional columns from both sets • Unsorted • Problems: • Jobs running out of memory • Jobs taking too long • Too much $$$ to run all the nodes • Causes: • Partition Skew • Excess Shuffling [Diagram: Set A and Set B overlapping in A ∩ B]
  • 7. NAIVE APPROACH: Spark DataFrame Join • Two unsorted DataFrames • Relatively small overlap • Spark Optimizer chooses Sort-Merge Join [Diagram: unsorted Set A and Set B with a small overlap A ∩ B]
  • 8. NAIVE APPROACH: Spark DataFrame Join • Split into partitions by join key • Will shuffle data across nodes • Potentially a lot of data transfer [Diagram: Set A and Set B split into unsorted partitions A0, A1, ..., Ax and B0, B1, ..., Bx]
  • 9. NAIVE APPROACH: Spark DataFrame Join • Sort each partition by join key • Parallelized, but still time-consuming [Diagram: each partition of Set A and Set B is now sorted]
  • 10. NAIVE APPROACH: Spark DataFrame Join • Merge partitions from Set A and Set B • Find common join keys [Diagram: sorted partitions Ai and Bi merged into Ai ∩ Bi for i = 0, 1, ..., x]
  • 11. NAIVE APPROACH: Spark DataFrame Join • Write output to storage • Parallelized • Each partition is sorted [Diagram: merged partitions A0 ∩ B0, A1 ∩ B1, ..., Ax ∩ Bx written to storage]
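The partition, sort, and merge steps in slides 8 to 11 can be sketched in plain Python. This is an illustration of the sort-merge idea only, not Spark's implementation: the function names are hypothetical, rows are assumed to be `(key, value)` tuples with integer keys, and partitions are processed sequentially rather than in parallel.

```python
from collections import defaultdict

def hash_partition(rows, num_parts):
    """Shuffle step: route each (key, value) row to a partition by join key."""
    parts = defaultdict(list)
    for key, value in rows:
        parts[key % num_parts].append((key, value))  # integer keys assumed
    return parts

def merge_sorted(left, right):
    """Merge step: walk two key-sorted lists and emit rows with equal keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

def sort_merge_join(set_a, set_b, num_parts=4):
    parts_a = hash_partition(set_a, num_parts)
    parts_b = hash_partition(set_b, num_parts)
    result = []
    for p in range(num_parts):
        # Every row is sorted here, whether or not it will find a match.
        a_sorted = sorted(parts_a[p], key=lambda row: row[0])
        b_sorted = sorted(parts_b[p], key=lambda row: row[0])
        result.extend(merge_sorted(a_sorted, b_sorted))
    return result

set_a = [(1, "a1"), (2, "a2"), (5, "a5"), (5, "a5b")]
set_b = [(5, "b5"), (7, "b7"), (2, "b2")]
joined = sort_merge_join(set_a, set_b)
```

Note that keys 1 and 7 are shuffled and sorted even though they never match, which is the wasted work the following slides try to avoid.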
  • 12. INTUITIVE OPTIMIZATION • Lots of unnecessary sorting • We want to sort less… • Can we eliminate some data up-front, without compromising the result? [Diagram: the non-overlapping parts of sorted Set A and Set B crossed out, leaving A ∩ B]
  • 13. BLOOM FILTER APPROACH • Approximate Set Membership • Probabilistically remove data from either (or both) sides of the join • Bloom Filters: • Can approximate set membership • Err only on the false-positive side (the filter may report an item that is not actually in the set) • We are going to join anyway, so false-positives are OK [Diagram: Filter A is built from Set A and applied to Set B to write Set B’; Filter B is built from Set B and applied to Set A to write Set A’; joining Set A’ and Set B’ yields A ∩ B, with the false-positives on each side dropped]
  • 14. BLOOM FILTER • Burton Howard Bloom – 1970 • Space-efficient means of testing whether elements are in a set: • Hyphenation • Spell-checking • Filter: • Fixed number of bits (m) • Hashes: • Uniform distribution • Range of m distinct values • Not necessarily cryptographic • Not necessarily different algorithms
  • 15. BLOOM FILTER - CONSTRUCTION Adding a value: • Allocate m bits
  • 16. BLOOM FILTER - CONSTRUCTION Adding a value: • Allocate m bits • Compute k hashes
  • 17. BLOOM FILTER - CONSTRUCTION Adding a value: • Allocate m bits • Compute k hashes • Set k bits in the filter
  • 18. BLOOM FILTER - CONSTRUCTION Adding a value: • Allocate m bits • Compute k hashes • Set k bits in the filter Repeat for all values in the set
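The construction steps above can be sketched as follows. This is a minimal illustration, not the presenters' code: `bit_positions` and `build_filter` are hypothetical names, the m bits are held in a single Python integer, and the k hashes are derived from one SHA-256 digest by double hashing, which is one common way to get k positions without k different algorithms.

```python
import hashlib

def bit_positions(value, m, k):
    """Return k bit positions in [0, m) for value (double hashing on SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
    return [(h1 + i * h2) % m for i in range(k)]

def build_filter(values, m, k):
    """Allocate m bits (all zero) and set k bits for every value."""
    bits = 0
    for value in values:
        for pos in bit_positions(value, m, k):
            bits |= 1 << pos
    return bits

# Hypothetical keys and sizes, chosen only for illustration.
filter_bits = build_filter(["u1", "u2", "u3"], m=64, k=3)
```

Adding the same value twice, or adding values in a different order, produces the same filter, because setting a bit that is already set changes nothing.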
  • 19. BLOOM FILTER - EVALUATION Example: m = 16 bits k = 3 (hex values) Set bits for first item
  • 20. BLOOM FILTER - EVALUATION Example: m = 16 bits k = 3 (hex values) Repeat for all items
  • 21. BLOOM FILTER - EVALUATION Example: m = 16 bits k = 3 (hex values) Set bits for 3 items True Positive: All 3 bits set
  • 22. BLOOM FILTER - EVALUATION Example: m = 16 bits k = 3 (hex values) Set bits for 3 items True Negative: 1 of 3 bits set
  • 23. BLOOM FILTER - EVALUATION Example: m = 16 bits k = 3 (hex values) Set bits for 3 items False Positive: 3 of 3 bits set Not present in initial set
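The three outcomes in the evaluation slides (true positive, true negative, false positive) can be reproduced with a small sketch that uses the same tiny sizes, m = 16 and k = 3. The helper names, the SHA-256 double-hashing scheme, and the `user-`/`other-` keys are assumptions made for this example, so the exact bits differ from the hex values on the slides.

```python
import hashlib

M, K = 16, 3  # same tiny sizes as the slide example

def bit_positions(value, m=M, k=K):
    """Return k bit positions in [0, m) for value (double hashing on SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]

def build_filter(values):
    bits = 0
    for value in values:
        for pos in bit_positions(value):
            bits |= 1 << pos
    return bits

def might_contain(bits, value):
    """True means 'possibly in the set'; False means 'definitely not in the set'."""
    return all((bits >> pos) & 1 for pos in bit_positions(value))

members = ["user-1", "user-2", "user-3"]
bits = build_filter(members)

# Every real member passes: a Bloom filter never yields a false negative.
true_positives = [v for v in members if might_contain(bits, v)]

# Probe keys that were never added and split them by the filter's answer.
probes = [f"other-{i}" for i in range(10_000)]
false_positives = [v for v in probes if might_contain(bits, v)]
true_negatives = [v for v in probes if not might_contain(bits, v)]
```

With only 16 bits the share of false positives among the probes is large; a realistically sized filter keeps it small.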
  • 24. BLOOM FILTER – EVALUATION (DISTRIBUTED) • Evaluation can be distributed and executed in parallel • Filter is: • Small • Immutable • Easy to serialize
  • 25. BLOOM FILTER – CONSTRUCTION (DISTRIBUTED) • Construction can be partially-distributed • But, filters must be consolidated • Consolidate via bitwise OR
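Consolidation by bitwise OR can be checked with a short sketch: filters built independently per partition, then OR-ed together, equal the filter built from the whole set in one pass. The helper names, hashing scheme, partition contents, and sizes are assumptions for illustration, and all partial filters must share the same m, k, and hash functions.

```python
import hashlib
from functools import reduce
from operator import or_

def bit_positions(value, m, k):
    """Return k bit positions in [0, m) for value (double hashing on SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]

def build_filter(values, m, k):
    bits = 0
    for value in values:
        for pos in bit_positions(value, m, k):
            bits |= 1 << pos
    return bits

M, K = 1024, 4
# "u2" appears in two partitions; keys do not need to be co-located.
partitions = [["u1", "u2"], ["u3", "u2"], ["u4"]]

# Build one partial filter per partition, then consolidate with bitwise OR.
partial_filters = [build_filter(part, M, K) for part in partitions]
merged = reduce(or_, partial_filters, 0)

# Reference: a single filter built from all keys at once.
all_keys = [key for part in partitions for key in part]
single = build_filter(all_keys, M, K)
```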
  • 26. FILTERED JOIN APPROACH • Build Bloom Filter from Set A • Evaluate all keys in Set B • Remove any keys not in Set A • Keep a few keys not in Set A (false-positives) • Execute the Join • Remove the false-positives
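The whole filtered-join flow can be sketched end to end in plain Python. This is a single-process illustration under stated assumptions, not the presenters' Spark job: the helper names, the SHA-256 double hashing, the filter sizes, and the `imp-`/`click-` rows are all hypothetical, and the final join is a dictionary lookup rather than a sort-merge join.

```python
import hashlib

def bit_positions(value, m, k):
    """Return k bit positions in [0, m) for value (double hashing on SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]

def build_filter(values, m, k):
    bits = 0
    for value in values:
        for pos in bit_positions(value, m, k):
            bits |= 1 << pos
    return bits

def might_contain(bits, value, m, k):
    return all((bits >> pos) & 1 for pos in bit_positions(value, m, k))

def filtered_join(set_a, set_b, m=4096, k=4):
    """Join two lists of (key, payload) rows, pre-filtering Set B with Filter A."""
    # 1. Build the Bloom filter from the keys of Set A.
    filter_a = build_filter((key for key, _ in set_a), m, k)
    # 2. Keep only rows of Set B whose key might be in Set A.
    set_b_prime = [row for row in set_b if might_contain(filter_a, row[0], m, k)]
    # 3. Execute the exact join; false-positives find no partner and drop out.
    a_by_key = {}
    for key, payload in set_a:
        a_by_key.setdefault(key, []).append(payload)
    joined = [(key, a_payload, b_payload)
              for key, b_payload in set_b_prime
              for a_payload in a_by_key.get(key, [])]
    return joined, set_b_prime

set_a = [(i, f"imp-{i}") for i in range(0, 200)]
set_b = [(i, f"click-{i}") for i in range(150, 2000)]  # overlap: keys 150..199
joined, set_b_prime = filtered_join(set_a, set_b)
```

The output matches an unfiltered join exactly; the filter only changes how many rows of Set B reach the join.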
  • 27. FILTERED JOIN I: BUILD FILTER • Can build in parallel • No need to co-locate keys • Need enough memory to allocate entire filter in each executor [Diagram: unsorted Set A split into partitions A0, A1, ..., Ax]
  • 28. FILTERED JOIN I: BUILD FILTER • Compute hashes and set bits for each key • No impact of setting same key in multiple filters [Diagram: each partition A0, A1, ..., Ax is hashed into its own Filter A0, A1, ..., Ax]
  • 29. FILTERED JOIN I: BUILD FILTER • Merge all the filters • Eventually requires merging into one filter • Can be a bottleneck for large filters [Diagram: Filters A0, A1, ..., Ax combined by OR into a single Filter A]
  • 30. FILTERED JOIN II: APPLY FILTER • Apply the filter to each key in Set B • Need to distribute filter bits to each executor [Diagram: Filter A sent to unsorted partitions B0, B1, ..., Bx of Set B]
  • 31. FILTERED JOIN II: APPLY FILTER • Compute hashes and remove keys [Diagram: Filter A applied to partitions B0, B1, ..., Bx, producing filtered partitions B0’, B1’, ..., Bx’]
  • 32. FILTERED JOIN II: APPLY FILTER • Collect the filtered partitions into Set B’ [Diagram: filtered partitions B0’, B1’, ..., Bx’ written out as Set B’]
  • 33. FILTERED JOIN III: EXECUTE JOIN • Set B’ is now (significantly) smaller: • rows removed = n – (matches + false-positives) • % filtered = 1 – (overlap % + false-positive %) • Join will match all of the keys deterministically • No loss of accuracy from false-positives (they only reduce the filter's efficacy) [Diagram: filtered Set A’ and Set B’ joined into A ∩ B, with the false-positives on each side dropped]
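A quick worked example of the size reduction, as a sketch. It assumes the false-positive rate applies only to the rows of Set B that have no match in Set A; the function name and the input numbers are hypothetical and are not taken from the presenters' results.

```python
def filtered_size(n, overlap, fp_rate):
    """Expected rows left in Set B' after applying Filter A to n rows of Set B."""
    matches = n * overlap                          # rows with a real partner in Set A
    false_positives = n * (1.0 - overlap) * fp_rate  # non-matching rows that slip through
    kept = matches + false_positives
    return kept, 1.0 - kept / n

# 100 million rows, 5% overlap, 1% false-positive rate (illustrative values).
kept, fraction_filtered = filtered_size(100_000_000, overlap=0.05, fp_rate=0.01)
```

Under these assumptions roughly 5.95 million rows remain, so about 94% of Set B never has to be shuffled or sorted.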
  • 34. FALSE POSITIVE TUNING • Important Numbers: • n: number of items • m: number of bits in filter • k: number of hashes • r: false-positive rate • False-Positive Rate: $r = \left(1 - e^{-kn/m}\right)^{k}$ • Given any two, can compute optimal values for the other two • Generally n is known (or estimated) • Many libraries will compute optimal values automatically • Online calculators available
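The tuning relationships can be sketched numerically. The first function is the false-positive formula from the slide; the other two are the standard closed forms that follow from minimizing it, m = −n·ln(r) / (ln 2)² and k = (m/n)·ln 2. The function names and the example inputs are assumptions for illustration.

```python
import math

def false_positive_rate(n, m, k):
    """r = (1 - e^(-k*n/m))^k for n items, m bits, k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_bits(n, r):
    """Smallest m (in bits) that reaches false-positive rate r with the best k."""
    return math.ceil(-n * math.log(r) / (math.log(2) ** 2))

def optimal_hashes(n, m):
    """Number of hashes k that minimizes r for the given n and m."""
    return max(1, round(m / n * math.log(2)))

# Given n (known or estimated) and a target r, derive m and k.
n, target_r = 1_000_000, 0.01
m = optimal_bits(n, target_r)
k = optimal_hashes(n, m)
achieved_r = false_positive_rate(n, m, k)
```

For one million keys and a 1% target this works out to a little under 10 bits per key and 7 hashes, so the filter fits in roughly 1.2 MB.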
  • 35. THANKS! • Bloom Filter References: • https://en.wikipedia.org/wiki/Bloom_filter • https://www.di-mgt.com.au/bloom-calculator.html • Other algorithms: • http://dataconomy.com/2017/04/big-data-101-data-structures/ • https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55