Applying Probabilistic Algorithms by Grant Kushida, Head of Engineering, Conversion Logic
We have seen dramatic improvements in job runtimes and associated costs by applying probabilistic algorithms where appropriate. With big-data jobs running at scale, computing exact answers is often overkill; instead, we can often answer the question "accurately enough" with a reasonably correct approximation. For our use case (marketing analytics) we have seen benefit from: - Approximate set membership (Bloom filter) - Approximate cardinality (HyperLogLog). This talk focuses on use-cases, considerations and impact, not on the details of the algorithms or their implementation.
3. INTRODUCTION
• SaaS Marketing Analytics platform for large Advertisers and Ad Agencies
• Machine Learning-driven analytics for online and offline media
• Lots of data:
• Online Media (Impressions, Clicks)
• Offline Media (TV, Radio, etc.)
• Conversions (Online / Offline)
• Exogenous Data (Weather, Stock, etc.)
• Presenters:
• Grant Kushida (Head of Engineering)
• Vish Mandapaka (Principal Engineer)
4. PROBABILISTIC ALGORITHMS
• With Big Data clusters, computing exact answers on huge datasets is possible
• But:
• Do you really *need* the exact answer?
• Approximations are often “good enough”
• Approximations are often *much cheaper*
• Use hashing, sketches and other math tricks
• In general, the trade-off is between:
• More space (memory)
• Lower accuracy
• Faster execution time
• In some scenarios, you can trade off only space while preserving accuracy and still reducing time
5. COMMON PROBABILISTIC ALGORITHMS
• Cardinality Estimation
• Counting Uniques (Users, etc.)
• Brute-Force: Store every value
• HyperLogLog: Use hashes to update a fixed-size buffer
• Top-K Estimation
• Top Posters, Campaigns, etc.
• Brute-Force: Aggregate and sort; need to store each value
• Count-Min: Use hashes to increment a fixed number of counters
• Set Similarity
• Document similarity, etc.
• Brute-Force: Jaccard Similarity; need to compute intersection
• MinHash: Use hashes to estimate intersection
• Set Membership
• De-duping, exactly-once, etc.
• Brute-Force: Store every value for exact-match lookup
• Bloom Filter: Use hashes to update a fixed-size filter
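To make the fixed-size-buffer idea concrete, here is a minimal HyperLogLog sketch in Python (the class and parameter names are illustrative, not from any particular library):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: 2**p fixed-size registers, ~1.04/sqrt(2**p) relative error."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash; low p bits choose a register, the remaining bits feed the rank
        h = int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)
        w = h >> self.p
        rank = (64 - self.p) - w.bit_length() + 1   # leading-zero count + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:             # small-range correction
            e = self.m * math.log(self.m / zeros)
        return int(e)
```

With p = 10 the sketch uses 1,024 registers regardless of how many uniques it sees, at roughly ±3% relative error; duplicates never change the estimate.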
6. PROBLEM STATEMENT
• Join large datasets:
• Ad Tech: Impressions and Clicks
• Marketing Analytics: Media and Conversions
• E-Commerce: Visitors and Buyers
• Etc.
[Diagram: Venn diagram of SET A and SET B overlapping in A ∩ B]
• Data characteristics:
• Joined by key (e.g. User ID)
• Relatively small overlap
• Need to output additional columns from both sets
• Un-sorted
• Problems:
• Jobs running out of memory
• Jobs taking too long
• Too much $$$ to run all the nodes
• Causes:
• Partition Skew
• Excess Shuffling
7. NAIVE APPROACH: Spark DataFrame Join
• Two unsorted DataFrames
• Relatively small overlap
• Spark Optimizer chooses Sort-Merge Join
[Diagram: two unsorted sets, SET A and SET B, overlapping in A ∩ B]
8. NAIVE APPROACH: Spark DataFrame Join
• Split into partitions by join key
• Will shuffle data across nodes
• Potentially a lot of data transfer
[Diagram: SET A and SET B split into unsorted partitions PART A0, A1 … Ax and PART B0, B1 … Bx]
9. NAIVE APPROACH: Spark DataFrame Join
• Sort each partition by join key
• Parallelized, but still time-consuming
[Diagram: partitions PART A0 … Ax and PART B0 … Bx, each now sorted]
10. NAIVE APPROACH: Spark DataFrame Join
• Merge partitions from Set A and Set B
• Find common join keys
[Diagram: sorted partition pairs merged into A0 ∩ B0, A1 ∩ B1 … Ax ∩ Bx]
11. NAIVE APPROACH: Spark DataFrame Join
• Write output to storage
• Parallelized
• Each partition is sorted
[Diagram: merged intersections A0 ∩ B0 … Ax ∩ Bx written out as A ∩ B]
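The steps in these slides can be sketched as a toy single-machine sort-merge join in Python (illustrative only; Spark's implementation is distributed and far more involved):

```python
def sort_merge_join(a, b):
    """Toy sort-merge join: a and b are lists of (key, value) pairs.
    Returns (key, a_value, b_value) for every matching key pair."""
    a, b = sorted(a), sorted(b)       # the expensive sort step
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):  # the merge step
        ka, kb = a[i][0], b[j][0]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # emit every pair in b sharing this key, then advance a
            j2 = j
            while j2 < len(b) and b[j2][0] == ka:
                out.append((ka, a[i][1], b[j2][1]))
                j2 += 1
            i += 1
    return out
```

Note that both inputs are sorted in full even though only the small overlap survives the merge, which is exactly the waste the next slides attack.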
12. INTUITIVE OPTIMIZATION
• Lots of unnecessary sorting
• We want to sort less…
• Can we eliminate some data up-front, without compromising the result?
[Diagram: sorted full join of SET A and SET B vs. a join where non-matching rows are eliminated before sorting]
13. BLOOM FILTER APPROACH
• Approximate Set Membership
• Probabilistically remove data from either (or both) sides of the join
• Bloom Filters:
• Can approximate set membership
• Err only on the False Positive side (item is not actually in the set)
• We are going to join anyway, so false-positives are OK
[Diagram: build FILTER A from SET A and apply it to produce SET B’; build FILTER B from SET B and apply it to produce SET A’; joining A’ and B’ yields A ∩ B plus a few false positives on each side]
14. BLOOM FILTER
• Burton Howard Bloom – 1970
• Space-efficient means of testing set membership
• Original applications: hyphenation, spell-checking
Filter:
• Fixed number of bits (m)
Hashes:
• Uniform distribution over a range of m distinct values
• Not necessarily cryptographic
• Not necessarily different algorithms
15. BLOOM FILTER - CONSTRUCTION
Adding a value:
• Allocate m bits
• Compute k hashes
• Set k bits in the filter
Repeat for all values in the set
19. BLOOM FILTER - EVALUATION
Example: m = 16 bits, k = 3 (hex values)
• Set bits for each of 3 items
• True Positive: all 3 bits set
• True Negative: only 1 of 3 bits set
• False Positive: all 3 bits set, but the item was not in the initial set
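Putting construction and evaluation together, a minimal Bloom filter might look like this Python sketch (double hashing to derive the k indices is one common choice; all names here are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions via double hashing."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = 0                     # a Python int used as a bitset

    def _indices(self, value):
        d = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        # derive k indices from two hashes (Kirsch–Mitzenmacher trick)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        for idx in self._indices(value):
            self.bits |= 1 << idx

    def might_contain(self, value):
        # True may be a false positive; False is always correct
        return all(self.bits >> idx & 1 for idx in self._indices(value))
```

A single SHA-256 digest sliced into two 64-bit halves stands in for k independent hash functions, matching the slide's point that the hashes need not be cryptographic or even different algorithms.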
24. BLOOM FILTER – EVALUATION (DISTRIBUTED)
• Evaluation can be distributed and executed in parallel
• Filter is:
• Small
• Immutable
• Easy to serialize
25. BLOOM FILTER – CONSTRUCTION (DISTRIBUTED)
• Construction can be partially distributed
• But, filters must be consolidated
• Consolidate via bitwise OR
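The bitwise-OR consolidation can be illustrated with Python ints as bitsets (toy values; in practice the partial filters must share the same m and hash functions):

```python
# Three partial filters built on different partitions (same m, same hashes)
filter_a0 = 0b0010_0100_0001
filter_a1 = 0b1000_0100_0000
filter_ax = 0b0010_0000_1000

# Bitwise OR yields the same bits as building one filter from all the data,
# which is why construction parallelizes cleanly
merged = filter_a0 | filter_a1 | filter_ax
assert merged == 0b1010_0100_1001
```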
26. FILTERED JOIN APPROACH
• Build Bloom Filter from Set A
• Evaluate all keys in Set B
• Remove any keys not in Set A
• Keep a few keys not actually in Set A (false-positives)
• Execute the Join
• Remove the false-positives (they find no match in the join)
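The three steps above, end to end, as a toy single-machine Python sketch (the helper names and default parameters are our own, not from Spark or any library):

```python
import hashlib

def bloom_indices(value, m, k):
    # k bit positions via double hashing on one SHA-256 digest
    d = hashlib.sha256(str(value).encode()).digest()
    h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big")
    return [(h1 + i * h2) % m for i in range(k)]

def filtered_join(a, b, m=4096, k=3):
    """a, b: lists of (key, value). Returns (key, a_val, b_val) matches."""
    # I: build the filter from Set A's keys
    bits = 0
    for key, _ in a:
        for idx in bloom_indices(key, m, k):
            bits |= 1 << idx
    # II: apply the filter to Set B; keeps matches plus a few false positives
    b_filtered = [(key, v) for key, v in b
                  if all(bits >> i & 1 for i in bloom_indices(key, m, k))]
    # III: exact join on the (much smaller) filtered set;
    #      false positives simply find no partner and drop out here
    a_by_key = {}
    for key, v in a:
        a_by_key.setdefault(key, []).append(v)
    return [(key, av, bv) for key, bv in b_filtered
            for av in a_by_key.get(key, [])]
```

The exact join in step III is what makes the end result deterministic despite the probabilistic filter in step II.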
27. FILTERED JOIN I: BUILD FILTER
• Can build in parallel
• No need to co-locate keys
• Need enough memory to allocate the entire filter in each executor
[Diagram: SET A as unsorted partitions PART A0, A1 … Ax]
28. FILTERED JOIN I: BUILD FILTER
• Compute hashes and set bits for each key
• No impact from setting the same key in multiple filters
[Diagram: each partition PART A0 … Ax hashed into its own FILTER A0 … Ax]
29. FILTERED JOIN I: BUILD FILTER
• Merge all the filters
• Eventually requires merging into one filter
• Can be a bottleneck for large filters
[Diagram: FILTER A0 … Ax combined via bitwise OR into a single FILTER A]
30. FILTERED JOIN II: APPLY FILTER
• Apply the filter to each key in Set B
• Need to distribute the filter bits to each executor
[Diagram: FILTER A broadcast to unsorted SET B partitions PART B0 … Bx]
31. FILTERED JOIN II: APPLY FILTER
• Compute hashes and remove keys
[Diagram: each partition PART B0 … Bx filtered into PART B0’ … Bx’]
32. FILTERED JOIN II: APPLY FILTER
• Collect Set B’
[Diagram: filtered partitions PART B0’ … Bx’ written out as SET B’]
33. FILTERED JOIN III: EXECUTE JOIN
Set B’ is now (significantly) smaller:
• n’ = matches + false-positives
• % filtered = 1 – (overlap % + false-positive %)
The join will match all of the remaining keys deterministically
No loss of accuracy from the false-positives, only a small loss of efficiency
[Diagram: filtered sets A’ and B’, each retaining a few false positives, joined into A ∩ B]
34. FALSE POSITIVE TUNING
• Important Numbers:
• n: number of items
• m: number of bits in filter
• k: number of hashes
• r: false-positive rate
• False-Positive Rate: r = (1 − e^(−kn/m))^k
• Given any two, can compute optimal values for the other two
• Generally n is known (or estimated)
• Many libraries will compute optimal values automatically
• Online calculators available
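Given n and a target r, the standard sizing formulas m = −n ln r / (ln 2)² and k = (m/n) ln 2 can be computed with a small helper (our own illustration of what the libraries and online calculators do):

```python
import math

def bloom_parameters(n, r):
    """Optimal bit count m and hash count k for n items at false-positive rate r."""
    m = math.ceil(-n * math.log(r) / (math.log(2) ** 2))  # bits in the filter
    k = max(1, round(m / n * math.log(2)))                # hash functions
    return m, k

# e.g. 1,000,000 items at a 1% false-positive rate:
# roughly 9.6 million bits (about 1.2 MB) and 7 hashes
m, k = bloom_parameters(1_000_000, 0.01)
```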