Applying Probabilistic Algorithms by Grant Kushida, Head of Engineering, Conversion Logic
We have seen dramatic improvements in job runtimes and associated costs by applying probabilistic algorithms where appropriate. With big-data jobs running at scale, computing exact answers is often overkill; instead, we can often answer the question "accurately enough" with a reasonably correct approximation. For our use case (marketing analytics) we have seen benefit from: - Approximate set membership (Bloom filter) - Approximate cardinality (HyperLogLog). This talk focuses on use-cases, considerations and impact, not on the details of the algorithms or their implementation.
3. INTRODUCTION
• SaaS Marketing Analytics platform for large Advertisers and Ad Agencies
• Machine Learning-driven analytics for online and offline media
• Lots of data:
• Online Media (Impressions, Clicks)
• Offline Media (TV, Radio, etc.)
• Conversions (Online / Offline)
• Exogenous Data (Weather, Stock, etc.)
• Presenters:
• Grant Kushida (Head of Engineering)
• Vish Mandapaka (Principal Engineer)
4. PROBABILISTIC ALGORITHMS
• With Big Data clusters, computing exact answers on huge datasets is possible
• But:
• Do you really *need* the exact answer?
• Approximations are often “good enough”
• Approximations are often *much cheaper*
• Use hashing, sketches and other math tricks
• In general, the trade-off is between:
• More space (memory)
• Lower accuracy
• Faster execution time
• In some scenarios, you can trade off only space while preserving accuracy and still reducing time
5. COMMON PROBABILISTIC ALGORITHMS
• Cardinality Estimation
• Counting Uniques (Users, etc.)
• Brute-Force: Store every value
• HyperLogLog: Use hashes to update a fixed-size buffer
• Top-K Estimation
• Top Posters, Campaigns, etc.
• Brute-Force: Aggregate and sort; need to store each value
• Count-Min: Use hashes to increment a fixed number of counters
• Set Similarity
• Document similarity, etc.
• Brute-Force: Jaccard Similarity; need to compute intersection
• MinHash: Use hashes to estimate intersection
• Set Membership
• De-duping, exactly-once, etc.
• Brute-Force: Store every value for exact-match lookup
• Bloom Filter: Use hashes to update a fixed-size filter
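To make the fixed-size-buffer idea concrete, here is a minimal HyperLogLog sketch in Python (the class and parameter names are illustrative, not from any particular library):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: 2**p fixed-size registers, ~1.04/sqrt(2**p) relative error."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash; low p bits choose a register, the remaining bits feed the rank
        h = int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)
        w = h >> self.p
        rank = (64 - self.p) - w.bit_length() + 1   # leading-zero count + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:             # small-range correction
            e = self.m * math.log(self.m / zeros)
        return int(e)
```

With p = 10 the sketch uses 1,024 registers regardless of how many uniques it sees, at roughly ±3% relative error; duplicates never change the estimate.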
6. PROBLEM STATEMENT
• Join large datasets:
• Ad Tech: Impressions and Clicks
• Marketing Analytics: Media and Conversions
• E-Commerce: Visitors and Buyers
• Etc.
[Diagram: Venn diagram of SET A and SET B overlapping in A ∩ B]
• Data characteristics:
• Joined by key (e.g. User ID)
• Relatively small overlap
• Need to output additional columns from both sets
• Un-sorted
• Problems:
• Jobs running out of memory
• Jobs taking too long
• Too much $$$ to run all the nodes
• Causes:
• Partition Skew
• Excess Shuffling
7. NAIVE APPROACH: Spark DataFrame Join
• Two unsorted DataFrames
• Relatively small overlap
• Spark Optimizer chooses Sort-Merge Join
[Diagram: two unsorted sets, SET A and SET B, overlapping in A ∩ B]
8. NAIVE APPROACH: Spark DataFrame Join
• Split into partitions by join key
• Will shuffle data across nodes
• Potentially a lot of data transfer
[Diagram: SET A and SET B split into unsorted partitions PART A0, A1 … Ax and PART B0, B1 … Bx]
9. NAIVE APPROACH: Spark DataFrame Join
• Sort each partition by join key
• Parallelized, but still time-consuming
[Diagram: partitions PART A0 … Ax and PART B0 … Bx, each now sorted]
10. NAIVE APPROACH: Spark DataFrame Join
• Merge partitions from Set A and Set B
• Find common join keys
[Diagram: sorted partition pairs merged into A0 ∩ B0, A1 ∩ B1 … Ax ∩ Bx]
11. NAIVE APPROACH: Spark DataFrame Join
• Write output to storage
• Parallelized
• Each partition is sorted
[Diagram: merged intersections A0 ∩ B0 … Ax ∩ Bx written out as A ∩ B]
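The steps in these slides can be sketched as a toy single-machine sort-merge join in Python (illustrative only; Spark's implementation is distributed and far more involved):

```python
def sort_merge_join(a, b):
    """Toy sort-merge join: a and b are lists of (key, value) pairs.
    Returns (key, a_value, b_value) for every matching key pair."""
    a, b = sorted(a), sorted(b)       # the expensive sort step
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):  # the merge step
        ka, kb = a[i][0], b[j][0]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # emit every pair in b sharing this key, then advance a
            j2 = j
            while j2 < len(b) and b[j2][0] == ka:
                out.append((ka, a[i][1], b[j2][1]))
                j2 += 1
            i += 1
    return out
```

Note that both inputs are sorted in full even though only the small overlap survives the merge, which is exactly the waste the next slides attack.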
12. INTUITIVE OPTIMIZATION
• Lots of unnecessary sorting
• We want to sort less…
• Can we eliminate some data up-front, without compromising the result?
[Diagram: sorted full join of SET A and SET B vs. a join where non-matching rows are eliminated before sorting]
13. BLOOM FILTER APPROACH
• Approximate Set Membership
• Probabilistically remove data from either (or both) sides of the join
• Bloom Filters:
• Can approximate set membership
• Err only on the False Positive side (item is not actually in the set)
• We are going to join anyway, so false-positives are OK
[Diagram: build FILTER A from SET A and apply it to produce SET B’; build FILTER B from SET B and apply it to produce SET A’; joining A’ and B’ yields A ∩ B plus a few false positives on each side]
14. BLOOM FILTER
• Burton Howard Bloom – 1970
• Space-efficient means of testing set membership
• Original applications: hyphenation, spell-checking
Filter:
• Fixed number of bits (m)
Hashes:
• Uniform distribution over a range of m distinct values
• Not necessarily cryptographic
• Not necessarily different algorithms
15. BLOOM FILTER - CONSTRUCTION
Adding a value:
• Allocate m bits
• Compute k hashes
• Set k bits in the filter
Repeat for all values in the set
19. BLOOM FILTER - EVALUATION
Example: m = 16 bits, k = 3 (hex values)
• Set bits for each of 3 items
• True Positive: all 3 bits set
• True Negative: only 1 of 3 bits set
• False Positive: all 3 bits set, but the item was not in the initial set
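Putting construction and evaluation together, a minimal Bloom filter might look like this Python sketch (double hashing to derive the k indices is one common choice; all names here are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions via double hashing."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = 0                     # a Python int used as a bitset

    def _indices(self, value):
        d = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        # derive k indices from two hashes (Kirsch–Mitzenmacher trick)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        for idx in self._indices(value):
            self.bits |= 1 << idx

    def might_contain(self, value):
        # True may be a false positive; False is always correct
        return all(self.bits >> idx & 1 for idx in self._indices(value))
```

A single SHA-256 digest sliced into two 64-bit halves stands in for k independent hash functions, matching the slide's point that the hashes need not be cryptographic or even different algorithms.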
24. BLOOM FILTER – EVALUATION (DISTRIBUTED)
• Evaluation can be distributed and executed in parallel
• Filter is:
• Small
• Immutable
• Easy to serialize
25. BLOOM FILTER – CONSTRUCTION (DISTRIBUTED)
• Construction can be partially distributed
• But, filters must be consolidated
• Consolidate via bitwise OR
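The bitwise-OR consolidation can be illustrated with Python ints as bitsets (toy values; in practice the partial filters must share the same m and hash functions):

```python
# Three partial filters built on different partitions (same m, same hashes)
filter_a0 = 0b0010_0100_0001
filter_a1 = 0b1000_0100_0000
filter_ax = 0b0010_0000_1000

# Bitwise OR yields the same bits as building one filter from all the data,
# which is why construction parallelizes cleanly
merged = filter_a0 | filter_a1 | filter_ax
assert merged == 0b1010_0100_1001
```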
26. FILTERED JOIN APPROACH
• Build Bloom Filter from Set A
• Evaluate all keys in Set B
• Remove any keys not in Set A
• Keep a few keys not actually in Set A (false-positives)
• Execute the Join
• Remove the false-positives (they find no match in the join)
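The three steps above, end to end, as a toy single-machine Python sketch (the helper names and default parameters are our own, not from Spark or any library):

```python
import hashlib

def bloom_indices(value, m, k):
    # k bit positions via double hashing on one SHA-256 digest
    d = hashlib.sha256(str(value).encode()).digest()
    h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big")
    return [(h1 + i * h2) % m for i in range(k)]

def filtered_join(a, b, m=4096, k=3):
    """a, b: lists of (key, value). Returns (key, a_val, b_val) matches."""
    # I: build the filter from Set A's keys
    bits = 0
    for key, _ in a:
        for idx in bloom_indices(key, m, k):
            bits |= 1 << idx
    # II: apply the filter to Set B; keeps matches plus a few false positives
    b_filtered = [(key, v) for key, v in b
                  if all(bits >> i & 1 for i in bloom_indices(key, m, k))]
    # III: exact join on the (much smaller) filtered set;
    #      false positives simply find no partner and drop out here
    a_by_key = {}
    for key, v in a:
        a_by_key.setdefault(key, []).append(v)
    return [(key, av, bv) for key, bv in b_filtered
            for av in a_by_key.get(key, [])]
```

The exact join in step III is what makes the end result deterministic despite the probabilistic filter in step II.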
27. FILTERED JOIN I: BUILD FILTER
• Can build in parallel
• No need to co-locate keys
• Need enough memory to allocate the entire filter in each executor
[Diagram: SET A as unsorted partitions PART A0, A1 … Ax]
28. FILTERED JOIN I: BUILD FILTER
• Compute hashes and set bits for each key
• No impact from setting the same key in multiple filters
[Diagram: each partition PART A0 … Ax hashed into its own FILTER A0 … Ax]
29. FILTERED JOIN I: BUILD FILTER
• Merge all the filters
• Eventually requires merging into one filter
• Can be a bottleneck for large filters
[Diagram: FILTER A0 … Ax combined via bitwise OR into a single FILTER A]
30. FILTERED JOIN II: APPLY FILTER
• Apply the filter to each key in Set B
• Need to distribute the filter bits to each executor
[Diagram: FILTER A broadcast to unsorted SET B partitions PART B0 … Bx]
31. FILTERED JOIN II: APPLY FILTER
• Compute hashes and remove keys
[Diagram: each partition PART B0 … Bx filtered into PART B0’ … Bx’]
32. FILTERED JOIN II: APPLY FILTER
• Collect Set B’
[Diagram: filtered partitions PART B0’ … Bx’ written out as SET B’]
33. FILTERED JOIN III: EXECUTE JOIN
Set B’ is now (significantly) smaller:
• n’ = matches + false-positives
• % filtered = 1 – (overlap % + false-positive %)
The join will match all of the remaining keys deterministically
No loss of accuracy from the false-positives, only a small loss of efficiency
[Diagram: filtered sets A’ and B’, each retaining a few false positives, joined into A ∩ B]
34. FALSE POSITIVE TUNING
• Important Numbers:
• n: number of items
• m: number of bits in filter
• k: number of hashes
• r: false-positive rate
• False-Positive Rate: r = (1 − e^(−kn/m))^k
• Given any two, can compute optimal values for the other two
• Generally n is known (or estimated)
• Many libraries will compute optimal values automatically
• Online calculators available
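Given n and a target r, the standard sizing formulas m = −n ln r / (ln 2)² and k = (m/n) ln 2 can be computed with a small helper (our own illustration of what the libraries and online calculators do):

```python
import math

def bloom_parameters(n, r):
    """Optimal bit count m and hash count k for n items at false-positive rate r."""
    m = math.ceil(-n * math.log(r) / (math.log(2) ** 2))  # bits in the filter
    k = max(1, round(m / n * math.log(2)))                # hash functions
    return m, k

# e.g. 1,000,000 items at a 1% false-positive rate:
# roughly 9.6 million bits (about 1.2 MB) and 7 hashes
m, k = bloom_parameters(1_000_000, 0.01)
```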