A Bloom filter is a space-efficient probabilistic data structure that represents a set in order to support membership queries. It is an array of m bits, all initially set to 0. Each element of the set is mapped by k hash functions to bit positions in the array, and those bits are set to 1. To query whether an element is in the set, the bits at its hash positions are checked; if any bit is 0, the element is definitely not in the set. Otherwise, the element may be in the set, with a false positive probability that depends on the number of hash functions, the number of elements, and the size of the bit array. Compression can reduce the false positive rate by allowing a larger bit array for a given number of transmitted bits.
2. Agenda
• A Membership Query Problem
• What is Bloom Filter
• Bloom Filter Math Theory
• Compression
• Application Scenario
3. Membership Query Problem
Problem Description
Given an element E, query whether it
belongs to a large set of elements S.
– As fast as possible
– As small as possible
6. What is Bloom Filter
Support approximate set membership
Given a set $S = \{x_1, x_2, \ldots, x_n\}$, construct a data
structure to answer queries of the form “Is
y in S?”
Data structure should be:
–Fast (Faster than searching through S).
–Small (Smaller than explicit representation).
To obtain speed and size improvements,
allow some probability of error.
–False positives: y ∉ S but we report y ∈ S
–False negatives: y ∈ S but we report y ∉ S
7. What is Bloom Filter
Start with an m bit array, filled with 0s.
B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hash each item $x_j$ in S k times. If $H_i(x_j) = a$, set B[a] = 1.
B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
To check if y is in S, check B at $H_i(y)$. All k values must be 1.
B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive; all k values are 1, but y is not in S.
B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
n items, m = cn bits, k hash functions
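The insert and query steps above can be sketched in Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter` is hypothetical, and the k hash functions $H_i$ are simulated by salting SHA-256 with the index i, which is an assumption made only for this example.

```python
import hashlib


class BloomFilter:
    """An m-bit array B with k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Assumption: H_i(item) is simulated by hashing "i:item" with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # If H_i(x) = a, set B[a] = 1.
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k positions must be 1; a single 0 means "definitely not in S".
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=16, k=3)
for x in ("apple", "banana"):
    bf.add(x)
print("apple" in bf)   # an inserted item is always reported as present
print("cherry" in bf)  # usually False, but may be a false positive
```

An inserted item can never be reported as absent, because its k bits are only ever set, never cleared.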
8. What is Bloom Filter
[Figure: False positive. Elements A and B are mapped by hash1, hash2, and hash3 to positions in the bit array.]
9. Bloom Filter Math Theory
Pr(specific bit of filter is 0) is
$p' \equiv (1 - 1/m)^{kn} \approx e^{-kn/m} \equiv p$
If $\rho$ is the fraction of 0 bits in the filter, then the false
positive probability is
$(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k$
Approximations are valid as $\rho$ is concentrated
around $E[\rho]$.
– Martingale argument suffices.
Find the optimum at $k = (\ln 2)\,m/n$ by calculus.
– So the optimal fpp is about $(0.6185)^{m/n}$
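The "by calculus" step can be filled in as follows. This is one standard route to the optimum, offered as a sketch and not necessarily the derivation the slides had in mind; it substitutes $p = e^{-kn/m}$ so that the false positive probability depends on p alone.

```latex
\begin{align*}
f &= \left(1 - e^{-kn/m}\right)^{k}, \qquad
  p = e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\ln p \\
\ln f &= k \ln(1 - p) = -\frac{m}{n}\,\ln p\,\ln(1 - p) \\
&\text{$\ln p\,\ln(1-p)$ is symmetric under $p \leftrightarrow 1-p$
  and is maximal at $p = \tfrac12$, so $\ln f$ is minimal there:} \\
k &= \frac{m}{n}\ln 2, \qquad
  f = \left(\tfrac12\right)^{k} = \left(2^{-\ln 2}\right)^{m/n}
  \approx (0.6185)^{m/n}
\end{align*}
```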
10. Bloom Filter Math Theory
[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10) for m/n = 8. Opt k = 8 ln 2 = 5.45...]
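The curve in this figure can be reproduced numerically from the approximation $f \approx (1 - e^{-k/c})^k$ with c = m/n = 8. The short sketch below tabulates it for integer k; the function name `false_positive_rate` is chosen for this example only.

```python
import math


def false_positive_rate(k: int, bits_per_item: float) -> float:
    """Approximate false positive rate (1 - e^(-k/c))^k, with c = m/n."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


c = 8  # m/n, bits per item, as in the figure
rates = {k: false_positive_rate(k, c) for k in range(1, 11)}
for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")

# The real-valued optimum is k = c * ln 2; in practice k must be an integer.
best_k = min(rates, key=rates.get)
print("best integer k:", best_k, " real-valued optimum:", c * math.log(2))
```

Since k must be an integer, the filter uses one of the two neighbours of 8 ln 2; their rates differ only slightly.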
11. Bloom Filter Compression
Using Bloom filters in network transmission
When a Bloom filter is sent as a message, it should be
small enough to be transmitted over the network.
Compressing a bit vector is easy.
Arithmetic coding gets close to entropy.
Can Bloom filters be compressed?
12. Bloom Filter Compression
• Optimize to minimize false positive.
$p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
$f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
$k = (m \ln 2)/n$
• At k = m (ln 2) /n, p = 1/2.
• Bloom filter looks like a random string.
– Can’t compress it.
– $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
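The incompressibility claim can be checked numerically. The sketch below evaluates the entropy function at $p \approx e^{-kn/m}$ for a few values of k; the helper names and the sample sizes n = 1000 and m = 8000 are assumptions made for illustration.

```python
import math


def binary_entropy(p: float) -> float:
    """H(p) in bits; H(0) = H(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def zero_bit_probability(k: float, n: int, m: int) -> float:
    """p ~ e^(-kn/m): probability that a given bit is still 0."""
    return math.exp(-k * n / m)


n, m = 1000, 8000                  # hypothetical sizes
k_opt = m * math.log(2) / n        # real-valued optimum
for k in (1, 2, 4, k_opt):
    p = zero_bit_probability(k, n, m)
    print(f"k={k:.3f}  p={p:.3f}  H(p)={binary_entropy(p):.3f} bits per bit")
```

At the optimal k the entropy reaches 1 bit per filter bit, so an ideal compressor saves nothing; for smaller k the filter is sparser and H(p) drops below 1.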
13. Bloom Filter Compression
By allowing a larger uncompressed size (more storage),
we can achieve compression.
• Assumption: optimal compressor, $z = mH(p)$.
– H(p) is the entropy function; optimally we get
H(p) compressed bits per original table bit.
– Arithmetic coding is close to optimal.
• Optimization: Given z bits for the compressed
filter and n elements, choose table size m
and number of hash functions k to
minimize f, where
$p \approx e^{-kn/m}$; $f \approx (1 - e^{-kn/m})^k$; $z \approx mH(p)$
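The optimization can be explored with a small numerical search. The sketch below is a plain grid search over m and k under the optimal-compressor assumption $z \approx mH(p)$; it is not the method used in the slides, and the function name, the grid parameters, and the sample values n = 1000, z = 8000 are all assumptions made for this example.

```python
import math


def binary_entropy(p: float) -> float:
    """H(p) in bits; H(0) = H(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def best_compressed_filter(n, z, max_k=10, max_expansion=16, steps=64):
    """Grid-search (m, k) minimizing f subject to m * H(p) <= z.

    m ranges from z (no compression) up to max_expansion * z.
    Returns a tuple (f, m, k).
    """
    best = None
    for k in range(1, max_k + 1):
        for step in range(steps + 1):
            m = int(z * (1 + (max_expansion - 1) * step / steps))
            p = math.exp(-k * n / m)          # Pr[bit is 0]
            if m * binary_entropy(p) > z:     # would not fit in z bits
                continue
            f = (1.0 - p) ** k                # false positive rate
            if best is None or f < best[0]:
                best = (f, m, k)
    return best


n, z = 1000, 8000  # hypothetical: 8 compressed bits per element
f, m, k = best_compressed_filter(n, z)
uncompressed = min((1 - math.exp(-j * n / z)) ** j for j in range(1, 11))
print(f"compressed:   m={m}, k={k}, f={f:.4f}")
print(f"uncompressed: m={z}, best f={uncompressed:.4f}")
```

Because m = z is always a feasible point of the search, the result can never be worse than the uncompressed filter; a finer grid or a root finder for $mH(p) = z$ would tighten the answer.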
15. Bloom Filter Compression
Conclusion
• At k = m (ln 2) /n, false positives are
maximized with a compressed Bloom
filter.
– Best case without compression is worst case
with compression; compression always
helps.
– Side benefit: Use fewer hash functions with
compression; possible speedup.
16. Application Scenario
Speed up answers in a key-value-like system
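[Figure: each lookup passes through an in-memory filter before reaching storage.]
– key1: filter answers no; no disk access
– key2: filter answers yes; disk access succeeds
– key3: filter answers yes; disk access fails (false positive)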
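The lookup path in this scenario can be sketched as follows. The class `FilteredStore` is hypothetical: a Python dict stands in for slow disk storage, a counter records how many disk accesses are paid for, and the hash functions are simulated with salted SHA-256, all of which are assumptions made for illustration.

```python
import hashlib


class FilteredStore:
    """Key-value store with an in-memory Bloom filter in front of 'disk'."""

    def __init__(self, m: int = 1024, k: int = 3) -> None:
        self.m, self.k = m, k
        self.bits = bytearray(m)   # the in-memory filter
        self.disk = {}             # stand-in for slow storage
        self.disk_reads = 0        # number of (simulated) disk accesses

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def put(self, key: str, value: str) -> None:
        self.disk[key] = value
        for a in self._positions(key):
            self.bits[a] = 1

    def get(self, key: str):
        if not all(self.bits[a] for a in self._positions(key)):
            return None            # filter says "no": skip the disk entirely
        self.disk_reads += 1       # filter says "yes": pay for a disk access
        return self.disk.get(key)  # None here means a false positive


store = FilteredStore()
print(store.get("key1"), store.disk_reads)  # miss answered from memory
store.put("key2", "value2")
print(store.get("key2"), store.disk_reads)  # hit after one disk access
```

Only the false positive case (filter says yes, disk says no) wastes a disk access, which is why a low false positive rate matters here.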