Bloom filter

xuanzi.wp@taobao.com
2011-11-18

Agenda

• A Membership Query Problem

• What is a Bloom Filter

• Bloom Filter Math Theory

• Compression

• Application Scenario

Membership Query Problem

Problem Description

Given an element E, determine whether it belongs to a large set S.
– Queries should be as fast as possible
– The data structure should be as small as possible


Some Solutions
• Hash table
– Fast, but a large data structure
• Bitmap index
– Can it be smaller?


Tradeoff Solutions
To obtain speed and size improvements,
allow some probability of error.

Bloom Filter


What is a Bloom Filter
• Supports approximate set membership
• Given a set $S = \{x_1, x_2, \ldots, x_n\}$, construct a data structure to answer queries of the form “Is y in S?”
• The data structure should be:
– Fast (faster than searching through S).
– Small (smaller than an explicit representation).
• To obtain speed and size improvements, allow some probability of error.
– False positives: y ∉ S but we report y ∈ S
– False negatives: y ∈ S but we report y ∉ S (a standard Bloom filter never produces these)

Start with an m-bit array, filled with 0s.

B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hash each item $x_j$ in S k times. If $H_i(x_j) = a$, set B[a] = 1.

B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

To check if y is in S, check B at $H_i(y)$ for i = 1, …, k. All k values must be 1.

B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

A false positive is possible: all k values are 1, but y is not in S.
B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

n items, m = cn bits, k hash functions
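The insert and query steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter` is hypothetical, and deriving the k hash functions by salting SHA-256 with the index i is an assumption made for the example (the slides do not say which hash functions to use).

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions H_1..H_k."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _positions(self, item: str):
        # Assumption: H_i(x) is obtained by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set B[a] = 1 for every a = H_i(item).
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k positions must be 1; a single 0 proves the item was never added.
        return all(self.bits[a] for a in self._positions(item))


# n = 16 items with m = 8n bits and k = 5 hash functions.
members = [f"x{j}" for j in range(16)]
bf = BloomFilter(m=8 * len(members), k=5)
for x in members:
    bf.add(x)
```

Every inserted item is reported as present. An item that was never inserted is usually reported as absent, but may occasionally be reported as present, which is the false positive case.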

False Positive

[Figure: hash1, hash2, and hash3 map elements A and B to positions in a bit array.]

Bloom Filter Math Theory
• Pr(specific bit of filter is 0) is
$p' \equiv (1 - 1/m)^{kn} \approx e^{-kn/m} \equiv p$
• If $\rho$ is the fraction of 0 bits in the filter, then the false positive probability is
$(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k$
• The approximations are valid as $\rho$ is concentrated around $E[\rho]$.
– Martingale argument suffices.
• Find the optimum at $k = (\ln 2)\, m/n$ by calculus.
– So the optimal false positive probability is about $(0.6185)^{m/n}$
n items, m = cn bits, k hash functions
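The "by calculus" step can be sketched as follows, under the same approximation $p \approx e^{-kn/m}$ used on this slide. The helper function $g$ is introduced here only for illustration.

```latex
\begin{align*}
f(k) &= \left(1 - e^{-kn/m}\right)^{k}, \qquad
  p = e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\ln p, \\
\ln f &= k \ln(1 - p) = -\frac{m}{n}\, g(p), \qquad g(p) = \ln p \,\ln(1 - p), \\
g(p) &= g(1 - p) \text{ and } g \text{ is maximized at } p = \tfrac{1}{2}
  \;\Rightarrow\; k_{\mathrm{opt}} = \frac{m}{n}\ln 2, \\
f_{\min} &= \left(\tfrac{1}{2}\right)^{k_{\mathrm{opt}}}
  = \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}.
\end{align*}
```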


Bloom Filter Math Theory

[Figure: false positive rate (0 to 0.1) versus number of hash functions k (0 to 10) for m/n = 8. Optimal k = 8 ln 2 = 5.545...]
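The curve in this plot can be reproduced numerically from $f = (1 - e^{-k/c})^k$. The sketch below assumes c = m/n = 8 as in the figure; the function name `false_positive_rate` is chosen for the example.

```python
import math


def false_positive_rate(k: int, bits_per_item: float) -> float:
    """f = (1 - e^{-k/c})^k, where c = m/n is the number of bits per item."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


c = 8  # m/n = 8, as in the plot
rates = {k: false_positive_rate(k, c) for k in range(1, 11)}
for k, f in rates.items():
    print(f"k = {k:2d}  f = {f:.4f}")

best_k = min(rates, key=rates.get)  # best integer number of hash functions
k_opt = c * math.log(2)             # real-valued optimum, 8 ln 2
```

Because k must be an integer, the best choice here is one of the two integers around 8 ln 2, and the rate is nearly flat between them.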

Bloom Filter Compression

Using Bloom Filters in Network Transmission
• A Bloom filter sent as a message should be small enough to be transmitted over the network
• Compressing a bit vector is easy
– Arithmetic coding gets close to the entropy.
• Can Bloom filters be compressed?

• Optimize to minimize false positives.
$p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
$f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
$k = (m \ln 2)/n$
• At $k = m(\ln 2)/n$, $p = 1/2$.
• Bloom filter looks like a random string.
– Can’t compress it.
– $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
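To see empirically why a half-full bit array resists compression, here is a minimal sketch. It assumes the filter's bits behave like independent random bits (an idealization), and it uses zlib as a stand-in for an arithmetic coder. `random_bit_vector` is a hypothetical helper written for this example.

```python
import random
import zlib


def random_bit_vector(num_bytes: int, p_zero: float, seed: int = 0) -> bytes:
    """Return bytes whose bits are independently 0 with probability p_zero."""
    rng = random.Random(seed)
    out = bytearray()
    for _ in range(num_bytes):
        byte = 0
        for _ in range(8):
            # Append a 1 bit with probability 1 - p_zero.
            byte = (byte << 1) | (rng.random() >= p_zero)
        out.append(byte)
    return bytes(out)


dense = random_bit_vector(4096, p_zero=0.5)   # like a filter at p = 1/2
sparse = random_bit_vector(4096, p_zero=0.9)  # mostly-empty bit array

dense_ratio = len(zlib.compress(dense, 9)) / len(dense)
sparse_ratio = len(zlib.compress(sparse, 9)) / len(sparse)
print(f"p = 0.5: compressed/original = {dense_ratio:.2f}")
print(f"p = 0.9: compressed/original = {sparse_ratio:.2f}")
```

The array with p = 1/2 does not shrink, while the mostly-empty array does, which matches H(1/2) = 1 bit per bit.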

• With a larger uncompressed filter (more storage), we can achieve compression.
• Assumption: optimal compressor, $z = mH(p)$.
– $H(p)$ is the entropy function; optimally get $H(p)$ compressed bits per original table bit.
– Arithmetic coding is close to optimal.
• Optimization: given z bits for the compressed filter and n elements, choose the table size m and the number of hash functions k to minimize the false positive rate f, where
$p \approx e^{-kn/m}$; $f \approx (1 - e^{-kn/m})^k$; $z \approx mH(p)$
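The trade-off can be explored numerically with the three formulas on this slide. The sketch below assumes an ideal compressor (z = mH(p)). The function names are hypothetical, and the two parameter settings are illustrative choices that both fit in a budget of z/n = 8 bits per item.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def compressed_tradeoff(bits_per_item: float, k: int):
    """Return (z/n, f) for an uncompressed size m/n = bits_per_item."""
    p = math.exp(-k / bits_per_item)         # p ≈ e^{-kn/m}
    f = (1 - p) ** k                         # f ≈ (1 - p)^k
    z_per_item = bits_per_item * entropy(p)  # z ≈ m H(p), divided by n
    return z_per_item, f


plain = compressed_tradeoff(8, 6)    # near-optimal uncompressed filter
larger = compressed_tradeoff(14, 2)  # larger table, fewer hash functions

for name, (z, f) in (("m/n = 8,  k = 6", plain), ("m/n = 14, k = 2", larger)):
    print(f"{name}: z/n = {z:.3f}, f = {f:.4f}")
```

Both settings need at most 8 transmitted bits per item, but the larger, sparser table with fewer hash functions gives a lower false positive rate.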


[Figure: false positive rate (0 to 0.1) versus number of hash functions (0 to 10) for z/n = 8, with one curve for the original filter and one for the compressed filter.]


Conclusion

• At k = m (ln 2) /n, false positives are
maximized with a compressed Bloom
filter.
– Best case without compression is worst case
with compression; compression always
helps.
– Side benefit: Use fewer hash functions with
compression; possible speedup.

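Application Scenario

• Speed up answers in a key-value-like system

[Figure: an in-memory filter sits in front of storage. key1: the filter answers no, so no disk access is made. key2: the filter answers yes, and the disk access succeeds. key3: the filter answers yes, but the disk access fails (a false positive).]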
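The lookup flow in this scenario can be sketched as below. Everything here is illustrative: `TinyBloom` and `FilteredStore` are hypothetical names, a Python dict stands in for on-disk storage, and the salted SHA-256 hashing is an assumption made for the example.

```python
import hashlib


class TinyBloom:
    """Compact Bloom filter used only to guard the store below."""

    def __init__(self, m: int = 1024, k: int = 4) -> None:
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for a in self._positions(key):
            self.bits[a] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[a] for a in self._positions(key))


class FilteredStore:
    """Hypothetical key-value store; a dict stands in for the disk."""

    def __init__(self) -> None:
        self.disk = {}
        self.filter = TinyBloom()
        self.disk_reads = 0

    def put(self, key: str, value: str) -> None:
        self.disk[key] = value
        self.filter.add(key)

    def get(self, key: str):
        if not self.filter.might_contain(key):
            return None            # filter says no: skip the disk entirely
        self.disk_reads += 1       # filter says yes: pay for a disk access
        return self.disk.get(key)  # None here means a false positive


store = FilteredStore()
store.put("key2", "value2")
hit = store.get("key2")
reads_after_hit = store.disk_reads
```

A lookup that the filter rejects costs no disk access at all. A false positive costs one wasted disk access but never returns a wrong value.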

Application Scenario

• Web Cache

cache1 cache2 …… cache3

Web Server
