Bloom Filter

xuanzi.wp@taobao.com
      2011-11-18




Agenda

• A Membership Query Problem

• What is Bloom Filter

• Bloom Filter Math Theory

• Compression

• Application Scenario
Membership Query Problem

Problem Description

 Given an element E, determine whether it
 belongs to a large set S. The answer should be:
  – As fast as possible

  – As small as possible




Membership Query Problem

Some Solutions
   • Hash table

     Fast, but a large data structure
   • Bitmap index

     Can it be made smaller?



Membership Query Problem

Tradeoff Solutions
   To obtain speed and size improvements,
   allow some probability of error.



         Bloom Filter

What is Bloom Filter
• Supports approximate set membership.
• Given a set $S = \{x_1, x_2, \ldots, x_n\}$, construct a data
  structure to answer queries of the form "Is
  y in S?"
• The data structure should be:

     – Fast (faster than searching through S).
     – Small (smaller than an explicit representation).
• To obtain speed and size improvements,
  allow some probability of error.
     – False positives: y ∉ S but we report y ∈ S
     – False negatives: y ∈ S but we report y ∉ S (a standard Bloom filter never produces these)

What is Bloom Filter
Start with an m-bit array B, filled with 0s.

B   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hash each item $x_j$ in S k times. If $H_i(x_j) = a$, set B[a] = 1.

B   0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

To check whether y is in S, check B at $H_i(y)$ for each of the k hash functions. All k values must be 1.

B   0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

A false positive is possible: all k values are 1, but y is not in S.

B   0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

n items                        m = cn bits               k hash functions
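The insert and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the class name `BloomFilter`, the use of one byte per bit, and the way the k hash functions are derived (SHA-256 salted with the index i) are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        # One byte per bit keeps the example readable; a real filter
        # would pack 8 bits into each byte.
        self.bits = bytearray(m)

    def _positions(self, item):
        # Assumption: H_i(x) is SHA-256 of (i || x), reduced modulo m.
        data = str(item).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set B[a] = 1 for every a = H_i(item).
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item):
        # Report "in S" only if all k positions are 1.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=64, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)
print("apple" in bf)   # inserted items are always reported as present
print("durian" in bf)  # usually False, but a false positive is possible
```

Because bits are only ever set and never cleared, an inserted item can never be reported as absent; only false positives can occur.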
What is Bloom Filter
False Positive
[Figure: elements A and B are each mapped by hash1, hash2, and hash3 to positions in a bit array.]
Bloom Filter Math Theory
• Pr(a specific bit of the filter is 0) is
      $p' \equiv (1 - 1/m)^{kn} \approx e^{-kn/m} \equiv p$
• If $\rho$ is the fraction of 0 bits in the filter, then the false
  positive probability is
      $(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k$
• The approximations are valid because $\rho$ is concentrated
  around $E[\rho]$.
     – A martingale argument suffices.
• Find the optimum at $k = (\ln 2)\, m/n$ by calculus.
     – So the optimal false positive probability is about $(0.6185)^{m/n}$.
n items                m = cn bits           k hash functions
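The formulas above are easy to evaluate numerically. The sketch below uses the approximation $f \approx (1 - e^{-kn/m})^k$ and the optimum $k = (\ln 2)\, m/n$; the function names `false_positive_rate` and `optimal_k` are chosen for this example only.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate false positive probability (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Real-valued k that minimizes the approximation: (ln 2) * m / n."""
    return math.log(2) * m / n


m, n = 8000, 1000  # m/n = 8 bits per item
k_opt = optimal_k(m, n)
print("optimal k:", k_opt)
print("f at optimal k:", false_positive_rate(m, n, k_opt))
print("0.6185^(m/n):", 0.6185 ** (m / n))

# In practice k must be an integer, so compare the neighbours of k_opt.
for k in (5, 6):
    print(k, false_positive_rate(m, n, k))
```

At the optimum $p = 1/2$, so $f = (1/2)^k = (1/2)^{(\ln 2) m/n}$, which is where the constant $0.6185 \approx 2^{-\ln 2}$ comes from.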

Bloom Filter Math Theory

                       0.1
                      0.09
                      0.08
False positive rate




                      0.07
                                                                                     m/n = 8
                      0.06                      Opt k = 8 ln 2 = 5.45...
                      0.05
                      0.04
                      0.03
                      0.02
                      0.01
                         0
                             0    1   2     3     4     5   6     7   8    9    10
                                                 Hash functions
n items                                   m = cn bits             k hash functions       10
Bloom Filter Compression

Use BF in Network Transmission
• A BF sent as a message should be small enough
  to be transmitted over the network.
• Compressing a bit vector is easy:
  arithmetic coding gets close to the entropy.

• Can Bloom filters be compressed?


Bloom Filter Compression
• Optimize to minimize the false positive probability.
    $p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
    $f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
    $k = (m \ln 2)/n$
• At $k = m (\ln 2)/n$, $p = 1/2$.
• The Bloom filter then looks like a random string.
  – It can't be compressed.
  – $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
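A short numerical check makes the incompressibility argument concrete: at the optimal k the fraction of empty cells is 1/2 and the entropy per bit is exactly 1, while any other k gives an entropy below 1. The helper names `binary_entropy` and `empty_probability` are assumptions for this sketch.

```python
import math


def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), in bits per table bit."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def empty_probability(m, n, k):
    """p ~ e^{-kn/m}: probability that a given cell is still 0."""
    return math.exp(-k * n / m)


m, n = 8000, 1000
k_opt = math.log(2) * m / n
for k in (1, 2, 4, k_opt, 8):
    p = empty_probability(m, n, k)
    print(k, p, binary_entropy(p))
```

Only the row for the optimal k reaches an entropy of 1 bit per table bit; every other row leaves room for a compressor.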

Bloom Filter Compression
• By allowing a larger uncompressed filter (more storage),
  we can achieve compression.
• Assumption: optimal compressor, $z = m H(p)$.
    – $H(p)$ is the entropy function; optimally we get
      $H(p)$ compressed bits per original table bit.
    – Arithmetic coding is close to optimal.
• Optimization: given $z$ bits for the compressed
  filter and $n$ elements, choose the table size $m$
  and the number of hash functions $k$ to
  minimize $f$, where
      $p \approx e^{-kn/m}$;  $f \approx (1 - e^{-kn/m})^k$;  $z \approx m H(p)$
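The optimization above can be explored numerically. The sketch below fixes the compressed budget z and, for each k, searches for the table size m whose ideal compressed size $m H(p)$ equals z, then reports the resulting false positive rate. It assumes an ideal compressor and relies on $m H(p)$ growing with m for fixed k; the function names and the bisection search are choices made for this example, not part of the slides.

```python
import math


def binary_entropy(p):
    """H(p) in bits; taken as 0 at the endpoints."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def compressed_bits(m, n, k):
    """Ideal compressed size z ~ m * H(p), with p ~ e^{-kn/m}."""
    return m * binary_entropy(math.exp(-k * n / m))


def false_positive_rate(m, n, k):
    """f ~ (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def table_size_for_budget(z, n, k):
    """Find m >= z with compressed_bits(m, n, k) ~ z by bisection."""
    lo = hi = float(z)
    while compressed_bits(hi, n, k) < z:  # grow until the budget is reached
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if compressed_bits(mid, n, k) < z:
            lo = mid
        else:
            hi = mid
    return hi


n, z = 1000, 8000  # z/n = 8 compressed bits per item
for k in range(1, 7):
    m = table_size_for_budget(z, n, k)
    print(k, round(m / n, 1), false_positive_rate(m, n, k))
```

With few hash functions the table is large and sparse, so it compresses well; the printed rates can be compared against the uncompressed optimum $(0.6185)^{8}$ for the same number of transmitted bits.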
Bloom Filter Compression

                       0.1
                      0.09
                      0.08
                                                                  Original
                                                                                  z/n = 8
False positive rate




                      0.07                                        Compressed
                      0.06
                      0.05
                      0.04
                      0.03
                      0.02
                      0.01
                         0
                             0   1   2   3    4    5    6     7     8    9   10
                                             Hash functions
                                                                                           14
                                                                                      14
Bloom Filter Compression

Conclusion

• At k = m (ln 2) /n, false positives are
  maximized with a compressed Bloom
  filter.
  – Best case without compression is worst case
    with compression; compression always
    helps.
  – Side benefit: Use fewer hash functions with
    compression; possible speedup.
Application Scenario

• Speed up answers in a key-value-like system

    key     filter (memory)    storage (disk)
    key1    no                 not accessed
    key2    yes                disk access, success
    key3    yes                disk access, fail (false positive)
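The lookup path in this scenario can be sketched as follows. Everything here is illustrative: `FilteredStore`, the dictionary standing in for disk, the `disk_reads` counter, and the SHA-256-based hash functions are assumptions for the example, not details from the slides.

```python
import hashlib


class BloomFilter:
    """Compact Bloom filter sketch (m bits, k salted SHA-256 hashes)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item):
        data = str(item).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item):
        return all(self.bits[a] for a in self._positions(item))


class FilteredStore:
    """Key-value store that consults an in-memory filter before disk."""

    def __init__(self, m, k):
        self.filter = BloomFilter(m, k)
        self.disk = {}        # stand-in for slow on-disk storage
        self.disk_reads = 0   # counts simulated disk accesses

    def put(self, key, value):
        self.filter.add(key)
        self.disk[key] = value

    def get(self, key):
        if key not in self.filter:
            return None       # filter says "no": skip the disk entirely
        self.disk_reads += 1  # filter says "yes": pay for a disk access
        return self.disk.get(key)  # None here means a false positive


store = FilteredStore(m=1024, k=3)
store.put("key2", "value2")
print(store.get("key2"), store.disk_reads)  # found after one disk access
print(store.get("key1"), store.disk_reads)  # absent; usually no disk access
```

A "no" from the filter is always correct, so the only wasted work is the occasional disk access caused by a false positive.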




Application Scenario

   • Web Cache

      cache1    cache2    ……    cache3




                   Web Server




Q&A



