SlideShare ist ein Scribd-Unternehmen logo
1 von 20
Downloaden Sie, um offline zu lesen
Bloom Filter
YHD Search Sharing 2013-04-23
Outline
● Interview questions
● Bloom Filter
– Data structure
– Probability of false positives
– Set properties
● Application
– Cache sharing : squid
– Speed up data access : Hbase
– ID Mapping : zoie
● Materials
Interview questions
● Crawler
– Billions web pages
– How to keep track crawled urls
● Straggler Detection
– You are manning the security desk of a large building
– Everyone checks in or checks out with their id
– At the end of day, identify the few stragglers left in the
building
Data structure
● Data structure
– Init : a bit array of m bits, all set to 0
– Add an element
● K hash function to get K array positions
● Set the bits at all these positions to 1
● Query an element (test whether it's in the set)
– K hash function to get K array positions
– If any position are 0, not in the set
– If all are 1, probabilistic in the set
Probability of false positives
● 1 hash function
0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0
p( A[i]=1∣hash[ x1 ,…, xn])=1−(1−
1
m
)
n
p(hash[ x1]=i)=
1
m
p( A[i]=0∣hash[x1])=1−
1
m
p( A[i]=0∣hash[x1 ,…,xn])=(1−
1
m
)
n
p( A[i]=1)=1−(1−
1
m
)
n
≃1−e
−n/m
lim
x→ ∞
(1−
1
x
)
−x
=e
Probability of false positives
● 1 hash function
0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0
p( A[ H ( y)]=1∣y∉S)=
(number of 1)
m
2/23
p( A[i]=1)=1−e
−n/m
Given
E(number of 1)=m⋅(1−e
−n/m
)
p( A[ H ( y)]=1∣y∉S)=(1−e
−n/m
)
Probability of false positives
● K hash function : repeat for k times
0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0
p( A[ H ( y)]=1∣y∉S)=
(number of 1)
m
p( A[i]=1)=1−e
−n/m
Given
E(number of 1)=m⋅(1−e
−n/m
)
p( A[ H ( y)]=1∣y∉S)=(1−e
−n/m
)
p( A[ H ( y)]=1∣y∉S)=(
number of 1
m
)
k
p( A[i]=1)=1−(1−
1
m
)
kn
≃1−e
−kn/m
E(number of 1)=m⋅(1−e
−kn/m
)
p( A[ H ( y)]=1∣y∉S)=(1−e
−kn/m
)
k
Probability of false positives
● Minimal Probability of false positives
p( A[ H ( y)]=1∣y∉S)=(1−e
−kn/m
)
k
f =(1−e
−kn/m
)
k
f =e
k∗ln (1−e
−kn/m
)
Minimal f, then minimal g
g=k∗ln (1−e
−kn/m
)
p=e
−kn/m
Given
g=−
m
n
ln( p)∗ln(1−p)
Minimal(f )=(
1
2
)
k
p=
1
2
e
−kn/m
=
1
2
is the probability than any specific bit is still 0
half-full Bloom filter array
Probability of false positives
● examples k=ln2
m
n
m/n k k=1 k=2 k=3 k=4 k=5
2 1.39 0.393 0.400
3 2.08 0.283 0.237 0.253
4 2.77 0.221 0.155 0.147 0.160
5 3.46 0.181 0.109 0.092 0.092 0.101
6 4.16 0.154 0.0804 0.0609 0.0561 0.0578
7 4.85 0.133 0.0618 0.0423 0.0359 0.0347
8 5.55 0.118 0.0489 0.0306 0.024 0.0217
Set properties
● Union (bitwise OR)
– same as the Bloom filter created from scratch using
the union of the two sets.
● Intersection (AND operations)
– the false positive probability in the resulting Bloom
filter is at most the false-positive probability in one of
the constituent Bloom filters, but may be larger than
the false positive probability in the Bloom filter
created from scratch using the intersection of the two
sets
Squid : Cache Digests
Squid : Cache Digests
Squid : Cache Digests
Squid : Cache Digests
Squid : Cache Digests
● False positive:
– Proxy A thinks Proxy B has URL U cached. A
asks for cached U, B responds back with “no”,
A goes to actual website.
HBase :architecture
Hbase :HFile format
● (Not including Bloom Filter)
HBase : Query optimization
● Bloom Filter
– As meta store of HFile
– used to determine if a given key is in that store file
● Characteristics
– Know n total KV count (N), but actual count can
often be much lower
– HFile.insert (and hence, BloomFilter.add)
commands are done in lexicographically increasing
order 4000
10000
5000
9001
Application : Zoie
● Long[] uidArray
– Add element
– Query element
int h = (int) ((uid >>> 32) ^ uid) * MIXER;
long bits = _filter[h & _mask];
bits |= ((1L << (h >>> 26)));
bits |= ((1L << ((h >> 20) & 0x3F)));
_filter[h & _mask] = bits;
final int h = (int) ((uid >>> 32) ^ uid) * MIXER;
final int p = h & _mask;
// check the filter
final long bits = _filter[p];
if ((bits & (1L << (h >>> 26))) == 0
|| (bits & (1L << ((h >> 20) & 0x3F))) == 0)
return -1;
Materials
● http://www.cs.jhu.edu/~fabian/courses/CS600.6
24/slides/bloomslides.pdf
● Google tech talk, bloom filtering
● http://www.slideshare.net/jaxlondon2012/intro
-to-hbase-lars-george
● https://issues.apache.org/jira/secure/attachme
nt/12444007/Bloom_Filters_in_HBase.pdf

Weitere ähnliche Inhalte

Was ist angesagt?

行列演算とPythonの言語デザイン
行列演算とPythonの言語デザイン行列演算とPythonの言語デザイン
行列演算とPythonの言語デザインAtsuo Ishimoto
 
Statistics - ArgMax Equation
Statistics - ArgMax EquationStatistics - ArgMax Equation
Statistics - ArgMax EquationAndrew Ferlitsch
 
Intoduction to numpy
Intoduction to numpyIntoduction to numpy
Intoduction to numpyFaraz Ahmed
 
Cheat sheet python3
Cheat sheet python3Cheat sheet python3
Cheat sheet python3sxw2k
 
Introduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersIntroduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersKimikazu Kato
 
Python3 cheatsheet
Python3 cheatsheetPython3 cheatsheet
Python3 cheatsheetGil Cohen
 
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】Atsuo Ishimoto
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyAndrii Gakhov
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnKarlijn Willems
 
Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheetNishant Upadhyay
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyKimikazu Kato
 
Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303Namgee Lee
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structuresshrinivasvasala
 
Python Cheat Sheet
Python Cheat SheetPython Cheat Sheet
Python Cheat SheetGlowTouch
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Andrii Gakhov
 

Was ist angesagt? (20)

行列演算とPythonの言語デザイン
行列演算とPythonの言語デザイン行列演算とPythonの言語デザイン
行列演算とPythonの言語デザイン
 
Statistics - ArgMax Equation
Statistics - ArgMax EquationStatistics - ArgMax Equation
Statistics - ArgMax Equation
 
Intoduction to numpy
Intoduction to numpyIntoduction to numpy
Intoduction to numpy
 
Cheat sheet python3
Cheat sheet python3Cheat sheet python3
Cheat sheet python3
 
Introduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersIntroduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning Programmers
 
Python3 cheatsheet
Python3 cheatsheetPython3 cheatsheet
Python3 cheatsheet
 
Numpy Talk at SIAM
Numpy Talk at SIAMNumpy Talk at SIAM
Numpy Talk at SIAM
 
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. Frequency
 
Numpy python cheat_sheet
Numpy python cheat_sheetNumpy python cheat_sheet
Numpy python cheat_sheet
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
 
Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheet
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPy
 
Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
Python Cheat Sheet
Python Cheat SheetPython Cheat Sheet
Python Cheat Sheet
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
 
NumPy Refresher
NumPy RefresherNumPy Refresher
NumPy Refresher
 
C++ ammar .s.q
C++  ammar .s.qC++  ammar .s.q
C++ ammar .s.q
 

Andere mochten auch

Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityAndrii Gakhov
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данныхAndrii Gakhov
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityAndrii Gakhov
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaAndrii Gakhov
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksAndrii Gakhov
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryAndrii Gakhov
 
Cuckoo Optimization ppt
Cuckoo Optimization pptCuckoo Optimization ppt
Cuckoo Optimization pptAnuja Joshi
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Andrii Gakhov
 

Andere mochten auch (12)

Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. Similarity
 
14 Skip Lists
14 Skip Lists14 Skip Lists
14 Skip Lists
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
skip list
skip listskip list
skip list
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
 
Cuckoo Optimization ppt
Cuckoo Optimization pptCuckoo Optimization ppt
Cuckoo Optimization ppt
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
 
Cuckoo search algorithm
Cuckoo search algorithmCuckoo search algorithm
Cuckoo search algorithm
 

Ähnlich wie Bloom filter

Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structureThinh Dang
 
Distributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDistributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDuyhai Doan
 
從行動廣告大數據觀點談 Big data 20150916
從行動廣告大數據觀點談 Big data   20150916從行動廣告大數據觀點談 Big data   20150916
從行動廣告大數據觀點談 Big data 20150916Craig Chao
 
Monads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevMonads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevJavaDayUA
 
Yin Yangs of Software Development
Yin Yangs of Software DevelopmentYin Yangs of Software Development
Yin Yangs of Software DevelopmentNaveenkumar Muguda
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsDebasish Ghosh
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptxSonaliAjankar
 
Actors for Behavioural Simulation
Actors for Behavioural SimulationActors for Behavioural Simulation
Actors for Behavioural SimulationClarkTony
 
Python高级编程(二)
Python高级编程(二)Python高级编程(二)
Python高级编程(二)Qiangning Hong
 
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...Flink Forward
 
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...PingCAP
 
Bloom Filters: An Introduction
Bloom Filters: An IntroductionBloom Filters: An Introduction
Bloom Filters: An IntroductionIRJET Journal
 
ppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdfppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdfsurefooted
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015Debashis Saha
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
 
SociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data AnalysisSociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data AnalysisDataWorks Summit
 
Functional Programming, simplified
Functional Programming, simplifiedFunctional Programming, simplified
Functional Programming, simplifiedNaveenkumar Muguda
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015Fastly
 

Ähnlich wie Bloom filter (20)

Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structure
 
Distributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDistributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeCon
 
從行動廣告大數據觀點談 Big data 20150916
從行動廣告大數據觀點談 Big data   20150916從行動廣告大數據觀點談 Big data   20150916
從行動廣告大數據觀點談 Big data 20150916
 
Monads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevMonads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy Dyagilev
 
Yin Yangs of Software Development
Yin Yangs of Software DevelopmentYin Yangs of Software Development
Yin Yangs of Software Development
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptx
 
Python Memory Management 101(Europython)
Python Memory Management 101(Europython)Python Memory Management 101(Europython)
Python Memory Management 101(Europython)
 
Actors for Behavioural Simulation
Actors for Behavioural SimulationActors for Behavioural Simulation
Actors for Behavioural Simulation
 
Python高级编程(二)
Python高级编程(二)Python高级编程(二)
Python高级编程(二)
 
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
 
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
 
Bloom Filters: An Introduction
Bloom Filters: An IntroductionBloom Filters: An Introduction
Bloom Filters: An Introduction
 
ppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdfppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdf
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with Clojure
 
Imcic 2020-sugihara
Imcic 2020-sugiharaImcic 2020-sugihara
Imcic 2020-sugihara
 
SociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data AnalysisSociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data Analysis
 
Functional Programming, simplified
Functional Programming, simplifiedFunctional Programming, simplified
Functional Programming, simplified
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015
 

Mehr von feng lee

Guice in athena
Guice in athenaGuice in athena
Guice in athenafeng lee
 
Hadoop 安装
Hadoop 安装Hadoop 安装
Hadoop 安装feng lee
 
Axis2 client memory leak
Axis2 client memory leakAxis2 client memory leak
Axis2 client memory leakfeng lee
 
Mysql story in poi dedup
Mysql story in poi dedupMysql story in poi dedup
Mysql story in poi dedupfeng lee
 
Effective java - concurrency
Effective java - concurrencyEffective java - concurrency
Effective java - concurrencyfeng lee
 

Mehr von feng lee (6)

Guice in athena
Guice in athenaGuice in athena
Guice in athena
 
Hadoop 安装
Hadoop 安装Hadoop 安装
Hadoop 安装
 
Axis2 client memory leak
Axis2 client memory leakAxis2 client memory leak
Axis2 client memory leak
 
Mysql story in poi dedup
Mysql story in poi dedupMysql story in poi dedup
Mysql story in poi dedup
 
Effective java - concurrency
Effective java - concurrencyEffective java - concurrency
Effective java - concurrency
 
Maven
MavenMaven
Maven
 

Kürzlich hochgeladen

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Kürzlich hochgeladen (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Bloom filter

  • 1. Bloom Filter YHD Search Sharing 2013-04-23
  • 2. Outline ● Interview questions ● Bloom Filter – Data structure – Probability of false positives – Set properties ● Application – Cache sharing : squid – Speed up data access : Hbase – ID Mapping : zoie ● Materials
  • 3. Interview questions ● Crawler – Billions web pages – How to keep track crawled urls ● Straggler Detection – You are manning the security desk of a large building – Everyone checks in or checks out with their id – At the end of day, identify the few stragglers left in the building
  • 4. Data structure ● Data structure – Init : a bit array of m bits, all set to 0 – Add an element ● K hash function to get K array positions ● Set the bits at all these positions to 1 ● Query an element (test whether it's in the set) – K hash function to get K array positions – If any position are 0, not in the set – If all are 1, probabilistic in the set
  • 5. Probability of false positives ● 1 hash function 0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0 p( A[i]=1∣hash[ x1 ,…, xn])=1−(1− 1 m ) n p(hash[ x1]=i)= 1 m p( A[i]=0∣hash[x1])=1− 1 m p( A[i]=0∣hash[x1 ,…,xn])=(1− 1 m ) n p( A[i]=1)=1−(1− 1 m ) n ≃1−e −n/m lim x→ ∞ (1− 1 x ) −x =e
  • 6. Probability of false positives ● 1 hash function 0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0 p( A[ H ( y)]=1∣y∉S)= (number of 1) m 2/23 p( A[i]=1)=1−e −n/m Given E(number of 1)=m⋅(1−e −n/m ) p( A[ H ( y)]=1∣y∉S)=(1−e −n/m )
  • 7. Probability of false positives ● K hash function : repeat for k times 0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0 p( A[ H ( y)]=1∣y∉S)= (number of 1) m p( A[i]=1)=1−e −n/m Given E(number of 1)=m⋅(1−e −n/m ) p( A[ H ( y)]=1∣y∉S)=(1−e −n/m ) p( A[ H ( y)]=1∣y∉S)=( number of 1 m ) k p( A[i]=1)=1−(1− 1 m ) kn ≃1−e −kn/m E(number of 1)=m⋅(1−e −kn/m ) p( A[ H ( y)]=1∣y∉S)=(1−e −kn/m ) k
  • 8. Probability of false positives ● Minimal Probability of false positives p( A[ H ( y)]=1∣y∉S)=(1−e −kn/m ) k f =(1−e −kn/m ) k f =e k∗ln (1−e −kn/m ) Minimal f, then minimal g g=k∗ln (1−e −kn/m ) p=e −kn/m Given g=− m n ln( p)∗ln(1−p) Minimal(f )=( 1 2 ) k p= 1 2 e −kn/m = 1 2 is the probability than any specific bit is still 0 half-full Bloom filter array
  • 9. Probability of false positives ● examples k=ln2 m n m/n k k=1 k=2 k=3 k=4 k=5 2 1.39 0.393 0.400 3 2.08 0.283 0.237 0.253 4 2.77 0.221 0.155 0.147 0.160 5 3.46 0.181 0.109 0.092 0.092 0.101 6 4.16 0.154 0.0804 0.0609 0.0561 0.0578 7 4.85 0.133 0.0618 0.0423 0.0359 0.0347 8 5.55 0.118 0.0489 0.0306 0.024 0.0217
  • 10. Set properties ● Union (bitwise OR) – same as the Bloom filter created from scratch using the union of the two sets. ● Intersection (AND operations) – the false positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false positive probability in the Bloom filter created from scratch using the intersection of the two sets
  • 11. Squid : Cache Digests
  • 12. Squid : Cache Digests
  • 13. Squid : Cache Digests
  • 14. Squid : Cache Digests
  • 15. Squid : Cache Digests ● False positive: – Proxy A thinks Proxy B has URL U cached. A asks for cached U, B responds back with “no”, A goes to actual website.
  • 17. Hbase :HFile format ● (Not including Bloom Filter)
  • 18. HBase : Query optimization ● Bloom Filter – As meta store of HFile – used to determine if a given key is in that store file ● Characteristics – Know n total KV count (N), but actual count can often be much lower – HFile.insert (and hence, BloomFilter.add) commands are done in lexicographically increasing order 4000 10000 5000 9001
  • 19. Application : Zoie ● Long[] uidArray – Add element – Query element int h = (int) ((uid >>> 32) ^ uid) * MIXER; long bits = _filter[h & _mask]; bits |= ((1L << (h >>> 26))); bits |= ((1L << ((h >> 20) & 0x3F))); _filter[h & _mask] = bits; final int h = (int) ((uid >>> 32) ^ uid) * MIXER; final int p = h & _mask; // check the filter final long bits = _filter[p]; if ((bits & (1L << (h >>> 26))) == 0 || (bits & (1L << ((h >> 20) & 0x3F))) == 0) return -1;
  • 20. Materials ● http://www.cs.jhu.edu/~fabian/courses/CS600.6 24/slides/bloomslides.pdf ● Google tech talk, bloom filtering ● http://www.slideshare.net/jaxlondon2012/intro -to-hbase-lars-george ● https://issues.apache.org/jira/secure/attachme nt/12444007/Bloom_Filters_in_HBase.pdf