SlideShare ist ein Scribd-Unternehmen logo
1 von 20
Downloaden Sie, um offline zu lesen
Bloom Filter
YHD Search Sharing 2013-04-23
Outline
● Interview questions
● Bloom Filter
– Data structure
– Probability of false positives
– Set properties
● Application
– Cache sharing : squid
– Speed up data access : Hbase
– ID Mapping : zoie
● Materials
Interview questions
● Crawler
– Billions web pages
– How to keep track crawled urls
● Straggler Detection
– You are manning the security desk of a large building
– Everyone checks in or checks out with their id
– At the end of day, identify the few stragglers left in the
building
Data structure
● Data structure
– Init : a bit array of m bits, all set to 0
– Add an element
● K hash function to get K array positions
● Set the bits at all these positions to 1
● Query an element (test whether it's in the set)
– K hash function to get K array positions
– If any position are 0, not in the set
– If all are 1, probabilistic in the set
Probability of false positives
● 1 hash function
0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0
p( A[i]=1∣hash[ x1 ,…, xn])=1−(1−
1
m
)
n
p(hash[ x1]=i)=
1
m
p( A[i]=0∣hash[x1])=1−
1
m
p( A[i]=0∣hash[x1 ,…,xn])=(1−
1
m
)
n
p( A[i]=1)=1−(1−
1
m
)
n
≃1−e
−n/m
lim
x→ ∞
(1−
1
x
)
−x
=e
Probability of false positives
● 1 hash function
0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0
p( A[ H ( y)]=1∣y∉S)=
(number of 1)
m
2/23
p( A[i]=1)=1−e
−n/m
Given
E(number of 1)=m⋅(1−e
−n/m
)
p( A[ H ( y)]=1∣y∉S)=(1−e
−n/m
)
Probability of false positives
● K hash function : repeat for k times
0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0
p( A[ H ( y)]=1∣y∉S)=
(number of 1)
m
p( A[i]=1)=1−e
−n/m
Given
E(number of 1)=m⋅(1−e
−n/m
)
p( A[ H ( y)]=1∣y∉S)=(1−e
−n/m
)
p( A[ H ( y)]=1∣y∉S)=(
number of 1
m
)
k
p( A[i]=1)=1−(1−
1
m
)
kn
≃1−e
−kn/m
E(number of 1)=m⋅(1−e
−kn/m
)
p( A[ H ( y)]=1∣y∉S)=(1−e
−kn/m
)
k
Probability of false positives
● Minimal Probability of false positives
p( A[ H ( y)]=1∣y∉S)=(1−e
−kn/m
)
k
f =(1−e
−kn/m
)
k
f =e
k∗ln (1−e
−kn/m
)
Minimal f, then minimal g
g=k∗ln (1−e
−kn/m
)
p=e
−kn/m
Given
g=−
m
n
ln( p)∗ln(1−p)
Minimal(f )=(
1
2
)
k
p=
1
2
e
−kn/m
=
1
2
is the probability than any specific bit is still 0
half-full Bloom filter array
Probability of false positives
● examples k=ln2
m
n
m/n k k=1 k=2 k=3 k=4 k=5
2 1.39 0.393 0.400
3 2.08 0.283 0.237 0.253
4 2.77 0.221 0.155 0.147 0.160
5 3.46 0.181 0.109 0.092 0.092 0.101
6 4.16 0.154 0.0804 0.0609 0.0561 0.0578
7 4.85 0.133 0.0618 0.0423 0.0359 0.0347
8 5.55 0.118 0.0489 0.0306 0.024 0.0217
Set properties
● Union (bitwise OR)
– same as the Bloom filter created from scratch using
the union of the two sets.
● Intersection (AND operations)
– the false positive probability in the resulting Bloom
filter is at most the false-positive probability in one of
the constituent Bloom filters, but may be larger than
the false positive probability in the Bloom filter
created from scratch using the intersection of the two
sets
Squid : Cache Digests
Squid : Cache Digests
Squid : Cache Digests
Squid : Cache Digests
Squid : Cache Digests
● False positive:
– Proxy A thinks Proxy B has URL U cached. A
asks for cached U, B responds back with “no”,
A goes to actual website.
HBase :architecture
Hbase :HFile format
● (Not including Bloom Filter)
HBase : Query optimization
● Bloom Filter
– As meta store of HFile
– used to determine if a given key is in that store file
● Characteristics
– Know n total KV count (N), but actual count can
often be much lower
– HFile.insert (and hence, BloomFilter.add)
commands are done in lexicographically increasing
order 4000
10000
5000
9001
Application : Zoie
● Long[] uidArray
– Add element
– Query element
int h = (int) ((uid >>> 32) ^ uid) * MIXER;
long bits = _filter[h & _mask];
bits |= ((1L << (h >>> 26)));
bits |= ((1L << ((h >> 20) & 0x3F)));
_filter[h & _mask] = bits;
final int h = (int) ((uid >>> 32) ^ uid) * MIXER;
final int p = h & _mask;
// check the filter
final long bits = _filter[p];
if ((bits & (1L << (h >>> 26))) == 0
|| (bits & (1L << ((h >> 20) & 0x3F))) == 0)
return -1;
Materials
● http://www.cs.jhu.edu/~fabian/courses/CS600.6
24/slides/bloomslides.pdf
● Google tech talk, bloom filtering
● http://www.slideshare.net/jaxlondon2012/intro
-to-hbase-lars-george
● https://issues.apache.org/jira/secure/attachme
nt/12444007/Bloom_Filters_in_HBase.pdf

Weitere ähnliche Inhalte

Was ist angesagt?

行列演算とPythonの言語デザイン
行列演算とPythonの言語デザイン行列演算とPythonの言語デザイン
行列演算とPythonの言語デザインAtsuo Ishimoto
 
Statistics - ArgMax Equation
Statistics - ArgMax EquationStatistics - ArgMax Equation
Statistics - ArgMax EquationAndrew Ferlitsch
 
Intoduction to numpy
Intoduction to numpyIntoduction to numpy
Intoduction to numpyFaraz Ahmed
 
Cheat sheet python3
Cheat sheet python3Cheat sheet python3
Cheat sheet python3sxw2k
 
Introduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersIntroduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersKimikazu Kato
 
Python3 cheatsheet
Python3 cheatsheetPython3 cheatsheet
Python3 cheatsheetGil Cohen
 
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】Atsuo Ishimoto
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyAndrii Gakhov
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnKarlijn Willems
 
Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheetNishant Upadhyay
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyKimikazu Kato
 
Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303Namgee Lee
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structuresshrinivasvasala
 
Python Cheat Sheet
Python Cheat SheetPython Cheat Sheet
Python Cheat SheetGlowTouch
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Andrii Gakhov
 

Was ist angesagt? (20)

行列演算とPythonの言語デザイン
行列演算とPythonの言語デザイン行列演算とPythonの言語デザイン
行列演算とPythonの言語デザイン
 
Statistics - ArgMax Equation
Statistics - ArgMax EquationStatistics - ArgMax Equation
Statistics - ArgMax Equation
 
Intoduction to numpy
Intoduction to numpyIntoduction to numpy
Intoduction to numpy
 
Cheat sheet python3
Cheat sheet python3Cheat sheet python3
Cheat sheet python3
 
Introduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersIntroduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning Programmers
 
Python3 cheatsheet
Python3 cheatsheetPython3 cheatsheet
Python3 cheatsheet
 
Numpy Talk at SIAM
Numpy Talk at SIAMNumpy Talk at SIAM
Numpy Talk at SIAM
 
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
NumPyの歴史とPythonの並行処理【PyData.tokyo One-day Conference 2018】
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. Frequency
 
Numpy python cheat_sheet
Numpy python cheat_sheetNumpy python cheat_sheet
Numpy python cheat_sheet
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
 
Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheet
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPy
 
Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
Python Cheat Sheet
Python Cheat SheetPython Cheat Sheet
Python Cheat Sheet
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
 
NumPy Refresher
NumPy RefresherNumPy Refresher
NumPy Refresher
 
C++ ammar .s.q
C++  ammar .s.qC++  ammar .s.q
C++ ammar .s.q
 

Andere mochten auch

Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityAndrii Gakhov
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данныхAndrii Gakhov
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityAndrii Gakhov
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaAndrii Gakhov
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksAndrii Gakhov
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryAndrii Gakhov
 
Cuckoo Optimization ppt
Cuckoo Optimization pptCuckoo Optimization ppt
Cuckoo Optimization pptAnuja Joshi
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Andrii Gakhov
 

Andere mochten auch (12)

Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. Similarity
 
14 Skip Lists
14 Skip Lists14 Skip Lists
14 Skip Lists
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
skip list
skip listskip list
skip list
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
 
Cuckoo Optimization ppt
Cuckoo Optimization pptCuckoo Optimization ppt
Cuckoo Optimization ppt
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
 
Cuckoo search algorithm
Cuckoo search algorithmCuckoo search algorithm
Cuckoo search algorithm
 

Ähnlich wie Bloom filter

Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structureThinh Dang
 
Distributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDistributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDuyhai Doan
 
從行動廣告大數據觀點談 Big data 20150916
從行動廣告大數據觀點談 Big data   20150916從行動廣告大數據觀點談 Big data   20150916
從行動廣告大數據觀點談 Big data 20150916Craig Chao
 
Monads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevMonads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevJavaDayUA
 
Yin Yangs of Software Development
Yin Yangs of Software DevelopmentYin Yangs of Software Development
Yin Yangs of Software DevelopmentNaveenkumar Muguda
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsDebasish Ghosh
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptxSonaliAjankar
 
Actors for Behavioural Simulation
Actors for Behavioural SimulationActors for Behavioural Simulation
Actors for Behavioural SimulationClarkTony
 
Python高级编程(二)
Python高级编程(二)Python高级编程(二)
Python高级编程(二)Qiangning Hong
 
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...Flink Forward
 
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...PingCAP
 
Bloom Filters: An Introduction
Bloom Filters: An IntroductionBloom Filters: An Introduction
Bloom Filters: An IntroductionIRJET Journal
 
ppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdfppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdfsurefooted
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015Debashis Saha
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
 
SociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data AnalysisSociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data AnalysisDataWorks Summit
 
Functional Programming, simplified
Functional Programming, simplifiedFunctional Programming, simplified
Functional Programming, simplifiedNaveenkumar Muguda
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015Fastly
 

Ähnlich wie Bloom filter (20)

Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structure
 
Distributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDistributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeCon
 
從行動廣告大數據觀點談 Big data 20150916
從行動廣告大數據觀點談 Big data   20150916從行動廣告大數據觀點談 Big data   20150916
從行動廣告大數據觀點談 Big data 20150916
 
Monads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevMonads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy Dyagilev
 
Yin Yangs of Software Development
Yin Yangs of Software DevelopmentYin Yangs of Software Development
Yin Yangs of Software Development
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptx
 
Python Memory Management 101(Europython)
Python Memory Management 101(Europython)Python Memory Management 101(Europython)
Python Memory Management 101(Europython)
 
Actors for Behavioural Simulation
Actors for Behavioural SimulationActors for Behavioural Simulation
Actors for Behavioural Simulation
 
Python高级编程(二)
Python高级编程(二)Python高级编程(二)
Python高级编程(二)
 
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
 
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
 
Bloom Filters: An Introduction
Bloom Filters: An IntroductionBloom Filters: An Introduction
Bloom Filters: An Introduction
 
ppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdfppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdf
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with Clojure
 
Imcic 2020-sugihara
Imcic 2020-sugiharaImcic 2020-sugihara
Imcic 2020-sugihara
 
SociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data AnalysisSociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data Analysis
 
Functional Programming, simplified
Functional Programming, simplifiedFunctional Programming, simplified
Functional Programming, simplified
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015
 

Mehr von feng lee

Guice in athena
Guice in athenaGuice in athena
Guice in athenafeng lee
 
Hadoop 安装
Hadoop 安装Hadoop 安装
Hadoop 安装feng lee
 
Axis2 client memory leak
Axis2 client memory leakAxis2 client memory leak
Axis2 client memory leakfeng lee
 
Mysql story in poi dedup
Mysql story in poi dedupMysql story in poi dedup
Mysql story in poi dedupfeng lee
 
Effective java - concurrency
Effective java - concurrencyEffective java - concurrency
Effective java - concurrencyfeng lee
 

Mehr von feng lee (6)

Guice in athena
Guice in athenaGuice in athena
Guice in athena
 
Hadoop 安装
Hadoop 安装Hadoop 安装
Hadoop 安装
 
Axis2 client memory leak
Axis2 client memory leakAxis2 client memory leak
Axis2 client memory leak
 
Mysql story in poi dedup
Mysql story in poi dedupMysql story in poi dedup
Mysql story in poi dedup
 
Effective java - concurrency
Effective java - concurrencyEffective java - concurrency
Effective java - concurrency
 
Maven
MavenMaven
Maven
 

Kürzlich hochgeladen

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 

Kürzlich hochgeladen (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Bloom filter

  • 1. Bloom Filter YHD Search Sharing 2013-04-23
  • 2. Outline ● Interview questions ● Bloom Filter – Data structure – Probability of false positives – Set properties ● Application – Cache sharing : squid – Speed up data access : Hbase – ID Mapping : zoie ● Materials
  • 3. Interview questions ● Crawler – Billions web pages – How to keep track crawled urls ● Straggler Detection – You are manning the security desk of a large building – Everyone checks in or checks out with their id – At the end of day, identify the few stragglers left in the building
  • 4. Data structure ● Data structure – Init : a bit array of m bits, all set to 0 – Add an element ● K hash function to get K array positions ● Set the bits at all these positions to 1 ● Query an element (test whether it's in the set) – K hash function to get K array positions – If any position are 0, not in the set – If all are 1, probabilistic in the set
  • 5. Probability of false positives ● 1 hash function 0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0 p( A[i]=1∣hash[ x1 ,…, xn])=1−(1− 1 m ) n p(hash[ x1]=i)= 1 m p( A[i]=0∣hash[x1])=1− 1 m p( A[i]=0∣hash[x1 ,…,xn])=(1− 1 m ) n p( A[i]=1)=1−(1− 1 m ) n ≃1−e −n/m lim x→ ∞ (1− 1 x ) −x =e
  • 6. Probability of false positives ● 1 hash function 0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0 p( A[ H ( y)]=1∣y∉S)= (number of 1) m 2/23 p( A[i]=1)=1−e −n/m Given E(number of 1)=m⋅(1−e −n/m ) p( A[ H ( y)]=1∣y∉S)=(1−e −n/m )
  • 7. Probability of false positives ● K hash function : repeat for k times 0 0 0 0 0 0 0 1 0 00 0 0 1 0 0 0 0 0 0 0 0 0 0 p( A[ H ( y)]=1∣y∉S)= (number of 1) m p( A[i]=1)=1−e −n/m Given E(number of 1)=m⋅(1−e −n/m ) p( A[ H ( y)]=1∣y∉S)=(1−e −n/m ) p( A[ H ( y)]=1∣y∉S)=( number of 1 m ) k p( A[i]=1)=1−(1− 1 m ) kn ≃1−e −kn/m E(number of 1)=m⋅(1−e −kn/m ) p( A[ H ( y)]=1∣y∉S)=(1−e −kn/m ) k
  • 8. Probability of false positives ● Minimal Probability of false positives p( A[ H ( y)]=1∣y∉S)=(1−e −kn/m ) k f =(1−e −kn/m ) k f =e k∗ln (1−e −kn/m ) Minimal f, then minimal g g=k∗ln (1−e −kn/m ) p=e −kn/m Given g=− m n ln( p)∗ln(1−p) Minimal(f )=( 1 2 ) k p= 1 2 e −kn/m = 1 2 is the probability than any specific bit is still 0 half-full Bloom filter array
  • 9. Probability of false positives ● examples k=ln2 m n m/n k k=1 k=2 k=3 k=4 k=5 2 1.39 0.393 0.400 3 2.08 0.283 0.237 0.253 4 2.77 0.221 0.155 0.147 0.160 5 3.46 0.181 0.109 0.092 0.092 0.101 6 4.16 0.154 0.0804 0.0609 0.0561 0.0578 7 4.85 0.133 0.0618 0.0423 0.0359 0.0347 8 5.55 0.118 0.0489 0.0306 0.024 0.0217
  • 10. Set properties ● Union (bitwise OR) – same as the Bloom filter created from scratch using the union of the two sets. ● Intersection (AND operations) – the false positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false positive probability in the Bloom filter created from scratch using the intersection of the two sets
  • 11. Squid : Cache Digests
  • 12. Squid : Cache Digests
  • 13. Squid : Cache Digests
  • 14. Squid : Cache Digests
  • 15. Squid : Cache Digests ● False positive: – Proxy A thinks Proxy B has URL U cached. A asks for cached U, B responds back with “no”, A goes to actual website.
  • 17. Hbase :HFile format ● (Not including Bloom Filter)
  • 18. HBase : Query optimization ● Bloom Filter – As meta store of HFile – used to determine if a given key is in that store file ● Characteristics – Know n total KV count (N), but actual count can often be much lower – HFile.insert (and hence, BloomFilter.add) commands are done in lexicographically increasing order 4000 10000 5000 9001
  • 19. Application : Zoie ● Long[] uidArray – Add element – Query element int h = (int) ((uid >>> 32) ^ uid) * MIXER; long bits = _filter[h & _mask]; bits |= ((1L << (h >>> 26))); bits |= ((1L << ((h >> 20) & 0x3F))); _filter[h & _mask] = bits; final int h = (int) ((uid >>> 32) ^ uid) * MIXER; final int p = h & _mask; // check the filter final long bits = _filter[p]; if ((bits & (1L << (h >>> 26))) == 0 || (bits & (1L << ((h >> 20) & 0x3F))) == 0) return -1;
  • 20. Materials ● http://www.cs.jhu.edu/~fabian/courses/CS600.6 24/slides/bloomslides.pdf ● Google tech talk, bloom filtering ● http://www.slideshare.net/jaxlondon2012/intro -to-hbase-lars-george ● https://issues.apache.org/jira/secure/attachme nt/12444007/Bloom_Filters_in_HBase.pdf