SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
Bloom Filter in Bioinformatics
Aleksey Kladov
aleksey.kladov@gmail.com
February 15, 2013
Set data structure
Implementations
• Hash table
• Search tree
• Bloom filter
Why Bloom is awesome?
• Asymptotically not faster then a hash table
• No delete operation
• It simply doesn’t work!
1 / 15
Meet the Bloom!
• Actually, it’s all from Wikipedia :)
• Burton Howard Bloom, 1970
• m-bit vector B
• k hash functions: element → [0, m)
2 / 15
Add
• Want to add element e?
• Compute h1(e), h2(e), . . . , hk(e)
• Compute H(e) m-bit vetcor with k ones at
h1(e), h2(e), . . . , hk(e)
• Let B = B ∨ H(e)
3 / 15
Query
• Want to check if element e was inserted in B?
• Check if B ∧ H(e) == H(e)
• If yes then e may be in the set
• If no then e is definitely not in the set
• False positives =(
4 / 15
So, why Bloom?
From Wikipedia
A Bloom filter with 1% error and an optimal value of k, requires
only about 9.6 bits per element — regardless of the size of the
elements.
5 / 15
k, m, n and other letters
• k – number of hash functions
• m – length of the vector
• p – false positive rate
• n – number of elements in the set
6 / 15
$Math$
Probability that ith
bit is zero after one insertion (1 −
1
m
)k
After n insertions (1 −
1
m
)kn
Probability that ith
bit is one after n insertions 1 − (1 −
1
m
)kn
Probability of positive answer (1 − (1 −
1
m
)kn
)k
Recall that (1 − 1
m )
m
≈ e−1 , plug it into the last formula and we
get false positive rate estimate
≈ (1 − e
−kn
m )k
7 / 15
The optimal k
Can we choose k to minimize false positive rate?
min
k
(1 − e
−kn
m )k
Optimal k
m
n ln 2
False positive rate
2− m
n
ln 2
Number of bits per item for a given p
− ln p
(ln 2)2
8 / 15
Excellent!
9 / 15
Counting k-mers
• Naive: hash table requires a lot of memory =(
• ≈ half of the k-mers are singleton!
• Increase in coverege leads to more singletons (linear
dependency)
• Most applications need solid k-mers
10 / 15
Bloom filter!
Paper ”Efficient counting of k-mers in DNA sequences using a
bloom filter” by P´all Melsted and Jonathan K Pritchard, 2011
• Create a bloom filter and a hash table
• If k-mer is not in filter – insert into filter
• If k-mer is in filter – insert into table with count 2 or update
count.
Resulting counts are higher then true counts at most by one.
To find precise counts make a second pass.
2x memory savings, assuming if half of the k-mers are unique
11 / 15
Picture!
Figure 1 : Memory usage for BFCounter, Jellyfish and Google Dense at
different coverage levels(Chromosome 21 data) 12 / 15
Speeding it up I
Paper ”Exact Pattern Matching with Feed-Forward Bloom Filters”
by Iulian Moraru and David G. Andersen, 2011
Problem
Given the huge text and lots of patterns of the same length, find
all occurences of patterns in the text.
Solution
• Create P and T Bloom filters
• Insert all patterns in P
• Search substring of text in filter
• For each substring of text that is in P, remember it’s positon
and add it to the T
• Filter out all patterns that are not in T
• Check all positions against reduced set of patterns
13 / 15
Speeding it up II
Poor cache behavior
Split filter into small cache part and large memory part
Lot’s of hash functions
• Fast rolling hash functions
• Hash functions don’t need to be independent: linear
combinations are OK
14 / 15
This is the end
end {document}
15 / 15

Weitere ähnliche Inhalte

Ähnlich wie Slides1

streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
GopiNathVelivela
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
abramsm
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
xlight
 
Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015
Big Data Spain
 

Ähnlich wie Slides1 (20)

streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptx
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Implement Data Structures with Python Fast using Sparse Matrix
Implement Data Structures with Python Fast using Sparse MatrixImplement Data Structures with Python Fast using Sparse Matrix
Implement Data Structures with Python Fast using Sparse Matrix
 
Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structure
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. Frequency
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
 
hash
 hash hash
hash
 
Data structures
Data structuresData structures
Data structures
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
 
Count-Distinct Problem
Count-Distinct ProblemCount-Distinct Problem
Count-Distinct Problem
 
Modern Database Systems - Lecture 02
Modern Database Systems - Lecture 02Modern Database Systems - Lecture 02
Modern Database Systems - Lecture 02
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
 
Lec1.pptx
Lec1.pptxLec1.pptx
Lec1.pptx
 
Feature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditFeature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax audit
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 

Mehr von BioinformaticsInstitute

Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
BioinformaticsInstitute
 
Knime & bioinformatics
Knime & bioinformaticsKnime & bioinformatics
Knime & bioinformatics
BioinformaticsInstitute
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
BioinformaticsInstitute
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
BioinformaticsInstitute
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
BioinformaticsInstitute
 

Mehr von BioinformaticsInstitute (20)

Graph genome
Graph genome Graph genome
Graph genome
 
Nanopores sequencing
Nanopores sequencingNanopores sequencing
Nanopores sequencing
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphs
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днк
 
Knime & bioinformatics
Knime & bioinformaticsKnime & bioinformatics
Knime & bioinformatics
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
 
Biodb 2011-everything
Biodb 2011-everythingBiodb 2011-everything
Biodb 2011-everything
 
Biodb 2011-05
Biodb 2011-05Biodb 2011-05
Biodb 2011-05
 
Biodb 2011-04
Biodb 2011-04Biodb 2011-04
Biodb 2011-04
 
Biodb 2011-03
Biodb 2011-03Biodb 2011-03
Biodb 2011-03
 
Biodb 2011-01
Biodb 2011-01Biodb 2011-01
Biodb 2011-01
 
Biodb 2011-02
Biodb 2011-02Biodb 2011-02
Biodb 2011-02
 
Ngs 3 1
Ngs 3 1Ngs 3 1
Ngs 3 1
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Kürzlich hochgeladen (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Slides1

  • 1. Bloom Filter in Bioinformatics Aleksey Kladov aleksey.kladov@gmail.com February 15, 2013
  • 2. Set data structure Implementations • Hash table • Search tree • Bloom filter Why Bloom is awesome? • Asymptotically not faster then a hash table • No delete operation • It simply doesn’t work! 1 / 15
  • 3. Meet the Bloom! • Actually, it’s all from Wikipedia :) • Burton Howard Bloom, 1970 • m-bit vector B • k hash functions: element → [0, m) 2 / 15
  • 4. Add • Want to add element e? • Compute h1(e), h2(e), . . . , hk(e) • Compute H(e) m-bit vetcor with k ones at h1(e), h2(e), . . . , hk(e) • Let B = B ∨ H(e) 3 / 15
  • 5. Query • Want to check if element e was inserted in B? • Check if B ∧ H(e) == H(e) • If yes then e may be in the set • If no then e is definitely not in the set • False positives =( 4 / 15
  • 6. So, why Bloom? From Wikipedia A Bloom filter with 1% error and an optimal value of k, requires only about 9.6 bits per element — regardless of the size of the elements. 5 / 15
  • 7. k, m, n and other letters • k – number of hash functions • m – length of the vector • p – false positive rate • n – number of elements in the set 6 / 15
  • 8. $Math$ Probability that ith bit is zero after one insertion (1 − 1 m )k After n insertions (1 − 1 m )kn Probability that ith bit is one after n insertions 1 − (1 − 1 m )kn Probability of positive answer (1 − (1 − 1 m )kn )k Recall that (1 − 1 m ) m ≈ e−1 , plug it into the last formula and we get false positive rate estimate ≈ (1 − e −kn m )k 7 / 15
  • 9. The optimal k Can we choose k to minimize false positive rate? min k (1 − e −kn m )k Optimal k m n ln 2 False positive rate 2− m n ln 2 Number of bits per item for a given p − ln p (ln 2)2 8 / 15
  • 11. Counting k-mers • Naive: hash table requires a lot of memory =( • ≈ half of the k-mers are singleton! • Increase in coverege leads to more singletons (linear dependency) • Most applications need solid k-mers 10 / 15
  • 12. Bloom filter! Paper ”Efficient counting of k-mers in DNA sequences using a bloom filter” by P´all Melsted and Jonathan K Pritchard, 2011 • Create a bloom filter and a hash table • If k-mer is not in filter – insert into filter • If k-mer is in filter – insert into table with count 2 or update count. Resulting counts are higher then true counts at most by one. To find precise counts make a second pass. 2x memory savings, assuming if half of the k-mers are unique 11 / 15
  • 13. Picture! Figure 1 : Memory usage for BFCounter, Jellyfish and Google Dense at different coverage levels(Chromosome 21 data) 12 / 15
  • 14. Speeding it up I Paper ”Exact Pattern Matching with Feed-Forward Bloom Filters” by Iulian Moraru and David G. Andersen, 2011 Problem Given the huge text and lots of patterns of the same length, find all occurences of patterns in the text. Solution • Create P and T Bloom filters • Insert all patterns in P • Search substring of text in filter • For each substring of text that is in P, remember it’s positon and add it to the T • Filter out all patterns that are not in T • Check all positions against reduced set of patterns 13 / 15
  • 15. Speeding it up II Poor cache behavior Split filter into small cache part and large memory part Lot’s of hash functions • Fast rolling hash functions • Hash functions don’t need to be independent: linear combinations are OK 14 / 15
  • 16. This is the end end {document} 15 / 15