A probabilistic approach to k-mer counting

                                                   Qingpeng Zhang

                                 Department of Computer Science and Engineering
                                           Michigan State University
                                         East Lansing, Michigan, USA
                                                    qingpeng@msu.edu

                                                     July 13, 2012




Qingpeng Zhang (Michigan State University) A probabilistic approach to k-mer counting   July 2012   1 / 12
What is k-mer counting?




What is our k-mer counting approach?

   The Bloom counting hash consists of one or more hash tables of different sizes.
   Each entry in the hash tables is a counter representing the number of k-mers
   that hash to that location.
        Bloom filter (0/1) or Count-min Sketch (counting)
   The hash function takes the modulus of a number representing the k-mer with
   the table size.
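
The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the khmer implementation: the class name `CountingHash`, the two-bits-per-base encoding in `kmer_to_int`, and the table sizes 11, 13 and 17 are assumptions chosen for the example. The count is read back as the minimum across tables, which is how a count-min sketch answers a query.

```python
# Illustrative sketch only; names, encoding, and table sizes are assumptions.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Encode a DNA k-mer as an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = value * 4 + ENCODING[base]
    return value


class CountingHash:
    """Several counter tables of different sizes, indexed by modulus."""

    def __init__(self, table_sizes):
        self.table_sizes = list(table_sizes)
        self.tables = [[0] * size for size in self.table_sizes]

    def add(self, kmer):
        number = kmer_to_int(kmer)
        # The hash function is the k-mer's number modulo the table size.
        for table, size in zip(self.tables, self.table_sizes):
            table[number % size] += 1

    def get(self, kmer):
        number = kmer_to_int(kmer)
        # Collisions can only inflate a counter, so the smallest one is the
        # best available estimate.
        return min(table[number % size]
                   for table, size in zip(self.tables, self.table_sizes))


counts = CountingHash([11, 13, 17])
for kmer in ["ACGT", "ACGT", "TTGA"]:
    counts.add(kmer)

print(counts.get("ACGT"))  # 2
print(counts.get("TTGA"))  # 1
```

In this toy run "ACGT" (27) and "TTGA" (248) collide in the tables of size 13 and 17 but not in the table of size 11, so the minimum still returns the true counts. A count is wrong only when a k-mer collides in every table.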

What is our k-mer counting approach?




           Collisions introduce a counting false positive rate [1] as the tradeoff.
           Probabilistic properties are well suited to next-generation sequencing datasets.
           Highly scalable: counting accuracy depends on memory usage, but
           our approach never exceeds an imposed memory bound.

      [1] Counting false positive rate: the probability that a reported count is
          incorrect (off by 1 or more).
How does our k-mer counting approach perform?
   How many k-mers have an incorrect count? The counting error rate.

   N: number of unique k-mers; Z: number of hash tables; H: size of the hash tables.

   The probability that no collision happens in a specific entry of one hash
   table is $(1 - 1/H)^N$, which is approximately $e^{-N/H}$.
   The individual collision rate in one hash table is $1 - e^{-N/H}$.
   The counting error rate $f$, which is the probability that a collision
   happens at every location a k-mer hashes to across all Z hash tables, is
   $f = (1 - e^{-N/H})^Z$.

   Example: N = 915898, Z = 4, H = 400000 gives
   $f = (1 - e^{-N/H})^Z = 0.6523$; the observed counting error rate is 0.6566.
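
The example figure can be checked numerically. The short Python sketch below evaluates the counting error rate from the slide's N, H and Z. The function names are hypothetical, and `table_size_for` is only an algebraic rearrangement of the same formula, H = -N / ln(1 - f^(1/Z)), added here to show how large each table would need to be for a target error rate.

```python
import math


def counting_error_rate(n_unique, table_size, n_tables):
    """f = (1 - e^(-N/H))^Z, using the slide's symbols N, H, Z."""
    per_table_collision = 1.0 - math.exp(-n_unique / table_size)
    return per_table_collision ** n_tables


def table_size_for(n_unique, n_tables, target_rate):
    """Rearranged formula: H = -N / ln(1 - f^(1/Z)). Hypothetical helper."""
    per_table_collision = target_rate ** (1.0 / n_tables)
    return -n_unique / math.log(1.0 - per_table_collision)


f = counting_error_rate(915898, 400000, 4)
print(round(f, 4))  # 0.6523, matching the slide's example
```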

How does our k-mer counting approach perform?
   OK, some counts are incorrect. But how "incorrect" are they?




                                                                             Factors that influence the miscount:
                                                                                        total number of k-mers
                                                                                        hash table size
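
A small simulation shows how these two factors move the size of the miscount. This is a toy sketch under loud assumptions, not the experiment behind the slide's figure: the k-mers are replaced by distinct random integers, each seen exactly once, a single table is used, and `mean_overcount` is a hypothetical helper name.

```python
import random


def mean_overcount(n_kmers, table_size, seed=0):
    """Average excess count per k-mer in one modulus-indexed table."""
    rng = random.Random(seed)
    # Assumption: distinct random integers stand in for encoded k-mers.
    keys = rng.sample(range(10**9), n_kmers)
    table = [0] * table_size
    for key in keys:
        table[key % table_size] += 1
    # Every key has a true count of 1, so the excess is the counter minus 1.
    return sum(table[key % table_size] - 1 for key in keys) / n_kmers


small_table = mean_overcount(2000, 500)
large_table = mean_overcount(2000, 50000)
more_kmers = mean_overcount(4000, 500)

print(small_table, large_table, more_kmers)
```

With these settings the expected excess is roughly the number of k-mers divided by the table size, so the small table overcounts far more than the large one, and doubling the number of k-mers roughly doubles the overcount.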




How does our k-mer counting approach perform?
   Time Usage




                            Figure: Time usage of the khmer counting approach

How does our k-mer counting approach perform?
   Memory Usage




                       Figure: Memory usage of different k-mer counting tools

How does our k-mer counting approach perform?
   Disk Storage Usage




                    Figure: Disk storage usage of different k-mer counting tools

What is the application of our approach?
   Filtering out reads with low-abundance k-mers for de novo assembly




                     Figure: Percentage of "bad" reads in the remaining reads

   Iteratively filtering out low-abundance reads ("bad" reads, which contain even a
   single unique k-mer) using hash tables of different sizes (1e8 and 1e9) for a
   human gut microbiome metagenomic dataset (MH0001, 42,458,402 reads)
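
The filtering idea can be sketched as follows. This is an assumption-laden illustration rather than the khmer pipeline: an exact `collections.Counter` stands in for the probabilistic counting hash, the function names are hypothetical, and the three toy reads are invented for the example. A read is dropped when any of its k-mers occurs only once in the current read set, and the pass is repeated until nothing more is removed.

```python
from collections import Counter


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def filter_once(reads, k):
    """Drop reads that contain at least one k-mer seen only once."""
    # Assumption: exact counts replace the probabilistic counting hash.
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    return [read for read in reads
            if all(counts[kmer] > 1 for kmer in kmers(read, k))]


def filter_until_stable(reads, k):
    """Repeat the filter, since removing reads lowers other k-mer counts."""
    while True:
        kept = filter_once(reads, k)
        if len(kept) == len(reads):
            return kept
        reads = kept


reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
kept = filter_until_stable(reads, k=4)
print(kept)  # ['ACGTAC', 'ACGTAC']
```

The third read is removed because "CGTT" and "GTTT" each occur once. With a counting hash in place of the exact counter, collisions can only raise counts, so some "bad" reads would survive a pass; that is the quantity plotted in the figure above.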
Summary


           A simple probabilistic approach for fast and memory-efficient counting of
           k-mers
                arbitrary-length k-mers
                arbitrary-size sequence data sets
                with a tradeoff of counting error
           Other possible applications
                digital normalization
                repeat detection
                diversity analysis of metagenomic samples
                ...
           The khmer software package is written in C++ and Python, available at
           https://github.com/ged-lab/khmer




Acknowledgement




           Jason Pell, Rose Canino-Koning, Adina Chuang Howe
           Dr. C. Titus Brown
           GED lab members at Michigan State University
           Funding from USDA, DOE, MSU, BEACON, iCER
           Thanks!




