Parks kmer metagenomics

K-mers in metagenomics

K-mers play a critical role in the exploration of metagenomic data. They have been widely used to assign taxonomic attributions to the short genomic fragments characteristic of shotgun (metagenomic) sequencing. These approaches provide an assembly-free method for profiling microbial communities, and have helped elucidate the factors driving microbial community composition across biogeochemical gradients. Advances in sequencing technology are now making it cost-effective to sequence microbial communities at sufficient depth to allow the assembly of high-quality contigs. This has made it possible to adopt k-mer-based approaches for reliable binning of contigs originating from a single microbial population within a community. In this session, I will give an overview of how k-mers can be used to assign taxonomic attributions to short metagenomic reads, and discuss how these approaches have advanced to the point where population genomes can be recovered en masse from even complex microbial communities.
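
To make the read-classification idea concrete, here is a minimal Python sketch of exact k-mer matching against a lowest-common-ancestor (LCA) table, in the spirit of the Kraken approach covered in the slides. It is not Kraken's implementation: the toy taxonomy, the two "genomes", K = 4 and all function names are assumptions chosen for illustration, and the scoring is a simplified version of picking the highest-weighted root-to-leaf path.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy taxonomy (child -> parent); not taken from a real database.
PARENT = {
    "s__E. coli": "g__Escherichia",
    "g__Escherichia": "p__Proteobacteria",
    "s__S. enterica": "g__Salmonella",
    "g__Salmonella": "p__Proteobacteria",
    "p__Proteobacteria": "d__Bacteria",
    "d__Bacteria": None,
}

# Hypothetical toy "reference genomes"; real genomes are millions of bases long.
GENOMES = {
    "s__E. coli": "ACGTACGGA",
    "s__S. enterica": "ACGTTTGCA",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    raise ValueError("taxa do not share a root")


def kmers(seq, k):
    """All overlapping substrings of length k (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(genomes, k):
    """Map each k-mer to the LCA of every genome that contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for kmer in set(kmers(seq, k)):
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def classify(read, index, k):
    """Label a read with the taxon whose root-to-node path collects most hits."""
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    if not hits:
        return None  # no k-mer matched: the read stays unclassified
    scores = {t: sum(hits[a] for a in lineage(t)) for t in hits}
    best = max(scores.values())
    tied = [t for t, s in scores.items() if s == best]
    return reduce(lca, tied)  # ties fall back to the common ancestor


index = build_index(GENOMES, k=4)
print(classify("ACGTACG", index, k=4))
```

A k-mer shared by both toy genomes is stored at their common ancestor, so a read made up only of shared k-mers can be placed no lower than that rank. This is one way to see why resolution depends on the reference genomes available.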

  1. K-mers in Metagenomics, by Donovan Parks
  2. Metagenomics workflow: environmental sample → extract and sequence DNA → QC and error-correct reads (K-mers!) → assemble (K-mers!) → bin genomes (K-mers!) → assign taxonomy (and function) (K-mers!) → refine genomes (K-mers!)
  3. Assigning Taxonomic Labels to Metagenomic DNA Sequences
  4. A plethora of approaches
     • Homology: BLAST, MEGAN
     • Composition: Kraken, CLARK, Naïve Bayes
     • Hybrid: PhymmBL, FCP, PhyloPythia
     • Phylogenetic: Treephyler, AMPHORA, GraftM
     • Marker genes: 16S profiling, MetaPhlAn, PhyloSift
     [Side labels: classify all reads; classify subset]
  5. Exploiting genomic (K-mer) signatures
     • PhymmBL (K ≤ 8): interpolated Markov model
     • PhyloPythia (K ≈ 6): multiclass support vector machine
     • Naïve Bayes (K ≈ 15): probability of observing a K-mer
     • Kraken (K ≈ 31): exact K-mer matching
     • CLARK (K ≈ 31): exact matching of discriminative K-mers
     [Side labels: dense profiles; sparse profiles]
  6. Kraken: K-mer LCA database (Wood and Salzberg, Genome Biology, 2014)
     Reference genomes (2,256 RefSeq genomes) → extract K-mers (default, K = 31) → lowest common ancestor database
     K-mer         LCA
     ACC … GT      g__Escherichia
     ACG … GT      s__E. coli
     AGT … AA      p__Proteobacteria
     …
     TGA … TT      d__Bacteria
  7. Kraken: classification tree (Wood and Salzberg, Genome Biology, 2014)
  8. Assessment of methods (results from Ounit et al., BMC Genomics, 2015 and Wood and Salzberg, Genome Biology, 2014)
     Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
     Megablast              99.0           79.0             -
     Naïve Bayes (K = 15)   82.3           82.3             8
     Naïve Bayes (K = 11)   59.0           59.0             20
     PhymmBL                82.3           82.3             -
     CLARK                  99.3           77.2             3.1 million
     Kraken (K = 31)        99.3           77.8             2.3 million
     Kraken (K = 20)        80.2           82.7             1.5 million
     • Precision: (correct classifications) / (total classifications)
     • Sensitivity: (correct classifications) / (total reads)
     • Speed: reads per minute
     • Results for simple simulated dataset
  9. Impact of K and reference database size
     Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
     Megablast              99.0           79.0             -
     Naïve Bayes (K = 15)   82.3           82.3             8
     Naïve Bayes (K = 11)   59.0           59.0             20
     PhymmBL                82.3           82.3             -
     CLARK                  99.3           77.2             3.1 million
     Kraken (K = 31)        99.3           77.8             2.3 million
     Kraken (K = 20)        80.2           82.7             1.5 million
     Kraken-GB (K = 31)     99.5           93.8             -
     • Performance is sensitive to K
     • Kraken-GB: 8,517 reference genomes instead of 2,256
  10. Impact of taxonomic novelty (results from Wood and Salzberg, Genome Biology, 2014)
      Rows: measured rank. Columns: taxonomic novelty.
      Measured rank   Species   Genus   Family
      Domain          24.4      7.9     2.8
      Phylum          23.9      7.2     2.5
      Class           24.7      7.1     2.0
      Order           24.1      6.8     2.0
      Family          25.4      8.5     -
      Genus           26.3      -       -
      • Sensitivity decreases rapidly with taxonomic novelty
  11. Kraken: some practical numbers
      • Applied to metagenome from coalbed methane well
      • ~82 million paired-end reads (2 x 100 bp)
      • ~30 minutes to process with 8 threads
      • Reference database requires ~70 GB of RAM
      • Classified 7.7% of reads
      [Figure: relative abundance (%), 16S profile vs. Kraken]
  12. Take-away points
      • K-mers widely used to assign taxonomy to metagenomic reads
      • Active area of research
      • Resolution limited by reference genomes
      • 16S profiling still the gold standard
      • Change is coming…
  13. Recovering Population Genomes from Metagenomic Data
      [Diagram: metagenome → shotgun sequencing → reads → assembly → contigs → bin contigs into genomes (genome-centric metagenomics)]
  14. Recovering genomes from metagenomic data
      [Diagram: metagenome → shotgun sequencing → reads → assembly → contigs → binning (classify using coverage and k-mer profiles) → population genomes → identify strain-specific SNPs]
  15. Differential coverage signal: contigs with similar coverage profiles likely belong to the same genome!
  16. K-mers and coverage: complementary signals (microbial community from coalbed methane well)
      [Figure axes: coverage; tetranucleotide (PC1)]
      Genome                      Comp. (%)  Cont. (%)  Length (Mbp)
      Archaea
        Methanobacteriaceae 1     98.4       1.6        2.32
        Methanobacteriaceae 2     96.8       0.8        2.23
        Methanobacteriaceae 3     88.6       0.0        1.57
        Methanobacteriaceae 4     96.0       0.0        1.71
      Bacteria
        Actinobacteria 1          95.0       0.9        2.56
        Actinobacteria 2          90.5       2.7        2.72
        Actinobacteria 3          88.4       2.7        2.48
        Clostridiales 1           92.6       9.4        2.91
        Clostridiales 2           80.2       0.0        2.74
        Elusimicrobia             95.7       2.2        2.03
        Thermodesulfovibrionaceae 83.9       0.0        2.66
        Syntrophus                92.9       0.8        2.31
        Rikenellaceae             86.7       2.3        2.72
        Candidate Phylum OP1      83.9       0.0        1.66
        Rhodocyclaceae            69.0       1.63       3.73
  17. Many ways to combine coverage + K-mer profiles
      • GroopM: http://minillinim.github.io/GroopM/
      • DBB: https://github.com/dparks1134/DBB
      • CONCOCT: https://github.com/BinPro/CONCOCT
      • MetaWatt: http://sourceforge.net/projects/metawatt/
      • MetaBAT: https://bitbucket.org/berkeleylab/metabat
  18. MetaBAT overview (Kang et al., bioRxiv, 2014)
  19. MetaBAT: statistical model of tetranucleotide signatures (Kang et al., bioRxiv, 2014)
      • Empirical parameters from ~1500 reference genomes
      • Posterior probability that two contigs are from different genomes:
        $$P(\mathrm{inter} \mid D) = \frac{\alpha P(D \mid \mathrm{inter})}{\alpha P(D \mid \mathrm{inter}) + P(D \mid \mathrm{intra})}$$
      [Figure: contig size = 10 kb; x-axis: tetranucleotide distance, D; y-axis: probability, P(inter|D)]
  20. Rapidly filling out tree of life
      • 60 bacterial phyla
      • >3000 population genomes
      • 23 habitats
      • 51 phyla with population genome representatives
  21. Take-away points
      • Population genomes can be recovered from metagenomic samples
      • K-mer profiles complement differential coverage signal
      • Rapidly expanding reference genomes
      • Improve gene-centric metagenomics
  22. Assessing and Refining Population Genomes
  23. Estimating quality of population genomes (Parks et al., Genome Res., 2015)
      • Additional markers refine quality estimates
      • Scaffolds: Gammaproteobacteria sp., 80% complete, 20% contaminated
      • 105 bacterial marker genes estimate: 92% comp., 17% cont.
      • 281 clade-specific marker genes estimate: 83% comp., 22% cont.
      • Estimates ± 5%
  24. Varying quality of recovered genomes (microbial community from coalbed methane well)
      [Figure axes: coverage; tetranucleotide (PC1)]
      Genome                      Comp. (%)  Cont. (%)  Length (Mbp)
      Archaea
        Methanobacteriaceae 1     98.4       1.6        2.32
        Methanobacteriaceae 2     96.8       0.8        2.23
        Methanobacteriaceae 3     88.6       0.0        1.57
        Methanobacteriaceae 4     96.0       0.0        1.71
      Bacteria
        Actinobacteria 1          95.0       0.9        2.56
        Actinobacteria 2          90.5       2.7        2.72
        Actinobacteria 3          88.4       2.7        2.48
        Clostridiales 1           92.6       9.4        2.91
        Clostridiales 2           80.2       0.0        2.74
        Elusimicrobia             95.7       2.2        2.03
        Thermodesulfovibrionaceae 83.9       0.0        2.66
        Syntrophus                92.9       0.8        2.31
        Rikenellaceae             86.7       2.3        2.72
        Candidate Phylum OP1      83.9       0.0        1.66
        Rhodocyclaceae            69.0       1.63       3.73
  25. Identifying potential contamination: 95th percentile outliers… treat with caution
  26. K-mer modeling: impact of evolution
      [Figure panels: Bacteria vs. Archaea (intra-genome 95th percentile; K = 4); Classes of Proteobacteria (intra-genome 95th percentiles; K = 4)]
  27. Final thoughts
      • K-mers widely used in gene- and genome-centric metagenomics
      • Population genomes substantially improving diversity of available reference genomes
      • Big win for taxonomic attribution methods
      • And CheckM, and many other bioinformatic programs
      • How best to exploit population genomes
      • Looking at 100,000+ reference genomes in next few years
      • Issues in terms of scalability
      • Using ‘noisy’ population genomes raises interesting questions
  28. Thank you!
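
As a supplement to the binning slides (16 and 19), the following Python sketch shows one way to compute a tetranucleotide profile for a contig and to evaluate the slide-19 posterior P(inter|D). It rests on stated assumptions rather than on MetaBAT's code: the profile counts all 256 tetramers without merging reverse complements, the distance D is taken to be Euclidean, and the likelihoods P(D|inter) and P(D|intra) and the weight α are passed in as plain numbers because the slides give only their empirical origin, not their values. All function names are hypothetical.

```python
import math
from itertools import product

# All 256 tetranucleotides in a fixed order (assumption: no reverse-complement merging).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_profile(seq):
    """Relative frequency of each tetramer in seq, ordered as in TETRAMERS."""
    counts = dict.fromkeys(TETRAMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def euclidean_distance(p, q):
    """Distance D between two profiles (assumed Euclidean for this sketch)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def posterior_inter(lik_inter, lik_intra, alpha=1.0):
    """P(inter|D) = alpha*P(D|inter) / (alpha*P(D|inter) + P(D|intra))."""
    return alpha * lik_inter / (alpha * lik_inter + lik_intra)


# Two made-up contigs; real contigs would be thousands of bases long.
profile_a = tetranucleotide_profile("ACGTACGTACGGTTAACC")
profile_b = tetranucleotide_profile("GGGCCCGGGCCCGGGAAA")
d = euclidean_distance(profile_a, profile_b)
print(round(d, 3))

# Hypothetical likelihood values, for illustration only.
print(posterior_inter(lik_inter=0.2, lik_intra=0.8))
```

In a real binner the two likelihoods would come from distributions fitted to reference genomes, and the resulting probability would be combined with the coverage signal rather than used on its own.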
