Scaling Mt. Metagenome: Assembling very large data sets. C. Titus Brown, Assistant Professor, Computer Science and Engineering / Microbiology and Molecular Genetics, Michigan State University
Thanks for coming! Note: this talk is about the computational side of metagenome assembly, motivated by the Great Prairie Grand Challenge soil sequencing project. Jim Tiedje will talk about the project as a whole at the JGI User’s Meeting.
The basic problem. Lots of metagenomic sequence data (200 GB Illumina for < $20k?). Assembly, especially metagenome assembly, scales poorly (due to high diversity). Standard assembly techniques don’t work well with sequences from genomes of differing abundance. Many people don’t have the necessary computational resources to assemble (~1 TB of RAM or more, if at all).
We can’t just throw more hardware at the problem, either. Lincoln Stein
Jumping to the end: We have implemented a solution for these problems: Scalability of assembly, Lack of resources,  and parameter choice. We demonstrate this solution for a high diversity sample (219.1 Gb of Iowa corn field soil metagenome). …there is an additional surprise or two, so you should stick around!
Whole genome shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)
K-mer graphs - branching For decisions about which paths etc, biology-based heuristics come into play as well.
Too much data – what can we do? Reduce the size of the data (either with an approximate or an exact approach) Divide & conquer: subdivide the problem. For exact data reduction or subdivision, need to grok the entire assembly graph structure. …but that is why assembly scales poorly in the first place.
Abundance filtering. Approach used in two published Illumina metagenomic papers (MetaHIT/human microbiome and rumen papers). Remove or trim reads with low-abundance k-mers, which are either due to errors or to low-abundance organisms. Inexact data reduction: may or may not remove usable data. Works well for high-coverage data sets (rumen est. 56x!!). However, for low-coverage or high-diversity data sets, abundance filtering will reject potentially useful reads.
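A minimal sketch of abundance-based read trimming, for illustration only. It counts k-mers exactly with an in-memory `Counter`, which is an assumption made for readability and not the constant-memory approach the talk describes; the function names and the `min_count` threshold are hypothetical, and reverse complements are ignored.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Exact k-mer counts over all reads (fine for toy data only)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def trim_low_abundance(read, counts, k, min_count):
    """Truncate read just before the last base of its first rare k-mer."""
    for i, kmer in enumerate(kmers(read, k)):
        if counts[kmer] < min_count:
            # Keep only the bases covered by the preceding abundant k-mers.
            return read[:i + k - 1]
    return read


reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, k=4)
trimmed = [trim_low_abundance(r, counts, k=4, min_count=2) for r in reads]
```

With a low-coverage sample most true k-mers would fall under any such threshold, which is the failure mode the slide warns about.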
Two exact data reduction techniques: Eliminate reads that do not connect to many other reads. Group reads by connectivity into different partitions of the entire graph. For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
Eliminating unconnected reads “Graphsize filtering”
Subdividing reads by connection “Partitioning”
Two exact data reduction techniques: Eliminate reads that do not connect to many other reads (“graphsize filtering”). Group reads by connectivity into different partitions of the entire graph (“partitioning”). For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
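The two techniques can be sketched on toy data with a union-find structure. This is an illustration under stated assumptions, not the khmer implementation: two reads are treated as connected when they share at least one k-mer, the k-mer index is an ordinary dictionary held in memory, reverse complements are ignored, and `min_reads` is a hypothetical threshold standing in for whatever size criterion graphsize filtering actually applies.

```python
from collections import defaultdict


def partition_reads(reads, k):
    """Group read indices into connected components (largest first)."""
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_owner = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_owner:
                a, b = find(idx), find(first_owner[kmer])
                if a != b:
                    parent[a] = b  # merge the two components
            else:
                first_owner[kmer] = idx

    groups = defaultdict(list)
    for idx in range(len(reads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values(), key=len, reverse=True)


def graphsize_filter(reads, k, min_reads):
    """Return indices of reads whose partition has at least min_reads reads."""
    kept = [p for p in partition_reads(reads, k) if len(p) >= min_reads]
    return sorted(i for p in kept for i in p)


reads = ["AAAACCCC", "CCCCGGGG", "TTTTTTTT"]
partitions = partition_reads(reads, k=4)
survivors = graphsize_filter(reads, k=4, min_reads=2)
```

Each partition can then be handed to the assembler separately, since no k-mer path crosses from one partition to another.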
Engineering overview Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure; With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k. Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes). For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
Store graph nodes in Bloom filter Graph traversal is done in full k-mer space; Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
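A minimal sketch of the idea on this slide, assuming one hash value reduced modulo several table sizes and one byte per slot for readability (a real implementation would pack bits). The class name, table sizes, and use of SHA-1 are illustrative assumptions; reverse complements are ignored.

```python
import hashlib


class KmerBloom:
    """Presence/absence of k-mers in several hash tables without collision tracking."""

    def __init__(self, table_sizes):
        self.tables = [bytearray(size) for size in table_sizes]

    def _slots(self, kmer):
        digest = int.from_bytes(hashlib.sha1(kmer.encode()).digest(), "big")
        return [digest % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] = 1

    def __contains__(self, kmer):
        # "No" is always correct; "yes" may be a false positive.
        return all(t[s] for t, s in zip(self.tables, self._slots(kmer)))


def neighbors(bloom, kmer):
    """Enumerate the 8 possible adjacent k-mers and keep those reported present."""
    found = []
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand != kmer and cand in bloom and cand not in found:
                found.append(cand)
    return found


bloom = KmerBloom([1009, 1013, 1019])
read, k = "ACGTTGCA", 4
for i in range(len(read) - k + 1):
    bloom.add(read[i:i + k])
adjacent = neighbors(bloom, "CGTT")
```

No edges are stored: the graph is traversed by asking whether each candidate neighbor is present, which is why memory use is per k-mer and not per edge.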
Practical application Enables: graph trimming (exact removal) partitioning (exact subdivision) abundance filtering … all for K <= 64, for 200+ gb sequence collections. All results (except for comparison) obtained using a single Amazon EC2 4xlarge node, 68 GB of RAM / 8 cores. Similar running times to using Velvet alone.
We pre-filter data for assembly:
Does removing small graphs work? Small data set (35m reads / 3.4 gb rhizosphere soil sample). Filtered at k=32, assembled at k=33 with ABySS.
N contigs   Total bp   Largest contig   Data set
130         223,341    61,766           Unfiltered (35m reads)
130         223,341    61,766           Filtered (2m reads)
YES.
Does partitioning into disconnected graphs work? Partitioned same data set (35m reads / 3.5 gb) into 45k partitions containing > 10 reads; assembled partitions separately (k0=32, k=33).
N contigs   Total bp   Largest contig   Data set
130         223,341    61,766           Unfiltered (35m reads)
130         223,341    61,766           Sum of partitions
YES.
Data reduction for assembly / practical details. Reduction performed on machine with 16 gb of RAM.
Removing poorly connected reads: 35m -> 2m reads.
- Memory required reduced from 40 gb to 2 gb;
- Time reduced from 4 hrs to 20 minutes.
Partitioning reads into disconnected groups:
- Biggest group is 300k reads;
- Memory required reduced from 40 gb to 500 mb;
- Time reduced from 4 hrs to < 5 minutes/group.
Does it work on bigger data sets?
35 m read data set partition sizes: P1: 277,043 reads; P2: 5776 reads; P3: 4444 reads; P4: 3513 reads; P5: 2528 reads; P6: 2397 reads; …
Iowa continuous corn GA2 partitions (218.5 m reads): P1: 204,582,365 reads; P2: 3583 reads; P3: 2917 reads; P4: 2463 reads; P5: 2435 reads; P6: 2316 reads; …
Problem: big data sets have one big partition!? Too big to handle on EC2. Assembles with low coverage. Contains 2.5 bn unique k-mers (~500 microbial genomes), at ~3-5x coverage. As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble! Both for our approach, and possibly for assembly in general (because it assembles more poorly than it should, for given coverage/size).
Why this lump? Real biological connectivity (rRNA, conserved genes, etc.) Bug in our software Sequencing artifact or error
Why this lump? Real biological connectivity? Probably not. - Increasing K from 32 to ~64 didn’t break up the lump: not biological. Bug in our software? Probably not. Sequencing artifact or error? YES. - (Note, we do filter & quality trim all sequences already.)
“Good” vs “bad” assembly graph Low density High density
Non-biological levels of local graph connectivity:
Higher local graph density correlates with position in read
Higher local graph density correlates with position in read ARTIFACT
Trimming reads Trim at high “soddd”, sum of degree degree distribution: From each k-mer in each read, walk two k-mers in all directions in the graph; If more than 3 k-mers can be found at exactly two steps, trim remainder of sequence. Overly stringent; actually trimming (k-1) connectivity graph by degree.
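The rule above can be sketched as follows. This is a toy reading of the slide, not the authors' code: the graph is a plain Python set of k-mers, "in all directions" is taken to mean both predecessors and successors, reverse complements are ignored, and the choice to cut the read just before the last base of the offending k-mer is an assumption. All function names are hypothetical.

```python
def kmer_neighbors(kmer, nodes):
    """Adjacent k-mers (both directions) that are present in the node set."""
    out = set()
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand != kmer and cand in nodes:
                out.add(cand)
    return out


def count_at_two_steps(kmer, nodes):
    """Number of k-mers whose shortest distance from kmer is exactly two."""
    one = kmer_neighbors(kmer, nodes)
    two = set()
    for n in one:
        two |= kmer_neighbors(n, nodes)
    return len(two - one - {kmer})


def density_trim(read, nodes, k, max_two_step=3):
    """Trim the read at the first k-mer with more than max_two_step k-mers two steps away."""
    for i in range(len(read) - k + 1):
        if count_at_two_steps(read[i:i + k], nodes) > max_two_step:
            return read[:i + k - 1]
    return read


# A simple chain is left alone.
chain = "ACGTTGCAAT"
chain_nodes = {chain[i:i + 4] for i in range(len(chain) - 3)}
untouched = density_trim(chain, chain_nodes, k=4)

# A k-mer that fans out four ways creates a dense neighborhood.
knot_nodes = {"TTG", "TGA", "GAA", "AAC", "ACG", "ACA", "ACC", "ACT"}
trimmed = density_trim("TTGAACG", knot_nodes, k=3)
```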
Trimmed read examples
>895:5:1:1986:16019/2
TGAGCACTACCTGCGGGCCGGGGACCGGGTCAGCCTGCT
CGACCTGGGCCAACCGATGCGCC
>895:5:1:1995:6913/1
TTGCGCGCCATGAAGCGGTTAACGCGCTCGGTCCATAGC
GCGATG
>895:5:1:1995:6913/2
GTTCATCGCGCTATGGACCGAGCGCGTTAACCGCTTCAT
GGCGCGCAAAGATCGGAAGAGCGTCGTGTAG
Preferential attachment due to bias Any sufficiently large collection of connected reads will have one or more reads containing an artifact; These artifacts will then connect that group of reads to all other groups possessing artifacts; …and all high-coverage contigs will amalgamate into a single graph.
Artifacts from sequencing falsely connect graphs
Groxel view of knot-like region / Arend Hintze
Density trimming breaks up the lump:
Old P1, soddd-trimmed (204.6 m reads -> 179 m): P1: 23,444,332 reads; P2: 60,703 reads; P3: 48,818 reads; P4: 39,755 reads; P5: 34,902 reads; P6: 33,284 reads; …
Untrimmed partitioning (218.5 m reads): P1: 204,582,365 reads; P2: 3583 reads; P3: 2917 reads; P4: 2463 reads; P5: 2435 reads; P6: 2316 reads; …
What does density trimming do to assembly? 204 m reads in lump: assembles into 52,610 contigs; total 73.5 MB. 180 m reads in trimmed lump: assembles into 57,135 contigs; total 83.6 MB (all contigs > 1kb). Filtered/partitioned @ k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
Wait, what? Yes, trimming these “knot-like” sequences improves the overall assembly! We remove 25.6 m reads and gain 10.1 MB!? Trend is the same for ABySS, another k-mer graph assembler, as well.
Is this a valid assembly? Paired-end usage is good. 50% of contigs have BLASTX hit better than 1e-20 in Swissprot; 75% of contigs have BLASTX hit better than 1e-20 in TrEMBL. Reference genomes sequenced by JGI: Frateuria aurantia: 1376 hits > 100 aa; Saprospira grandis: 1114 hits > 100 aa (> 50% identity over > 50% of gene).
So what’s going on? Current assemblers are bad at dealing with certain graph structures (“knots”). If we can untangle knots for them, that’s good, maybe? Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves? Happens with other k-mer graph assemblers (ABySS), and with at least one other (non-metagenomic) data set.
OK, let’s assemble! Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to: 148,053 contigs, in 220 MB; max length 20322; max coverage ~10x. …all done on Amazon EC2, ~ 1 week for under $500. Filtered/partitioned @ k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
Full Iowa corn / mapping stats. 1,806,800,000 QC/trimmed reads (1.8 bn); 204,900,000 reads map to some contig (11%); 37,244,000 reads map to contigs > 1kb (2.1%). > 1 kb contig is a stringent criterion! Compare: 80% of MetaHIT reads to > 500 bp; 65%+ of rumen reads to > 1kb.
Percentage mapped vs. contig size
High coverage partitions assemble more reads
Success, tentatively. We are still evaluating assembly and assembly parameters; should be possible to improve in every way. (~10 hrs to redo entire assembly, once partitioned.) The main engineering point is that we can actually run this entire pipeline on a relatively small machine (8 core/68 GB RAM). We can do dozens of these in parallel on Amazon rental hardware. And, from our preliminary results, we get roughly the same assembly results as if we were scaling our hardware.
Optimizing per-partition assembly Metagenomes contain mixed-abundance genomes. Current assemblers are not built for mixed-abundance samples (problem with mRNAseq, too). Repeat resolution Error/edge trimming Since we’re breaking the data set into multiple partitions containing reads that may assemble together, can we optimize assembler parameters (k, coverage) for each partition?
Mixing parameters improves assembly statistics Objective function: maximize sum(contigs > 1kb) 4.5x average coverage– gained 228 contigs/469 kb 	(over 152/215 kb) 5.8x average coverage – gained 78 contigs/304 kb 	(over 248/708 kb) 8.2x average coverage – lost 58 contigs /gained 116 kb 	(over 279/803 kb)
Conclusions Engineering: can assemble large data sets. Scaling: can assemble on rented machines. Science: can optimize assembly for individual partitions. Science: retain low-abundance.
Caveats Quality of assembly?? Illumina sequencing bias/error issue needs to be explored. Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs. Need to better analyze upper limits of data structures. Have not applied our approaches to high-coverage data yet; in progress.
Future thoughts. Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers, so it is a good first step to try, even if it doesn’t reduce the problem significantly. Divide & conquer approach should allow more sophisticated (compute intensive) graph analysis approaches in the future. This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence. Can k-mer filtering work for non-de Bruijn graph assemblers? (SGA, ALLPATHS-LG, …)
Acknowledgements. The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze. Collaborators: Jim Tiedje (Il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU). Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
A guide to khmer Python wrapping C++; BSD license. Tools for: K-mer abundance filtering (constant mem; inexact) Assembly graph size filtering (constant mem; exact) Assembly graph partitioning (exact) Error trimming (constant mem; inexact) Still in alpha form… undocumented, esp.
k-mer coverage by partition
Abundance filtering affects low-coverage contigs dramatically
Many read pairs map together
Bonus slides How much more do we need to sequence, anyway??
Calculating expected k-mer numbers Entire population S1 S2 Note: no simple way to correct abundance bias, so we don’t, yet.
Coverage estimates (based on k-mer mark/recapture analysis)
Iowa prairie (136 GB): est. 1.26x
Iowa corn (62 GB): est. 0.86x
Wisconsin corn (190 GB): est. 2.17x
For comparison, the panda genome assembly used ~50x with short reads. (Qingpeng Zhang)
Coverage estimates: getting to 50x…
Human: -> 150 GB for 50x
Iowa prairie (136 GB): est. 1.26x -> 5.4 TB for 50x
Iowa corn (62 GB): est. 0.86x -> 3.6 TB for 50x
Wisconsin corn (190 GB): est. 2.17x -> 4.4 TB for 50x
…note that it’s not clear what “coverage” exactly means in this case, since 16s-estimated diversity is very high.
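The 50x figures follow from simple proportional scaling, sketched below. The function name is hypothetical, and the sketch assumes that coverage grows linearly with the amount of sequence and that 1 TB = 1000 GB.

```python
def data_needed_gb(sequenced_gb, estimated_coverage, target_coverage=50.0):
    """Sequence volume (GB) needed to reach target_coverage, scaling linearly."""
    return sequenced_gb * target_coverage / estimated_coverage


# (GB sequenced so far, estimated coverage) as listed on the slide.
samples = {
    "Iowa prairie": (136, 1.26),
    "Iowa corn": (62, 0.86),
    "Wisconsin corn": (190, 2.17),
}

needed_tb = {
    name: round(data_needed_gb(gb, cov) / 1000, 1)
    for name, (gb, cov) in samples.items()
}
```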
What does coverage mean here? “Unseen” sequence: 1x ~ 37%; 2x ~ 14%; 5x ~ 0.7%; 10x ~ 0.005%; 50x ~ 2e-20%. For metagenomes, coverage is of abundance-weighted DNA.
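The percentages at 1x, 2x, 5x and 50x match exp(-c), the probability that a base is never sampled when read depth at each position is Poisson-distributed with mean c. The sketch below assumes that model; the slide itself does not state which formula was used.

```python
import math


def unseen_fraction(coverage):
    """Fraction of sequence expected to be missed at a given mean coverage."""
    return math.exp(-coverage)


# Expected unseen percentage at each coverage level from the slide.
unseen_percent = {c: 100 * unseen_fraction(c) for c in (1, 2, 5, 10, 50)}
```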
CAMERA annotation of full set of contigs (> 1000 bp)
# of ORFs: 344,661 (Metagene)
Longest ORF: 1,974 bp
Shortest ORF: 20 bp
Average ORF: 173 bp
# of COG hits: 153,138 (e-value < 0.001)
# of Pfam hits: 170,072
# of TIGRfam hits: 315,776
CAMERA COG Summary
The k-mer oracle Q: is this k-mer present in the data set? A: no => then it is not. A: yes => it may or may not be present. This lets us store k-mers efficiently.
Building on the k-mer oracle: Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
The k-mer graph oracle Q: does this k-mer overlap with this other k-mer? A: no => then it does not, guaranteed. A: yes => it may or may not. This lets us traverse de Bruijn graphs efficiently.
The contig size oracle Q: could this read contribute to a contig bigger than N? A: no => then it does not, guaranteed. A: yes => then it might. This lets us eliminate reads that do not belong to “big” contigs.
The read partition oracle Q: does this read connect to this other read in any way? A: no => then it does not, guaranteed. A: yes => then it might. This lets us subdivide the assembly problem into many smaller, disconnected problems that are much easier.
Oracular fact All of these oracles are cheap, can yield answers from a different probability distribution, and can be “chained” together (so you can keep on asking oracles for as long as you want, and get more and more accurate).
Implementing a basic k-mer oracle Conveniently, perhaps the simplest data structure in computer science is what we need… …a hash table that ignores collisions. Note, P(false positive) = fractional occupancy.
A more reliable k-mer oracle Use a Bloom filter approach – multiple oracles, in serial, are multiplicatively more reliable.
Scaling the k-mer oracle Adding additional filters increases discrimination at the cost of speed. This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
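The tradeoff can be made concrete with a small calculation. It assumes the tables use independent hash functions, so that a false positive requires a collision in every table and the per-table rates (each equal to that table's fractional occupancy) multiply; the function names are hypothetical.

```python
from math import prod


def chained_false_positive_rate(occupancies):
    """Overall false positive rate when every table must answer 'yes'."""
    return prod(occupancies)


def tables_needed(occupancy, target_rate):
    """Smallest number of equally full tables whose combined rate is <= target_rate."""
    if not 0 < occupancy < 1:
        raise ValueError("occupancy must be strictly between 0 and 1")
    count, rate = 0, 1.0
    while rate > target_rate:
        count += 1
        rate *= occupancy
    return count


three_sparse = chained_false_positive_rate([0.1, 0.1, 0.1])
half_full = tables_needed(0.5, 0.01)
```

Larger tables lower each occupancy (more memory), while more tables add lookups per query (more computation).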
The k-mer oracle, revisited We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately. This implicitly lets us store the graph structure, too!
B. Partitioning graphs into disconnected subgraphs Which nodes do not connect to each other?
Partitioning graphs – it’s easy looking Which nodes do not connect to each other?
But partitioning big graphs is expensive Requires exhaustive exploration.
Tabu search – avoid global searches
Tabu search – systematic local exploration
Strategies for completing big searches…
Hard-to-traverse graphs are well-connected
Add neighborhood-exclusion to tabu search
Exclusion strategy lets you systematically explore big graphs with a local algorithm
Potential problems Our oracle can mistakenly connect clusters.
Potential problems This is a problem if the rate is sufficiently high!
However, the error is one-sided: Graphs will never be erroneously disconnected
The error is one-sided: Nodes will never be erroneously disconnected. This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers. This, in turn, lets us reliably partition graphs into smaller graphs.
Actual implementation

Weitere ähnliche Inhalte

Andere mochten auch

Extra Credit: Eye Tracking Finance Websites
Extra Credit: Eye Tracking Finance WebsitesExtra Credit: Eye Tracking Finance Websites
Extra Credit: Eye Tracking Finance WebsitesJennifer Hsieh
 
Manduca
ManducaManduca
Manducanbmro
 
NABE Communications Section - Event Program
NABE Communications Section - Event ProgramNABE Communications Section - Event Program
NABE Communications Section - Event ProgramStephanie Abbott
 
Maximise Software Investment In Uncertain Times
Maximise Software Investment In Uncertain TimesMaximise Software Investment In Uncertain Times
Maximise Software Investment In Uncertain TimesKristina O'Regan
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizonac.titus.brown
 
ROI, magic bullets and social business
ROI, magic bullets and social businessROI, magic bullets and social business
ROI, magic bullets and social businessNiall O'Malley
 
Homework, Term 3 & 4
Homework, Term 3 & 4Homework, Term 3 & 4
Homework, Term 3 & 4Takahe One
 
Presentation Teknisa
Presentation TeknisaPresentation Teknisa
Presentation Teknisaguestf98a87
 
VAFF 2014 sponsorship & partnership
VAFF 2014 sponsorship & partnershipVAFF 2014 sponsorship & partnership
VAFF 2014 sponsorship & partnershipreginal97
 
CosmoSales: Integrated Digital Sales Strategy [Global Version]
CosmoSales: Integrated Digital Sales Strategy [Global Version]CosmoSales: Integrated Digital Sales Strategy [Global Version]
CosmoSales: Integrated Digital Sales Strategy [Global Version]Khomeini Mujahid
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynotec.titus.brown
 
Amadeus Sindaco Viene Attaccato Da Dei Dissidenti
Amadeus Sindaco Viene Attaccato Da Dei DissidentiAmadeus Sindaco Viene Attaccato Da Dei Dissidenti
Amadeus Sindaco Viene Attaccato Da Dei DissidentiMaurizio Repetto
 
Enlightenment
EnlightenmentEnlightenment
EnlightenmentGregorio
 
Manual Book - Telkomsel Care Applications
Manual Book - Telkomsel Care ApplicationsManual Book - Telkomsel Care Applications
Manual Book - Telkomsel Care ApplicationsKhomeini Mujahid
 
3835 N Greenview #1
3835 N Greenview #13835 N Greenview #1
3835 N Greenview #1bamadogg
 
Arbitrator Subpoenas: Are They Worth The Paper They Are Printed On?
Arbitrator Subpoenas: Are They Worth The Paper They Are Printed On?Arbitrator Subpoenas: Are They Worth The Paper They Are Printed On?
Arbitrator Subpoenas: Are They Worth The Paper They Are Printed On?Kegler Brown Hill + Ritter
 
Tutor Inservice (Health Literacy), May 2009
Tutor Inservice (Health Literacy), May 2009Tutor Inservice (Health Literacy), May 2009
Tutor Inservice (Health Literacy), May 2009Sarah Halstead
 
Talk at 2012 Notre Dame Collab Computing Lab workshop
Talk at 2012 Notre Dame Collab Computing Lab workshopTalk at 2012 Notre Dame Collab Computing Lab workshop
Talk at 2012 Notre Dame Collab Computing Lab workshopc.titus.brown
 

Andere mochten auch (20)

Extra Credit: Eye Tracking Finance Websites
Extra Credit: Eye Tracking Finance WebsitesExtra Credit: Eye Tracking Finance Websites
Extra Credit: Eye Tracking Finance Websites
 
Manduca
ManducaManduca
Manduca
 
NABE Communications Section - Event Program
NABE Communications Section - Event ProgramNABE Communications Section - Event Program
NABE Communications Section - Event Program
 
Maximise Software Investment In Uncertain Times
Maximise Software Investment In Uncertain TimesMaximise Software Investment In Uncertain Times
Maximise Software Investment In Uncertain Times
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
 
ROI, magic bullets and social business
ROI, magic bullets and social businessROI, magic bullets and social business
ROI, magic bullets and social business
 
Homework, Term 3 & 4
Homework, Term 3 & 4Homework, Term 3 & 4
Homework, Term 3 & 4
 
Presentation Teknisa
Presentation TeknisaPresentation Teknisa
Presentation Teknisa
 
VAFF 2014 sponsorship & partnership
VAFF 2014 sponsorship & partnershipVAFF 2014 sponsorship & partnership
VAFF 2014 sponsorship & partnership
 
CosmoSales: Integrated Digital Sales Strategy [Global Version]
CosmoSales: Integrated Digital Sales Strategy [Global Version]CosmoSales: Integrated Digital Sales Strategy [Global Version]
CosmoSales: Integrated Digital Sales Strategy [Global Version]
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
Amadeus Sindaco Viene Attaccato Da Dei Dissidenti
Amadeus Sindaco Viene Attaccato Da Dei DissidentiAmadeus Sindaco Viene Attaccato Da Dei Dissidenti
Amadeus Sindaco Viene Attaccato Da Dei Dissidenti
 
Litigation 101: Depositions
Litigation 101: DepositionsLitigation 101: Depositions
Litigation 101: Depositions
 
Enlightenment
EnlightenmentEnlightenment
Enlightenment
 
Manual Book - Telkomsel Care Applications
Manual Book - Telkomsel Care ApplicationsManual Book - Telkomsel Care Applications
Manual Book - Telkomsel Care Applications
 
3835 N Greenview #1
3835 N Greenview #13835 N Greenview #1
3835 N Greenview #1
 
Matchmoving Introduction
Matchmoving IntroductionMatchmoving Introduction
Matchmoving Introduction
 
Arbitrator Subpoenas: Are They Worth The Paper They Are Printed On?
Arbitrator Subpoenas: Are They Worth The Paper They Are Printed On?Arbitrator Subpoenas: Are They Worth The Paper They Are Printed On?
Arbitrator Subpoenas: Are They Worth The Paper They Are Printed On?
 
Tutor Inservice (Health Literacy), May 2009
Tutor Inservice (Health Literacy), May 2009Tutor Inservice (Health Literacy), May 2009
Tutor Inservice (Health Literacy), May 2009
 
Talk at 2012 Notre Dame Collab Computing Lab workshop
Talk at 2012 Notre Dame Collab Computing Lab workshopTalk at 2012 Notre Dame Collab Computing Lab workshop
Talk at 2012 Notre Dame Collab Computing Lab workshop
 

Ähnlich wie Climbing Mt. Metagenome

Scaling metagenome assembly
Scaling metagenome assemblyScaling metagenome assembly
Scaling metagenome assemblyc.titus.brown
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011c.titus.brown
 
Ngs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesNgs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesScott Edmunds
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4c.titus.brown
 
Probabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphsProbabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphsc.titus.brown
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshopc.titus.brown
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08Computer Science Club
 
2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocolsc.titus.brown
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Spark Summit
 
MrBayes_intro_big4ws_2016-10-10
MrBayes_intro_big4ws_2016-10-10MrBayes_intro_big4ws_2016-10-10
MrBayes_intro_big4ws_2016-10-10FredrikRonquist
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
SyMAP Master's Thesis Presentation
SyMAP Master's Thesis PresentationSyMAP Master's Thesis Presentation
SyMAP Master's Thesis Presentationaustinps
 
02.cnn - CNN 파헤치기 3탄
02.cnn - CNN 파헤치기 3탄02.cnn - CNN 파헤치기 3탄
02.cnn - CNN 파헤치기 3탄Jeong-gyu Kim
 
Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561Shobhit Saxena
 
End of Sprint 5
End of Sprint 5End of Sprint 5
End of Sprint 5dm_work
 
EOS5 Demo
EOS5 DemoEOS5 Demo
EOS5 Demodm_work
 
Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsBioinformaticsInstitute
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Databricks
 

Ähnlich wie Climbing Mt. Metagenome (20)

Scaling metagenome assembly
Scaling metagenome assemblyScaling metagenome assembly
Scaling metagenome assembly
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
 
Ngs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesNgs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challenges
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
Probabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphsProbabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphs
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08
 
2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocols
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
 
MrBayes_intro_big4ws_2016-10-10
MrBayes_intro_big4ws_2016-10-10MrBayes_intro_big4ws_2016-10-10
MrBayes_intro_big4ws_2016-10-10
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
SyMAP Master's Thesis Presentation
SyMAP Master's Thesis PresentationSyMAP Master's Thesis Presentation
SyMAP Master's Thesis Presentation
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
02.cnn - CNN 파헤치기 3탄
02.cnn - CNN 파헤치기 3탄02.cnn - CNN 파헤치기 3탄
02.cnn - CNN 파헤치기 3탄
 
Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561
 
End of Sprint 5
End of Sprint 5End of Sprint 5
End of Sprint 5
 
EOS5 Demo
EOS5 DemoEOS5 Demo
EOS5 Demo
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphs
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
 

Mehr von c.titus.brown

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-reviewc.titus.brown
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcastc.titus.brown
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbugc.titus.brown
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenomec.titus.brown
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformaticsc.titus.brown
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streamingc.titus.brown
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 

Mehr von c.titus.brown (20)

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 



Climbing Mt. Metagenome

  • 1. Scaling Mt. Metagenome: Assembling very large data sets. C. Titus Brown, Assistant Professor, Computer Science and Engineering / Microbiology and Molecular Genetics, Michigan State University
  • 2. Thanks for coming! Note: this talk is about the computational side of metagenome assembly, motivated by the Great Prairie Grand Challenge soil sequencing project. Jim Tiedje will talk about the project as a whole at the JGI User’s Meeting.
  • 3. The basic problem. Lots of metagenomic sequence data (200 GB Illumina for < $20k?) Assembly, especially metagenome assembly, scales poorly (due to high diversity). Standard assembly techniques don’t work well with sequences from genomes of differing abundance. Many people don’t have the necessary computational resources to assemble (~1 TB of RAM or more, if at all).
  • 4. We can’t just throw more hardware at the problem, either. Lincoln Stein
  • 5. Jumping to the end: We have implemented a solution for these problems: Scalability of assembly, Lack of resources, and parameter choice. We demonstrate this solution for a high diversity sample (219.1 Gb of Iowa corn field soil metagenome). …there is an additional surprise or two, so you should stick around!
  • 6. Whole genome shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  • 7. K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)
  • 8. K-mer graphs - branching For decisions about which paths etc, biology-based heuristics come into play as well.
  • 9. Too much data – what can we do? Reduce the size of the data (either with an approximate or an exact approach) Divide & conquer: subdivide the problem. For exact data reduction or subdivision, need to grok the entire assembly graph structure. …but that is why assembly scales poorly in the first place.
  • 13. Abundance filtering Approach used in two published Illumina metagenomic papers (MetaHIT/human microbiome and rumen papers) Remove or trim reads with low-abundance k-mers Either due to errors, or low-abundance organisms. Inexact data reduction: may or may not remove usable data. Works well for high-coverage data sets (rumen est. 56x!!) However, for low-coverage or high-diversity data sets, abundance filtering will reject potentially useful reads.
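As a rough illustration of the abundance filtering described on this slide, the sketch below counts every k-mer exactly and keeps only reads whose k-mers all reach a minimum count. The function names, the `min_count` parameter, and the toy reads are hypothetical; the published pipelines may trim rather than drop reads, and this sketch ignores reverse complements.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def abundance_filter(reads, k, min_count):
    """Keep reads in which every k-mer occurs at least min_count times
    across the whole read set (exact counting, for illustration only)."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [read for read in reads
            if all(counts[km] >= min_count for km in kmers(read, k))]


# Toy data: the second read carries one unique k-mer, the third is unrelated.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
kept = abundance_filter(reads, k=4, min_count=2)
```

In this toy case the third read is discarded even though it contains no error, which is the risk the slide describes for low-coverage or high-diversity data.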
  • 15. Two exact data reduction techniques: Eliminate reads that do not connect to many other reads. Group reads by connectivity into different partitions of the entire graph. For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
  • 16. Eliminating unconnected reads “Graphsize filtering”
  • 17. Subdividing reads by connection “Partitioning”
  • 18. Two exact data reduction techniques: Eliminate reads that do not connect to many other reads (“graphsize filtering”). Group reads by connectivity into different partitions of the entire graph (“partitioning”). For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
  • 19. Engineering overview Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure; With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k. Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes). For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
  • 20. Store graph nodes in Bloom filter Graph traversal is done in full k-mer space; Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
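A minimal sketch of the idea on the two slides above: node presence lives in a few hash tables that ignore collisions, and graph edges are never stored but are rediscovered by testing the eight possible neighbouring k-mers. The class name, the table sizes, and the use of SHA-1 are assumptions made for illustration and are not taken from khmer.

```python
import hashlib


class KmerPresence:
    """Presence/absence store for k-mers: several tables, no collision tracking."""

    def __init__(self, table_sizes):
        # One byte per slot for readability; a real filter would pack bits.
        self.tables = [bytearray(size) for size in table_sizes]

    def _slots(self, kmer):
        digest = int.from_bytes(hashlib.sha1(kmer.encode()).digest(), "big")
        # Different table sizes give a different slot in each table.
        return [digest % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] = 1

    def __contains__(self, kmer):
        # "Present" only if every table agrees; false positives are possible,
        # false negatives are not.
        return all(table[slot] for table, slot in zip(self.tables, self._slots(kmer)))


def neighbors(store, kmer):
    """Traverse in full k-mer space: try all 8 overlapping k-mers, keep those present."""
    found = set()
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate != kmer and candidate in store:
                found.add(candidate)
    return sorted(found)


store = KmerPresence(table_sizes=[1009, 1013])
read, k = "ACGTTGCA", 4
for i in range(len(read) - k + 1):
    store.add(read[i:i + k])
```

Because only presence bits are kept, memory per k-mer does not depend on k, which fits the "arbitrary k" remark on the engineering overview slide.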
  • 21. Practical application Enables: graph trimming (exact removal) partitioning (exact subdivision) abundance filtering … all for K <= 64, for 200+ gb sequence collections. All results (except for comparison) obtained using a single Amazon EC2 4xlarge node, 68 GB of RAM / 8 cores. Similar running times to using Velvet alone.
  • 22. We pre-filter data for assembly:
  • 23. Does removing small graphs work? Small data set (35m reads / 3.4 gb rhizosphere soil sample). Filtered at k=32, assembled at k=33 with ABySS. Unfiltered (35m reads): 130 contigs, 223,341 total bp, largest contig 61,766 bp. Filtered (2m reads): 130 contigs, 223,341 total bp, largest contig 61,766 bp. YES.
  • 24. Does partitioning into disconnected graphs work? Partitioned same data set (35m reads / 3.5 gb) into 45k partitions containing > 10 reads; assembled partitions separately (k0=32, k=33). Unfiltered (35m reads): 130 contigs, 223,341 total bp, largest contig 61,766 bp. Sum of partitions: 130 contigs, 223,341 total bp, largest contig 61,766 bp. YES.
  • 25. Data reduction for assembly / practical details Reduction performed on machine with 16 gb of RAM. Removing poorly connected reads: 35m -> 2m reads. - Memory required reduced from 40 gb to 2 gb; - Time reduced from 4 hrs to 20 minutes. Partitioning reads into disconnected groups: - Biggest group is 300k reads - Memory required reduced from 40 gb to 500 mb; - Time reduced from 4 hrs to < 5 minutes/group.
  • 26. Does it work on bigger data sets? 35 m read data set partition sizes: P1: 277,043 reads P2: 5776 reads P3: 4444 reads P4: 3513 reads P5: 2528 reads P6: 2397 reads … Iowa continuous corn GA2 partitions (218.5 m reads): P1: 204,582,365 reads P2: 3583 reads P3: 2917 reads P4: 2463 reads P5: 2435 reads P6: 2316 reads …
  • 27. Problem: big data sets have one big partition!? Too big to handle on EC2. Assembles with low coverage. Contains 2.5 bn unique k-mers (~500 microbial genomes), at ~3-5x coverage. As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble! Both for our approach, And possibly for assembly in general (because it assembles more poorly than it should, for given coverage/size)
  • 28. Why this lump? Real biological connectivity (rRNA, conserved genes, etc.) Bug in our software Sequencing artifact or error
  • 30. “Good” vs “bad” assembly graph Low density High density
  • 31. Non-biological levels of local graph connectivity:
  • 32. Higher local graph density correlates with position in read
  • 33. Higher local graph density correlates with position in read ARTIFACT
  • 34. Trimming reads Trim at high “soddd”, sum of degree degree distribution: From each k-mer in each read, walk two k-mers in all directions in the graph; If more than 3 k-mers can be found at exactly two steps, trim remainder of sequence. Overly stringent; actually trimming (k-1) connectivity graph by degree.
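The trimming rule on this slide can be sketched roughly as follows, using an exact Python set of k-mers in place of the Bloom-filter graph. The function names and the toy graph are hypothetical, and two details are assumptions because the slide leaves them open: "exactly two steps" is read as excluding the starting k-mer and its direct neighbours, and the read is cut just before the last base of the first dense k-mer.

```python
def neighbors(kmer, graph):
    """All k-mers in graph that overlap kmer by k-1 bases, in either direction."""
    found = set()
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate != kmer and candidate in graph:
                found.add(candidate)
    return found


def count_at_two_steps(kmer, graph):
    """Number of distinct k-mers reachable in two steps but not in zero or one."""
    first = neighbors(kmer, graph)
    second = set()
    for node in first:
        second |= neighbors(node, graph)
    return len(second - first - {kmer})


def trim_read(read, graph, k, max_at_two=3):
    """Cut the read at the first k-mer whose two-step neighbourhood is too dense."""
    for i in range(len(read) - k + 1):
        if count_at_two_steps(read[i:i + k], graph) > max_at_two:
            # Assumption: drop the last base of the dense k-mer and everything after.
            return read[:i + k - 1]
    return read


# Toy graph: AACG fans out into four successors, two steps away from CAAC.
graph = {"TCAA", "CAAC", "AACG", "ACGA", "ACGC", "ACGG", "ACGT"}
trimmed = trim_read("TCAACG", graph, k=4)
```

A read lying on a simple linear path never sees more than two k-mers at distance two, so it passes through untouched.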
  • 35. Trimmed read examples >895:5:1:1986:16019/2 TGAGCACTACCTGCGGGCCGGGGACCGGGTCAGCCTGCT CGACCTGGGCCAACCGATGCGCC >895:5:1:1995:6913/1 TTGCGCGCCATGAAGCGGTTAACGCGCTCGGTCCATAGC GCGATG >895:5:1:1995:6913/2 GTTCATCGCGCTATGGACCGAGCGCGTTAACCGCTTCAT GGCGCGCAAAGATCGGAAGAGCGTCGTGTAG
  • 36. Preferential attachment due to bias Any sufficiently large collection of connected reads will have one or more reads containing an artifact; These artifacts will then connect that group of reads to all other groups possessing artifacts; …and all high-coverage contigs will amalgamate into a single graph.
  • 37. Artifacts from sequencing falsely connect graphs
  • 38. Preferential attachment due to bias Any sufficiently large collection of connected reads will have one or more reads containing an artifact; These artifacts will then connect that group of reads to all other groups possessing artifacts; …and all high-coverage contigs will amalgamate into a single graph.
  • 39. Groxel view of knot-like region / Arend Hintze
  • 40. Density trimming breaks up the lump: Old P1, soddd trimmed (204.6 m reads -> 179 m): P1: 23,444,332 reads P2: 60,703 reads P3: 48,818 reads P4: 39,755 reads P5: 34,902 reads P6: 33,284 reads … Untrimmed partitioning (218.5 m reads): P1: 204,582,365 reads P2: 3583 reads P3: 2917 reads P4: 2463 reads P5: 2435 reads P6: 2316 reads …
  • 41. What does density trimming do to assembly? 204 m reads in lump: assembles into 52,610 contigs; total 73.5 MB. 180 m reads in trimmed lump: assembles into 57,135 contigs; total 83.6 MB (all contigs > 1kb). Filtered/partitioned @ k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
  • 42. Wait, what? Yes, trimming these “knot-like” sequences improves the overall assembly! We remove 25.6 m reads and gain 10.1 MB!? Trend is same for ABySS, another k-mer graph assembler, as well.
  • 43. Is this a valid assembly? Paired-end usage is good. 50% of contigs have BLASTX hit better than 1e-20 in Swissprot; 75% of contigs have BLASTX hit better than 1e-20 in TrEMBL; Reference genomes sequenced by JGI: Frateuria aurantia: 1376 hits > 100 aa. Saprospira grandis: 1114 hits > 100 aa (> 50% identity over > 50% of gene)
  • 44. So what’s going on? Current assemblers are bad at dealing with certain graph structures (“knots”). If we can untangle knots for them, that’s good, maybe? Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves? Happens with other k-mer graph assemblers (ABySS), and with at least one other (non-metagenomic) data set.
  • 45. OK, let’s assemble! Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to: 148,053 contigs, in 220 MB; max length 20322; max coverage ~10x …all done on Amazon EC2, ~ 1 week for under $500. Filtered/partitioned @ k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
  • 46. Full Iowa corn / mapping stats 1,806,800,000 QC/trimmed reads (1.8 bn) 204,900,000 reads map to some contig (11%) 37,244,000 reads map to contigs > 1kb (2.1%) > 1 kb contig is a stringent criterion! Compare: 80% of MetaHIT reads to > 500 bp; 65%+ of rumen reads to > 1kb
  • 48. High coverage partitions assemble more reads
  • 49. Success, tentatively. We are still evaluating assembly and assembly parameters; should be possible to improve in every way. (~10 hrs to redo entire assembly, once partitioned.) The main engineering point is that we can actually run this entire pipeline on a relatively small machine (8 core/68 GB RAM). We can do dozens of these in parallel on Amazon rental hardware. And, from our preliminary results, we get ~ equivalent assembly results as if we were scaling our hardware.
  • 50. Optimizing per-partition assembly Metagenomes contain mixed-abundance genomes. Current assemblers are not built for mixed-abundance samples (problem with mRNAseq, too). Repeat resolution Error/edge trimming Since we’re breaking the data set into multiple partitions containing reads that may assemble together, can we optimize assembler parameters (k, coverage) for each partition?
  • 51. Mixing parameters improves assembly statistics Objective function: maximize sum(contigs > 1kb) 4.5x average coverage – gained 228 contigs/469 kb (over 152/215 kb) 5.8x average coverage – gained 78 contigs/304 kb (over 248/708 kb) 8.2x average coverage – lost 58 contigs/gained 116 kb (over 279/803 kb)
  • 52. Conclusions Engineering: can assemble large data sets. Scaling: can assemble on rented machines. Science: can optimize assembly for individual partitions. Science: retain low-abundance.
  • 53. Caveats Quality of assembly?? Illumina sequencing bias/error issue needs to be explored. Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs. Need to better analyze upper limits of data structures. Have not applied our approaches to high-coverage data yet; in progress.
  • 54. Future thoughts Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers. So it is a good first step to try, even if it doesn’t reduce the problem significantly. Divide & conquer approach should allow more sophisticated (compute intensive) graph analysis approaches in the future. This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence. Can k-mer filtering work for non-de Bruijn graph assemblers? (SGA, ALLPATHS-LG, …)
  • 55. Acknowledgements The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze. Collaborators: Jim Tiedje (Il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU). Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
  • 57. A guide to khmer Python wrapping C++; BSD license. Tools for: K-mer abundance filtering (constant mem; inexact) Assembly graph size filtering (constant mem; exact) Assembly graph partitioning (exact) Error trimming (constant mem; inexact) Still in alpha form… undocumented, esp.
  • 58. k-mer coverage by partition
  • 59. Abundance filtering affects low-coverage contigs dramatically
  • 60. Many read pairs map together
  • 61. Bonus slides How much more do we need to sequence, anyway??
  • 62. Calculating expected k-mer numbers Entire population S1 S2 Note: no simple way to correct abundance bias, so we don’t, yet.
  • 63. Coverage estimates (Based on k-mer mark/recapture analysis.) Iowa prairie (136 GB): est 1.26 x Iowa corn (62 GB): est 0.86 x Wisconsin corn (190 GB): est 2.17 x For comparison, the panda genome assembly used ~50x with short reads. Qingpeng Zhang
  • 64. Coverage estimates: getting to 50x… Human -> 150 GB for 50x Iowa prairie (136 GB): est 1.26 x -> 5.4 TB for 50x Iowa corn (62 GB): est 0.86 x -> 3.6 TB for 50x Wisconsin corn (190 GB): est 2.17 x -> 4.4 TB for 50x …note that it’s not clear what “coverage” exactly means in this case, since 16s-estimated diversity is very high.
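The 50x projections on this slide follow from simple proportional scaling, assuming coverage grows linearly with the amount of sequence. The short check below reproduces the three soil figures; the function and variable names are hypothetical.

```python
def data_needed_gb(sequenced_gb, estimated_coverage, target_coverage=50):
    """Sequence volume needed for target coverage, assuming linear scaling."""
    return sequenced_gb * target_coverage / estimated_coverage


# (GB sequenced, estimated coverage) as quoted on the slide.
samples = {
    "Iowa prairie": (136, 1.26),
    "Iowa corn": (62, 0.86),
    "Wisconsin corn": (190, 2.17),
}

needed_tb = {name: round(data_needed_gb(gb, cov) / 1000, 1)
             for name, (gb, cov) in samples.items()}
```

Linear scaling is the optimistic case; as the slide notes, it is not clear what "coverage" means for a community with very high diversity.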
  • 65. What does coverage mean here? “Unseen” sequence: 1x ~ 37% 2x ~ 14% 5x ~ 0.7% 10x ~ 0.0045% 50x ~ 2e-20% For metagenomes, coverage is of abundance-weighted DNA.
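Percentages of this size are what a Poisson sampling model predicts, in which a position sequenced to average coverage c is missed entirely with probability e^(-c). The slide does not name a model, so treating it as Poisson is an assumption; the function name is hypothetical.

```python
import math


def unseen_fraction(coverage):
    """Probability that a given position receives zero reads under a Poisson model."""
    return math.exp(-coverage)


for c in (1, 2, 5, 10, 50):
    print(f"{c}x: {unseen_fraction(c) * 100:.2g}% unseen")
```

The steep drop between 5x and 10x is why even modest gains in coverage matter so much for samples sitting near 1x.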
  • 66. CAMERA Annotation of full set contigs (>1000 bp) # of ORFS: 344,661 (Metagene) Longest ORF: 1,974 bp Shortest ORF: 20 bp Average ORF: 173 bp # of COG hits: 153,138 (e-value < 0.001) # of Pfam hits: 170,072 # of TIGRfam hits: 315,776
  • 68. The k-mer oracle Q: is this k-mer present in the data set? A: no => then it is not. A: yes => it may or may not be present. This lets us store k-mers efficiently.
  • 69. Building on the k-mer oracle: Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
  • 70. The k-mer graph oracle Q: does this k-mer overlap with this other k-mer? A: no => then it does not, guaranteed. A: yes => it may or may not. This lets us traverse de Bruijn graphs efficiently.
  • 71. The contig size oracle Q: could this read contribute to a contig bigger than N? A: no => then it does not, guaranteed. A: yes => then it might. This lets us eliminate reads that do not belong to “big” contigs.
  • 72. The read partition oracle Does this read connect to this other read in any way? A: no => then it does not, guaranteed. A: yes => then it might. This lets us subdivide the assembly problem into many smaller, disconnected problems that are much easier.
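To make the partitioning idea concrete, here is a small exact version that groups reads sharing at least one k-mer, using a union-find structure. It is a sketch under stated assumptions: a Python dict stands in for the probabilistic oracle, reverse complements are ignored, and the function name is hypothetical. Reads that share a k-mer at k=32 have 33-mers overlapping by 32 bases, which may be why the talk partitions at k=32 and assembles at k=33.

```python
def partition_reads(reads, k):
    """Group reads into partitions connected through shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    owner = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in owner:
                parent[find(idx)] = find(owner[kmer])
            else:
                owner[kmer] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    # Largest partition first, as in the partition size listings in the talk.
    return sorted(groups.values(), key=len, reverse=True)


partitions = partition_reads(["ACGTAC", "GTACGG", "TTTTTT"], k=4)
```

With a probabilistic store in place of the dict, a false positive could merge two partitions but never split one, so each partition can still be assembled independently.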
  • 73. Oracular fact All of these oracles are cheap, can yield answers from a different probability distribution, and can be “chained” together (so you can keep on asking oracles for as long as you want, and get more and more accurate).
  • 74. Implementing a basic k-mer oracle Conveniently, perhaps the simplest data structure in computer science is what we need… …a hash table that ignores collisions. Note, P(false positive) = fractional occupancy.
  • 75. A more reliable k-mer oracle Use a Bloom filter approach – multiple oracles, in serial, are multiplicatively more reliable.
  • 76. Scaling the k-mer oracle Adding additional filters increases discrimination at the cost of speed. This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
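The trade-off on the three slides above can be put into numbers. If each table answers "yes" falsely with probability equal to its fractional occupancy, and the tables are assumed independent (an assumption made here, not stated on the slides), a k-mer is falsely reported only when every table is wrong at once. The function name and the example occupancies are hypothetical.

```python
import math


def false_positive_rate(occupancies):
    """Chance that every table reports 'present' for an absent k-mer,
    assuming the tables err independently."""
    return math.prod(occupancies)


one_table = false_positive_rate([0.2])
four_tables = false_positive_rate([0.2, 0.2, 0.2, 0.2])
```

Going from one table to four at 20% occupancy lowers the false positive rate from 0.2 to 0.0016, at the cost of four lookups per query.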
  • 79. The k-mer oracle, revisited We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately. This implicitly lets us store the graph structure, too!
  • 80. B. Partitioning graphs into disconnected subgraphs Which nodes do not connect to each other?
  • 81. Partitioning graphs – it’s easy looking Which nodes do not connect to each other?
  • 82. But partitioning big graphs is expensive Requires exhaustive exploration.
  • 83. But partitioning big graphs is expensive
  • 84. Tabu search – avoid global searches
  • 85. Tabu search – systematic local exploration
  • 86. Tabu search – systematic local exploration
  • 87. Tabu search – systematic local exploration
  • 88. Tabu search – systematic local exploration
  • 89. Strategies for completing big searches…
  • 90. Hard-to-traverse graphs are well-connected
  • 92. Exclusion strategy lets you systematically explore big graphs with a local algorithm
  • 93. Potential problems Our oracle can mistakenly connect clusters.
  • 94. Potential problems This is a problem if the rate is sufficiently high!
  • 95. However, the error is one-sided: Graphs will never be erroneously disconnected
  • 96. The error is one-sided: Nodes will never be erroneously disconnected
  • 97. The error is one-sided: Nodes will never be erroneously disconnected. This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers. This, in turn, lets us reliably partition graphs into smaller graphs.

Editor's notes

  1. Expand on this last point
  2. Quantify, or do cumulative distribution
  3. Bridge between this kind of view and k-mers
  4. Constant memory
  5. @@
  6. @@
  7. @@ k up to 64 graph
  8. Expand; talk about density, circumference
  9. @@ redo
  10. @@ redo
  11. Details!
  12. 2x coverage vs 10x coverage? Add “reads”
  13. Funding: MSU startup, USDA NIFA, DOE, BEACON, Amazon.
  14. Refactor with error bars, etc.
  15. Put in subtraction foo
  16. Briefly, all six open reading frames (ORFs) were translated by the ORF_finder (or ORFs were predicted by MetaGene) from translation table 11 with minimum length 30 aa. The ORFs were clustered at 90 (default 90) % identity to identify the non-redundant sequences, which are further clustered to families at a conservative threshold 60 (default 60) % identity over 80 (default 80) % of length of ORFs. The resulting ORFs are annotated from Pfam and Tigrfam with HMMER, accelerated with Hammerhead, and from COG with RPS-BLAST with e-values less than 0.001. GO annotations were mapped from Pfam or Tigrfam and EC numbers were mapped from the GO database.
  17. Paint between the greens.
  18. When a green connects two or more colors, recolor one color.
  19. Dependent on minimum density tagging