This document summarizes a talk about assembling large metagenomic datasets. The speaker discusses the challenge of assembling very large volumes of metagenomic sequence data, on which standard assembly techniques scale poorly. They present a solution that uses k-mer graphs and probabilistic data structures to store and traverse very large graphs efficiently. This allows them to reduce the data size exactly, through techniques like filtering unconnected reads and partitioning reads into disconnected subgraphs. They demonstrate the approach by assembling over 200 GB of sequence data from an Iowa corn field soil sample.
Climbing Mt. Metagenome
1. Scaling Mt. Metagenome: Assembling very large data sets. C. Titus Brown, Assistant Professor, Computer Science and Engineering / Microbiology and Molecular Genetics, Michigan State University
2. Thanks for coming! Note: this talk is about the computational side of metagenome assembly, motivated by the Great Prairie Grand Challenge soil sequencing project. Jim Tiedje will talk about the project as a whole at the JGI User’s Meeting.
3. The basic problem. Lots of metagenomic sequence data (200 GB of Illumina for < $20k?). Assembly, especially metagenome assembly, scales poorly (due to high diversity). Standard assembly techniques don’t work well with sequences from genomes of differing abundance. Many people don’t have the computational resources needed to assemble (~1 TB of RAM or more), if such resources are available at all.
4. We can’t just throw more hardware at the problem, either. Lincoln Stein
5. Jumping to the end: We have implemented a solution to these problems: scalability of assembly, lack of resources, and parameter choice. We demonstrate this solution on a high-diversity sample (219.1 Gb of Iowa corn field soil metagenome). …there is an additional surprise or two, so you should stick around!
6. Whole genome shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
7. K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)
8. K-mer graphs - branching. For decisions about which paths to follow, etc., biology-based heuristics come into play as well.
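To make the k-mer graph idea concrete, here is a minimal Python sketch (illustrative only, not khmer's actual code or API; all names are mine): it builds the node set from reads, finds each node's neighbors (k-mers overlapping by k-1 bases), and flags the branch points where path decisions arise.

```python
# Minimal sketch: nodes are k-mers, edges are (k-1)-base overlaps.

def kmers(read, k):
    """Yield every k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def neighbors(kmer, present):
    """Graph neighbors of `kmer` that actually occur in the data set."""
    found = []
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate in present and candidate != kmer:
                found.append(candidate)
    return found

reads = ["ACGTACGGACTT", "GGACTTACCA"]
present = {km for read in reads for km in kmers(read, k=5)}

# Branch points -- nodes with more than two distinct neighbors -- are
# exactly where assembly decisions (and biology-based heuristics) come in.
branches = [km for km in present if len(neighbors(km, present)) > 2]
print(branches)
```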
9. Too much data – what can we do? Reduce the size of the data (either with an approximate or an exact approach) Divide & conquer: subdivide the problem. For exact data reduction or subdivision, need to grok the entire assembly graph structure. …but that is why assembly scales poorly in the first place.
13. Abundance filtering. Approach used in two published Illumina metagenomic papers (the MetaHIT/human microbiome and rumen papers): remove or trim reads with low-abundance k-mers, which arise either from errors or from low-abundance organisms. Inexact data reduction: may or may not remove usable data. Works well for high-coverage data sets (rumen est. 56x!). However, for low-coverage or high-diversity data sets, abundance filtering will reject potentially useful reads.
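A sketch of the idea (illustrative, not the published pipelines' actual code; reuses the `kmers()` helper from the earlier sketch). The inexactness is visible here: a read carrying a low-count k-mer may be an error or a genuinely rare organism, and the filter cannot tell which.

```python
from collections import Counter

def abundance_filter(reads, k=32, min_count=2):
    """Drop reads containing any k-mer seen fewer than min_count times."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [read for read in reads
            if all(counts[km] >= min_count for km in kmers(read, k))]
```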
15. Two exact data reduction techniques: Eliminate reads that do not connect to many other reads (“graph size filtering”). Group reads by connectivity into different partitions of the entire graph (“partitioning”). For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
19. Engineering overview Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure; With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k. Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes). For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
20. Store graph nodes in Bloom filter Graph traversal is done in full k-mer space; Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
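To make this concrete, here is a minimal Bloom-filter sketch in Python (illustrative only; khmer's real implementation is C++ and uses multiple prime-sized hash tables, and all names here are mine). Presence/absence of k-mers goes into the filter, and traversal in full k-mer space simply asks the filter which of the eight possible neighbors exist.

```python
import hashlib

class BloomFilter:
    """Presence/absence of k-mers; no collision tracking."""
    def __init__(self, size, num_hashes=2):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)          # one byte per slot, for simplicity

    def _slots(self, kmer):
        for i in range(self.num_hashes):
            h = hashlib.sha1((str(i) + kmer).encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, kmer):
        for s in self._slots(kmer):
            self.bits[s] = 1

    def __contains__(self, kmer):
        return all(self.bits[s] for s in self._slots(kmer))

def neighbors(kmer, bf):
    """Possible graph neighbors of a k-mer. 'Yes' answers may be Bloom
    false positives; 'no' answers are guaranteed correct."""
    return [n for base in "ACGT"
              for n in (kmer[1:] + base, base + kmer[:-1])
              if n in bf]
```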
21. Practical application. Enables: graph trimming (exact removal), partitioning (exact subdivision), abundance filtering… all for K <= 64, on 200+ GB sequence collections. All results (except comparisons) obtained using a single Amazon EC2 4xlarge node: 68 GB of RAM / 8 cores. Similar running times to using Velvet alone.
23. Does removing small graphs work? Small data set (35m reads / 3.4 gb rhizosphere soil sample). Filtered at k=32, assembled at k=33 with ABySS.

                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Filtered (2m reads)   130         223,341    61,766

YES.
24. Does partitioning into disconnected graphs work? Partitioned the same data set (35m reads / 3.5 gb) into 45k partitions containing > 10 reads; assembled partitions separately (partitioned at k=32, assembled at k=33).

                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Sum of partitions     130         223,341    61,766

YES.
25. Data reduction for assembly / practical details. Reduction performed on a machine with 16 gb of RAM.
Removing poorly connected reads: 35m -> 2m reads.
- Memory required reduced from 40 gb to 2 gb;
- Time reduced from 4 hrs to 20 minutes.
Partitioning reads into disconnected groups:
- Biggest group is 300k reads;
- Memory required reduced from 40 gb to 500 mb;
- Time reduced from 4 hrs to < 5 minutes/group.
26. Does it work on bigger data sets?
35m read data set partition sizes:
P1: 277,043 reads
P2: 5,776 reads
P3: 4,444 reads
P4: 3,513 reads
P5: 2,528 reads
P6: 2,397 reads
…
Iowa continuous corn GA2 partitions (218.5m reads):
P1: 204,582,365 reads
P2: 3,583 reads
P3: 2,917 reads
P4: 2,463 reads
P5: 2,435 reads
P6: 2,316 reads
…
27. Problem: big data sets have one big partition!? Too big to handle on EC2; assembles with low coverage. Contains 2.5 bn unique k-mers (~500 microbial genomes) at ~3-5x coverage. As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble! Both for our approach, and possibly for assembly in general (because it assembles more poorly than it should for its coverage/size).
28. Why this lump? Candidates:
- Real biological connectivity (rRNA, conserved genes, etc.)
- A bug in our software
- A sequencing artifact or error
34. Trimming reads. Trim at high “sodd” (sum of degree–degree distribution): from each k-mer in each read, walk two k-mers in all directions in the graph; if more than 3 k-mers can be found at exactly two steps, trim the remainder of the sequence. Overly stringent; actually trims the (k-1) connectivity graph by degree.
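A hedged sketch of that trimming rule as I read it from the slide (the exact khmer implementation differs; this reuses the Bloom-filter `neighbors()` helper from the earlier sketch):

```python
def trim_read(read, bf, k=32, max_two_step=3):
    """Trim a read at the first position of high local graph density."""
    for i in range(len(read) - k + 1):
        start = read[i:i + k]
        one_step = set(neighbors(start, bf))
        # k-mers reachable at exactly two steps, in any direction
        two_step = {n2 for n1 in one_step for n2 in neighbors(n1, bf)}
        two_step -= one_step | {start}
        if len(two_step) > max_two_step:
            return read[:i + k]       # trim the remainder of the sequence
    return read
```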
36. Preferential attachment due to bias Any sufficiently large collection of connected reads will have one or more reads containing an artifact; These artifacts will then connect that group of reads to all other groups possessing artifacts; …and all high-coverage contigs will amalgamate into a single graph.
41. What does density trimming do to assembly? 204m reads in the lump assemble into 52,610 contigs, 73.5 MB total. 180m reads in the trimmed lump assemble into 57,135 contigs, 83.6 MB total. (All contigs > 1 kb.) Filtered/partitioned @ k=32, assembled @ k=33, expcov=auto, cov_cutoff=0.
42. Wait, what? Yes, trimming these “knot-like” sequences improves the overall assembly! We remove 25.6m reads and gain 10.1 MB!? The trend is the same for ABySS, another k-mer graph assembler.
43. Is this a valid assembly? Paired-end usage is good. 50% of contigs have a BLASTX hit better than 1e-20 in SwissProt; 75% of contigs have a BLASTX hit better than 1e-20 in TrEMBL. Reference genomes sequenced by JGI: Frateuria aurantia: 1,376 hits > 100 aa; Saprospira grandis: 1,114 hits > 100 aa (> 50% identity over > 50% of gene).
44. So what’s going on? Current assemblers are bad at dealing with certain graph structures (“knots”). If we can untangle knots for them, that’s good, maybe? Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves? Happens with other k-mer graph assemblers (ABySS), and with at least one other (non-metagenomic) data set.
45. OK, let’s assemble! Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to 148,053 contigs in 220 MB; max length 20,322; max coverage ~10x. …all done on Amazon EC2, ~1 week for under $500. Filtered/partitioned @ k=32, assembled @ k=33, expcov=auto, cov_cutoff=0.
46. Full Iowa corn / mapping stats.
1,806,800,000 QC/trimmed reads (1.8 bn)
204,900,000 reads map to some contig (11%)
37,244,000 reads map to contigs > 1 kb (2.1%)
The > 1 kb contig cutoff is a stringent criterion! Compare: 80% of MetaHIT reads map to contigs > 500 bp; 65%+ of rumen reads map to contigs > 1 kb.
49. Success, tentatively. We are still evaluating the assembly and assembly parameters; it should be possible to improve in every way. (~10 hrs to redo the entire assembly, once partitioned.) The main engineering point is that we can actually run this entire pipeline on a relatively small machine (8 cores / 68 GB RAM). We can do dozens of these in parallel on Amazon rental hardware. And, from our preliminary results, we get approximately equivalent assembly results to scaling our hardware.
50. Optimizing per-partition assembly. Metagenomes contain genomes at mixed abundances. Current assemblers are not built for mixed-abundance samples (a problem with mRNAseq, too): repeat resolution, error/edge trimming. Since we’re breaking the data set into multiple partitions containing reads that may assemble together, can we optimize assembler parameters (k, coverage) for each partition?
52. Conclusions. Engineering: we can assemble large data sets. Scaling: we can assemble on rented machines. Science: we can optimize assembly for individual partitions. Science: we retain low-abundance sequence.
53. Caveats Quality of assembly?? Illumina sequencing bias/error issue needs to be explored. Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs. Need to better analyze upper limits of data structures. Have not applied our approaches to high-coverage data yet; in progress.
54. Future thoughts. Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers, so it is a good first step to try, even if it doesn’t reduce the problem significantly. The divide & conquer approach should allow more sophisticated (compute-intensive) graph analysis approaches in the future. This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence. Can k-mer filtering work for non-de Bruijn graph assemblers (SGA, ALLPATHS-LG, …)?
55. Acknowledgements. The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze. Collaborators: Jim Tiedje (il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU). Funding: USDA NIFA; MSU startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
57. A guide to khmer. Python wrapping C++; BSD license. Tools for: k-mer abundance filtering (constant memory; inexact); assembly graph size filtering (constant memory; exact); assembly graph partitioning (exact); error trimming (constant memory; inexact). Still in alpha form… and largely undocumented.
62. Calculating expected k-mer numbers. (Diagram: the entire population, sampled by S1 and S2.) Note: there is no simple way to correct for abundance bias, so we don’t, yet.
63. Coverage estimates (based on k-mer mark/recapture analysis). Iowa prairie (136 GB): est. 1.26x. Iowa corn (62 GB): est. 0.86x. Wisconsin corn (190 GB): est. 2.17x. For comparison, the panda genome assembly used ~50x with short reads. Qingpeng Zhang.
64. Coverage estimates: getting to 50x…
Human -> 150 GB for 50x
Iowa prairie (136 GB, est. 1.26x) -> 5.4 TB for 50x
Iowa corn (62 GB, est. 0.86x) -> 3.6 TB for 50x
Wisconsin corn (190 GB, est. 2.17x) -> 4.4 TB for 50x
…note that it’s not clear what “coverage” exactly means in this case, since 16S-estimated diversity is very high.
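These extrapolations appear to follow a simple proportion (my arithmetic, not stated on the slide): sequence needed for 50x ≈ (amount sequenced / estimated coverage) × 50. For example, Iowa prairie: 136 GB / 1.26 × 50 ≈ 5,400 GB ≈ 5.4 TB.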
65. What does coverage mean here? “Unseen” sequence:
1x ~ 37%
2x ~ 14%
5x ~ 0.7%
10x ~ 0.005%
50x ~ 2e-20%
For metagenomes, coverage is of abundance-weighted DNA.
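These fractions are consistent with the zero class of a Poisson model (my reading; the slide does not name it): at coverage c, the expected unseen fraction is e^(-c), giving e^(-1) ≈ 37%, e^(-2) ≈ 14%, e^(-5) ≈ 0.7%, e^(-10) ≈ 0.005%, and e^(-50) ≈ 2e-20%.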
66. CAMERA annotation of the full set of contigs (> 1,000 bp):
# of ORFs: 344,661 (MetaGene)
Longest ORF: 1,974 bp
Shortest ORF: 20 bp
Average ORF: 173 bp
# of COG hits: 153,138 (e-value < 0.001)
# of Pfam hits: 170,072
# of TIGRFAM hits: 315,776
68. The k-mer oracle Q: is this k-mer present in the data set? A: no => then it is not. A: yes => it may or may not be present. This lets us store k-mers efficiently.
69. Building on the k-mer oracle: Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
70. The k-mer graph oracle Q: does this k-mer overlap with this other k-mer? A: no => then it does not, guaranteed. A: yes => it may or may not. This lets us traverse de Bruijn graphs efficiently.
71. The contig size oracle Q: could this read contribute to a contig bigger than N? A: no => then it does not, guaranteed. A: yes => then it might. This lets us eliminate reads that do not belong to “big” contigs.
72. The read partition oracle. Q: does this read connect to this other read in any way? A: no => then it does not, guaranteed. A: yes => then it might. This lets us subdivide the assembly problem into many smaller, disconnected problems that are much easier.
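A sketch of exact partitioning built on that oracle (illustrative, not khmer's streaming implementation; reuses the Bloom-filter `neighbors()` helper from the earlier sketch): flood-fill connected components of the k-mer graph, then group reads by the component their k-mers fall in. Note the one-sided error at work: Bloom false positives can at worst merge two partitions, never split a true one.

```python
from collections import deque

def component(seed, bf, labeled):
    """Flood-fill every k-mer reachable from `seed` in the graph."""
    comp, queue = set(), deque([seed])
    while queue:
        km = queue.popleft()
        if km in comp or km in labeled:
            continue
        comp.add(km)
        queue.extend(neighbors(km, bf))
    return comp

def partition_reads(reads, bf, k=32):
    """Group reads by the connected component of their first k-mer."""
    label, partitions = {}, {}
    for read in reads:
        seed = read[:k]
        if seed not in label:
            pid = len(partitions)
            partitions[pid] = []
            for km in component(seed, bf, label):
                label[km] = pid
        partitions[label[seed]].append(read)
    return partitions
```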
73. Oracular fact. All of these oracles are cheap, can yield answers with different probability distributions, and can be “chained” together (so you can keep on asking oracles for as long as you want, and get more and more accurate answers).
74. Implementing a basic k-mer oracle Conveniently, perhaps the simplest data structure in computer science is what we need… …a hash table that ignores collisions. Note, P(false positive) = fractional occupancy.
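A minimal sketch of that simplest structure (names are mine; khmer's implementation differs), showing the one-sided error and why P(false positive) equals the table's fractional occupancy:

```python
class CollisionObliviousTable:
    """A hash table that ignores collisions. 'No' answers are always
    correct; 'yes' answers are wrong with probability = occupancy."""
    def __init__(self, size=4_000_000):
        self.size = size
        self.bits = bytearray(size)

    def add(self, kmer):
        self.bits[hash(kmer) % self.size] = 1

    def __contains__(self, kmer):
        return bool(self.bits[hash(kmer) % self.size])

    def occupancy(self):
        # P(false positive) for a random absent k-mer
        return sum(self.bits) / self.size
```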
75. A more reliable k-mer oracle. Use a Bloom filter approach: multiple oracles, in series, are multiplicatively more reliable.
76. Scaling the k-mer oracle Adding additional filters increases discrimination at the cost of speed. This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
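To make the tradeoff concrete (my arithmetic, not from the slide): with N independent tables each at fractional occupancy f, an absent k-mer passes all of them with probability ≈ f^N. Four tables at 25% occupancy give 0.25^4 ≈ 0.4% false positives, at the cost of four lookups instead of one.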
79. The k-mer oracle, revisited We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately. This implicitly lets us store the graph structure, too!
80. B. Partitioning graphs into disconnected subgraphs Which nodes do not connect to each other?
97. The error is one-sided: Nodes will never be erroneously disconnected. This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers. This, in turn, lets us reliably partition graphs into smaller graphs.
Briefly, all six open reading frames (ORFs) were translated by ORF_finder (or ORFs were predicted by MetaGene) using translation table 11, with a minimum length of 30 aa. The ORFs were clustered at 90% identity (the default) to identify non-redundant sequences, which were further clustered into families at a conservative threshold of 60% identity (the default) over 80% (the default) of the ORF length. The resulting ORFs were annotated against Pfam and TIGRFAM with HMMER (accelerated with Hammerhead), and against COG with RPS-BLAST at e-values below 0.001. GO annotations were mapped from Pfam or TIGRFAM, and EC numbers were mapped from the GO database.
Paint between the green (tagged) nodes. When a green node connects two or more colors, recolor one color (merging those partitions).