This document summarizes a talk about assembling large metagenomic datasets. The speaker discusses the challenge of assembling very large volumes of metagenomic sequence data, on which standard assembly techniques scale poorly. They present a solution that uses k-mer graphs and probabilistic data structures to store and traverse very large graphs efficiently. This allows them to reduce the data size exactly, through techniques like filtering unconnected reads and partitioning reads into disconnected subgraphs. They demonstrate the approach by assembling over 200 GB of sequence data from an Iowa corn field soil sample.
Climbing Mt. Metagenome
1. Scaling Mt. Metagenome: Assembling very large data sets. C. Titus Brown, Assistant Professor, Computer Science and Engineering / Microbiology and Molecular Genetics, Michigan State University
2. Thanks for coming! Note: this talk is about the computational side of metagenome assembly, motivated by the Great Prairie Grand Challenge soil sequencing project. Jim Tiedje will talk about the project as a whole at the JGI User’s Meeting.
3. The basic problem. Lots of metagenomic sequence data (200 GB of Illumina for < $20k?). Assembly, especially metagenome assembly, scales poorly (due to high diversity). Standard assembly techniques don’t work well with sequences from genomes of differing abundance. Many people don’t have the computational resources needed to assemble (~1 TB of RAM or more), if such resources are available at all.
4. We can’t just throw more hardware at the problem, either. Lincoln Stein
5. Jumping to the end: We have implemented a solution to these problems: scalability of assembly, lack of resources, and parameter choice. We demonstrate this solution on a high-diversity sample (219.1 Gb of Iowa corn field soil metagenome). …there is an additional surprise or two, so you should stick around!
6. Whole genome shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
7. K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)
8. K-mer graphs - branching. For decisions about which paths to follow, etc., biology-based heuristics come into play as well.
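To make the k-mer graph idea concrete, here is a minimal Python sketch (illustrative only, not khmer's actual code or API; all names are mine): it builds the node set from reads, finds each node's neighbors (k-mers overlapping by k-1 bases), and flags the branch points where path decisions arise.

```python
# Minimal sketch: nodes are k-mers, edges are (k-1)-base overlaps.

def kmers(read, k):
    """Yield every k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def neighbors(kmer, present):
    """Graph neighbors of `kmer` that actually occur in the data set."""
    found = []
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate in present and candidate != kmer:
                found.append(candidate)
    return found

reads = ["ACGTACGGACTT", "GGACTTACCA"]
present = {km for read in reads for km in kmers(read, k=5)}

# Branch points -- nodes with more than two distinct neighbors -- are
# exactly where assembly decisions (and biology-based heuristics) come in.
branches = [km for km in present if len(neighbors(km, present)) > 2]
print(branches)
```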
9. Too much data – what can we do? Reduce the size of the data (either with an approximate or an exact approach) Divide & conquer: subdivide the problem. For exact data reduction or subdivision, need to grok the entire assembly graph structure. …but that is why assembly scales poorly in the first place.
13. Abundance filtering. Approach used in two published Illumina metagenomic papers (the MetaHIT/human microbiome and rumen papers): remove or trim reads with low-abundance k-mers, which arise either from errors or from low-abundance organisms. Inexact data reduction: may or may not remove usable data. Works well for high-coverage data sets (rumen est. 56x!). However, for low-coverage or high-diversity data sets, abundance filtering will reject potentially useful reads.
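A sketch of the idea (illustrative, not the published pipelines' actual code; reuses the `kmers()` helper from the earlier sketch). The inexactness is visible here: a read carrying a low-count k-mer may be an error or a genuinely rare organism, and the filter cannot tell which.

```python
from collections import Counter

def abundance_filter(reads, k=32, min_count=2):
    """Drop reads containing any k-mer seen fewer than min_count times."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [read for read in reads
            if all(counts[km] >= min_count for km in kmers(read, k))]
```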
15. Two exact data reduction techniques: Eliminate reads that do not connect to many other reads (“graph size filtering”). Group reads by connectivity into different partitions of the entire graph (“partitioning”). For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
19. Engineering overview Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure; With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k. Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes). For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
20. Store graph nodes in Bloom filter Graph traversal is done in full k-mer space; Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
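To make this concrete, here is a minimal Bloom-filter sketch in Python (illustrative only; khmer's real implementation is C++ and uses multiple prime-sized hash tables, and all names here are mine). Presence/absence of k-mers goes into the filter, and traversal in full k-mer space simply asks the filter which of the eight possible neighbors exist.

```python
import hashlib

class BloomFilter:
    """Presence/absence of k-mers; no collision tracking."""
    def __init__(self, size, num_hashes=2):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)          # one byte per slot, for simplicity

    def _slots(self, kmer):
        for i in range(self.num_hashes):
            h = hashlib.sha1((str(i) + kmer).encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, kmer):
        for s in self._slots(kmer):
            self.bits[s] = 1

    def __contains__(self, kmer):
        return all(self.bits[s] for s in self._slots(kmer))

def neighbors(kmer, bf):
    """Possible graph neighbors of a k-mer. 'Yes' answers may be Bloom
    false positives; 'no' answers are guaranteed correct."""
    return [n for base in "ACGT"
              for n in (kmer[1:] + base, base + kmer[:-1])
              if n in bf]
```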
21. Practical application. Enables: graph trimming (exact removal), partitioning (exact subdivision), abundance filtering… all for K <= 64, on 200+ GB sequence collections. All results (except comparisons) obtained using a single Amazon EC2 4xlarge node: 68 GB of RAM / 8 cores. Similar running times to using Velvet alone.
23. Does removing small graphs work? Small data set (35m reads / 3.4 gb rhizosphere soil sample). Filtered at k=32, assembled at k=33 with ABySS.

                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Filtered (2m reads)   130         223,341    61,766

YES.
24. Does partitioning into disconnected graphs work? Partitioned the same data set (35m reads / 3.5 gb) into 45k partitions containing > 10 reads; assembled partitions separately (partitioned at k=32, assembled at k=33).

                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Sum of partitions     130         223,341    61,766

YES.
25. Data reduction for assembly / practical details. Reduction performed on a machine with 16 gb of RAM.
Removing poorly connected reads: 35m -> 2m reads.
- Memory required reduced from 40 gb to 2 gb;
- Time reduced from 4 hrs to 20 minutes.
Partitioning reads into disconnected groups:
- Biggest group is 300k reads;
- Memory required reduced from 40 gb to 500 mb;
- Time reduced from 4 hrs to < 5 minutes/group.
26. Does it work on bigger data sets?
35m read data set partition sizes:
P1: 277,043 reads
P2: 5,776 reads
P3: 4,444 reads
P4: 3,513 reads
P5: 2,528 reads
P6: 2,397 reads
…
Iowa continuous corn GA2 partitions (218.5m reads):
P1: 204,582,365 reads
P2: 3,583 reads
P3: 2,917 reads
P4: 2,463 reads
P5: 2,435 reads
P6: 2,316 reads
…
27. Problem: big data sets have one big partition!? Too big to handle on EC2; assembles with low coverage. Contains 2.5 bn unique k-mers (~500 microbial genomes) at ~3-5x coverage. As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble! Both for our approach, and possibly for assembly in general (because it assembles more poorly than it should for its coverage/size).
28. Why this lump? Candidates:
- Real biological connectivity (rRNA, conserved genes, etc.)
- A bug in our software
- A sequencing artifact or error
34. Trimming reads. Trim at high “sodd” (sum of degree–degree distribution): from each k-mer in each read, walk two k-mers in all directions in the graph; if more than 3 k-mers can be found at exactly two steps, trim the remainder of the sequence. Overly stringent; actually trims the (k-1) connectivity graph by degree.
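A hedged sketch of that trimming rule as I read it from the slide (the exact khmer implementation differs; this reuses the Bloom-filter `neighbors()` helper from the earlier sketch):

```python
def trim_read(read, bf, k=32, max_two_step=3):
    """Trim a read at the first position of high local graph density."""
    for i in range(len(read) - k + 1):
        start = read[i:i + k]
        one_step = set(neighbors(start, bf))
        # k-mers reachable at exactly two steps, in any direction
        two_step = {n2 for n1 in one_step for n2 in neighbors(n1, bf)}
        two_step -= one_step | {start}
        if len(two_step) > max_two_step:
            return read[:i + k]       # trim the remainder of the sequence
    return read
```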
36. Preferential attachment due to bias Any sufficiently large collection of connected reads will have one or more reads containing an artifact; These artifacts will then connect that group of reads to all other groups possessing artifacts; …and all high-coverage contigs will amalgamate into a single graph.
41. What does density trimming do to assembly? 204m reads in the lump assemble into 52,610 contigs, 73.5 MB total. 180m reads in the trimmed lump assemble into 57,135 contigs, 83.6 MB total. (All contigs > 1 kb.) Filtered/partitioned @ k=32, assembled @ k=33, expcov=auto, cov_cutoff=0.
42. Wait, what? Yes, trimming these “knot-like” sequences improves the overall assembly! We remove 25.6m reads and gain 10.1 MB!? The trend is the same for ABySS, another k-mer graph assembler.
43. Is this a valid assembly? Paired-end usage is good. 50% of contigs have a BLASTX hit better than 1e-20 in SwissProt; 75% of contigs have a BLASTX hit better than 1e-20 in TrEMBL. Reference genomes sequenced by JGI: Frateuria aurantia: 1,376 hits > 100 aa; Saprospira grandis: 1,114 hits > 100 aa (> 50% identity over > 50% of gene).
44. So what’s going on? Current assemblers are bad at dealing with certain graph structures (“knots”). If we can untangle knots for them, that’s good, maybe? Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves? Happens with other k-mer graph assemblers (ABySS), and with at least one other (non-metagenomic) data set.
45. OK, let’s assemble! Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to 148,053 contigs in 220 MB; max length 20,322; max coverage ~10x. …all done on Amazon EC2, ~1 week for under $500. Filtered/partitioned @ k=32, assembled @ k=33, expcov=auto, cov_cutoff=0.
46. Full Iowa corn / mapping stats.
1,806,800,000 QC/trimmed reads (1.8 bn)
204,900,000 reads map to some contig (11%)
37,244,000 reads map to contigs > 1 kb (2.1%)
The > 1 kb contig cutoff is a stringent criterion! Compare: 80% of MetaHIT reads map to contigs > 500 bp; 65%+ of rumen reads map to contigs > 1 kb.
49. Success, tentatively. We are still evaluating the assembly and assembly parameters; it should be possible to improve in every way. (~10 hrs to redo the entire assembly, once partitioned.) The main engineering point is that we can actually run this entire pipeline on a relatively small machine (8 cores / 68 GB RAM). We can do dozens of these in parallel on Amazon rental hardware. And, from our preliminary results, we get approximately equivalent assembly results to scaling our hardware.
50. Optimizing per-partition assembly. Metagenomes contain genomes at mixed abundances. Current assemblers are not built for mixed-abundance samples (a problem with mRNAseq, too): repeat resolution, error/edge trimming. Since we’re breaking the data set into multiple partitions containing reads that may assemble together, can we optimize assembler parameters (k, coverage) for each partition?
52. Conclusions. Engineering: we can assemble large data sets. Scaling: we can assemble on rented machines. Science: we can optimize assembly for individual partitions. Science: we retain low-abundance sequence.
53. Caveats Quality of assembly?? Illumina sequencing bias/error issue needs to be explored. Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs. Need to better analyze upper limits of data structures. Have not applied our approaches to high-coverage data yet; in progress.
54. Future thoughts. Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers, so it is a good first step to try, even if it doesn’t reduce the problem significantly. The divide & conquer approach should allow more sophisticated (compute-intensive) graph analysis approaches in the future. This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence. Can k-mer filtering work for non-de Bruijn graph assemblers (SGA, ALLPATHS-LG, …)?
55. Acknowledgements. The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze. Collaborators: Jim Tiedje (il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU). Funding: USDA NIFA; MSU startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
57. A guide to khmer. Python wrapping C++; BSD license. Tools for: k-mer abundance filtering (constant memory; inexact); assembly graph size filtering (constant memory; exact); assembly graph partitioning (exact); error trimming (constant memory; inexact). Still in alpha form… and largely undocumented.
62. Calculating expected k-mer numbers. (Diagram: the entire population, sampled by S1 and S2.) Note: there is no simple way to correct for abundance bias, so we don’t, yet.
63. Coverage estimates (based on k-mer mark/recapture analysis). Iowa prairie (136 GB): est. 1.26x. Iowa corn (62 GB): est. 0.86x. Wisconsin corn (190 GB): est. 2.17x. For comparison, the panda genome assembly used ~50x with short reads. Qingpeng Zhang.
64. Coverage estimates: getting to 50x…
Human -> 150 GB for 50x
Iowa prairie (136 GB, est. 1.26x) -> 5.4 TB for 50x
Iowa corn (62 GB, est. 0.86x) -> 3.6 TB for 50x
Wisconsin corn (190 GB, est. 2.17x) -> 4.4 TB for 50x
…note that it’s not clear what “coverage” exactly means in this case, since 16S-estimated diversity is very high.
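These extrapolations appear to follow a simple proportion (my arithmetic, not stated on the slide): sequence needed for 50x ≈ (amount sequenced / estimated coverage) × 50. For example, Iowa prairie: 136 GB / 1.26 × 50 ≈ 5,400 GB ≈ 5.4 TB.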
65. What does coverage mean here? “Unseen” sequence:
1x ~ 37%
2x ~ 14%
5x ~ 0.7%
10x ~ 0.005%
50x ~ 2e-20%
For metagenomes, coverage is of abundance-weighted DNA.
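These fractions are consistent with the zero class of a Poisson model (my reading; the slide does not name it): at coverage c, the expected unseen fraction is e^(-c), giving e^(-1) ≈ 37%, e^(-2) ≈ 14%, e^(-5) ≈ 0.7%, e^(-10) ≈ 0.005%, and e^(-50) ≈ 2e-20%.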
66. CAMERA annotation of the full set of contigs (> 1,000 bp):
# of ORFs: 344,661 (MetaGene)
Longest ORF: 1,974 bp
Shortest ORF: 20 bp
Average ORF: 173 bp
# of COG hits: 153,138 (e-value < 0.001)
# of Pfam hits: 170,072
# of TIGRFAM hits: 315,776
68. The k-mer oracle Q: is this k-mer present in the data set? A: no => then it is not. A: yes => it may or may not be present. This lets us store k-mers efficiently.
69. Building on the k-mer oracle: Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
70. The k-mer graph oracle Q: does this k-mer overlap with this other k-mer? A: no => then it does not, guaranteed. A: yes => it may or may not. This lets us traverse de Bruijn graphs efficiently.
71. The contig size oracle Q: could this read contribute to a contig bigger than N? A: no => then it does not, guaranteed. A: yes => then it might. This lets us eliminate reads that do not belong to “big” contigs.
72. The read partition oracle. Q: does this read connect to this other read in any way? A: no => then it does not, guaranteed. A: yes => then it might. This lets us subdivide the assembly problem into many smaller, disconnected problems that are much easier.
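A sketch of exact partitioning built on that oracle (illustrative, not khmer's streaming implementation; reuses the Bloom-filter `neighbors()` helper from the earlier sketch): flood-fill connected components of the k-mer graph, then group reads by the component their k-mers fall in. Note the one-sided error at work: Bloom false positives can at worst merge two partitions, never split a true one.

```python
from collections import deque

def component(seed, bf, labeled):
    """Flood-fill every k-mer reachable from `seed` in the graph."""
    comp, queue = set(), deque([seed])
    while queue:
        km = queue.popleft()
        if km in comp or km in labeled:
            continue
        comp.add(km)
        queue.extend(neighbors(km, bf))
    return comp

def partition_reads(reads, bf, k=32):
    """Group reads by the connected component of their first k-mer."""
    label, partitions = {}, {}
    for read in reads:
        seed = read[:k]
        if seed not in label:
            pid = len(partitions)
            partitions[pid] = []
            for km in component(seed, bf, label):
                label[km] = pid
        partitions[label[seed]].append(read)
    return partitions
```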
73. Oracular fact. All of these oracles are cheap, can yield answers with different probability distributions, and can be “chained” together (so you can keep on asking oracles for as long as you want, and get more and more accurate answers).
74. Implementing a basic k-mer oracle Conveniently, perhaps the simplest data structure in computer science is what we need… …a hash table that ignores collisions. Note, P(false positive) = fractional occupancy.
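A minimal sketch of that simplest structure (names are mine; khmer's implementation differs), showing the one-sided error and why P(false positive) equals the table's fractional occupancy:

```python
class CollisionObliviousTable:
    """A hash table that ignores collisions. 'No' answers are always
    correct; 'yes' answers are wrong with probability = occupancy."""
    def __init__(self, size=4_000_000):
        self.size = size
        self.bits = bytearray(size)

    def add(self, kmer):
        self.bits[hash(kmer) % self.size] = 1

    def __contains__(self, kmer):
        return bool(self.bits[hash(kmer) % self.size])

    def occupancy(self):
        # P(false positive) for a random absent k-mer
        return sum(self.bits) / self.size
```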
75. A more reliable k-mer oracle. Use a Bloom filter approach: multiple oracles, in series, are multiplicatively more reliable.
76. Scaling the k-mer oracle Adding additional filters increases discrimination at the cost of speed. This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
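To make the tradeoff concrete (my arithmetic, not from the slide): with N independent tables each at fractional occupancy f, an absent k-mer passes all of them with probability ≈ f^N. Four tables at 25% occupancy give 0.25^4 ≈ 0.4% false positives, at the cost of four lookups instead of one.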
79. The k-mer oracle, revisited We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately. This implicitly lets us store the graph structure, too!
80. B. Partitioning graphs into disconnected subgraphs Which nodes do not connect to each other?
97. The error is one-sided: Nodes will never be erroneously disconnected. This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers. This, in turn, lets us reliably partition graphs into smaller graphs.
Briefly, all six open reading frames (ORFs) were translated by ORF_finder (or ORFs were predicted by MetaGene) using translation table 11, with a minimum length of 30 aa. The ORFs were clustered at 90% identity (the default) to identify non-redundant sequences, which were further clustered into families at a conservative threshold of 60% identity (the default) over 80% (the default) of the ORF length. The resulting ORFs were annotated against Pfam and TIGRFAM with HMMER (accelerated with Hammerhead), and against COG with RPS-BLAST at e-values below 0.001. GO annotations were mapped from Pfam or TIGRFAM, and EC numbers were mapped from the GO database.
Paint between the green (tagged) nodes. When a green node connects two or more colors, recolor one color (merging those partitions).