2013 hmp-assembly-webinar
1. C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
ctb@msu.edu
HMP – Metagenome assembly
2. Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Jordan Fish
• Chris Welcher
Collaborators
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI
Funding
USDA NIFA; NSF IOS;
BEACON.
3. Open, online science
All of the software and approaches I’m talking about
today are available:
Assembling large, complex metagenomes
arxiv.org/abs/1212.2832
khmer software:
github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
4. Illumina! De Bruijn graphs!
• Today I’ll be talking about Illumina data
sets, and de Bruijn graph assembly (k-mer
assembly).
• This is because my research has largely
focused on scaling to large data sets (soil
metagenomics!) and Illumina is the real
scaling challenge.
5. Assembler heuristics
• In order to build assemblies, each assembler
makes choices – uses heuristics – to reach a
conclusion.
• These heuristics may not be appropriate for your
sample!
– High polymorphism?
– Mixed population vs clonal?
– Genomic vs metagenomic vs mRNA
– Low coverage drives differences in assembly.
7. Shotgun sequencing
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
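As a rough illustration of the definition above, the sketch below computes average coverage from read count, read length, and genome size, and also tallies per-base depth (the "line straight down" through the reads). The function names and the toy numbers are hypothetical and chosen only for this example.

```python
def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads overlapping each base: total bases sequenced / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def per_base_coverage(read_starts, read_length, genome_size):
    """Depth at every genome position, given the 0-based start of each read."""
    depth = [0] * genome_size
    for start in read_starts:
        # Each read adds one unit of depth to every position it spans.
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    # 1,000 reads of 100 bp over a 10 kb genome gives ~10x average coverage.
    print(average_coverage(1000, 100, 10_000))
    print(per_base_coverage([0, 2, 4], 4, 8))
```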
8. Reducing to k-mer overlaps
Note that k-mer abundance is not properly represented here! Each
blue k-mer will be present around 10 times.
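A minimal sketch of what "reducing to k-mers" means in practice: each read is broken into its overlapping substrings of length k, and abundance is counted across all reads. The helper names are hypothetical, and reverse complements are ignored here for simplicity.

```python
from collections import Counter


def kmers(seq: str, k: int):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k: int) -> Counter:
    """Abundance of every k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


if __name__ == "__main__":
    # Ten identical reads stand in for ~10x coverage of one region:
    # every k-mer from that region is then seen about 10 times.
    reads = ["ATGGCA"] * 10
    print(kmers("ATGGCA", 4))
    print(count_kmers(reads, 4))
```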
9. Errors create new k-mers
Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
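To make the "~k new k-mers" claim concrete, the sketch below substitutes a single base in a toy sequence and lists the k-mers that did not exist before. An error in the interior of a read touches k k-mers; an error near a read end touches fewer, hence the "~". The sequence and function names are made up for illustration.

```python
def kmer_list(seq: str, k: int):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def new_kmers_from_error(true_seq: str, pos: int, wrong_base: str, k: int):
    """K-mers that appear only because of a single substitution at `pos`."""
    if true_seq[pos] == wrong_base:
        raise ValueError("wrong_base must differ from the true base")
    erroneous = true_seq[:pos] + wrong_base + true_seq[pos + 1:]
    true_set = set(kmer_list(true_seq, k))
    return [km for km in kmer_list(erroneous, k) if km not in true_set]


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCATC"  # toy 20-base sequence, hypothetical
    # Interior error: every one of the k windows covering the base is new.
    print(new_kmers_from_error(seq, 10, "T", 5))
    # Error at the very first base: only one window covers it.
    print(new_kmers_from_error(seq, 0, "T", 5))
```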
13. Approach: Digital normalization
(a computational version of library normalization)
Suppose you have a
dilution factor of A (10) to
B(1). To get 10x of B you
need to get 100x of A!
Overkill!!
This 100x will consume disk
space and, because of
errors, memory.
We can discard it for you…
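The arithmetic on this slide can be written out as a small calculation: if the rarest member must reach a target coverage, every other member is sequenced in proportion to its relative abundance. The function name and the abundance dictionary are hypothetical.

```python
def coverage_needed(relative_abundance: dict, target_coverage: float) -> dict:
    """Coverage each member ends up with when the rarest one reaches the target."""
    rarest = min(relative_abundance.values())
    return {
        name: target_coverage * abundance / rarest
        for name, abundance in relative_abundance.items()
    }


if __name__ == "__main__":
    # A is 10 times as abundant as B; we want 10x coverage of B.
    print(coverage_needed({"A": 10, "B": 1}, 10))
```

Everything sequenced from A beyond what assembly actually needs is the redundancy that digital normalization aims to discard.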
20. Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Reference free.
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
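The properties listed above can be sketched in a few lines. This is an illustrative sketch only, not the khmer implementation: it assumes a read's coverage is estimated by the median abundance of its k-mers among the reads kept so far, and it uses an exact in-memory counter that would not scale to real data sets. The parameter names `k` and `cutoff` and their defaults are assumptions made for this example.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k: int = 20, cutoff: int = 20):
    """Single pass over reads; keep a read only if its estimated coverage is below cutoff."""
    counts = Counter()  # k-mer abundances among the reads kept so far
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        # Median k-mer abundance as a reference-free coverage estimate.
        # A few erroneous (count 0 or 1) k-mers barely shift the median.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
        # Otherwise the read is redundant; it is dropped and never counted,
        # so its errors are not accumulated either.
    return kept


if __name__ == "__main__":
    abundant = ["ACGTACGTTGCA"] * 50  # stands in for a high-coverage region
    rare = ["TTGACCGGTAAC"] * 2       # stands in for a low-coverage region
    kept = digital_normalization(abundant + rare, k=4, cutoff=5)
    print(len(kept), "reads kept out of", len(abundant) + len(rare))
```

In this toy run the low-coverage reads all survive while most of the high-coverage reads are discarded, which is the "smoothing" effect described on the slide.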
22. Coverage after digital normalization:
Normalizes coverage
Discards redundancy
Eliminates majority of
errors
Scales assembly dramatically.
Assembly is 98% identical.
23. In our experience…
• Digital normalization produces “good”
metagenome assemblies.
• Smooths out abundance variation, strain
variation.
• Reduces computational requirements for
assembly.
• It also kinda makes sense :)
24. Additional Approach for
Metagenomes: Data partitioning
(a computational version of cell sorting)
Split reads into “bins”
belonging to different
source species.
Can do this based almost
entirely on connectivity
of sequences.
“Divide and conquer”
Memory-efficient
implementation helps
to scale assembly.
Pell et al., 2012, PNAS
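A toy version of connectivity-based binning is sketched below. It is a simplification and not the memory-efficient implementation referred to on the slide: reads are treated as connected when they share at least one exact k-mer, reverse complements are ignored, and a plain union-find structure groups them. All names are hypothetical.

```python
def partition_reads(reads, k: int):
    """Group reads into partitions of reads connected through shared k-mers."""
    parent = list(range(len(reads)))

    def find(i: int) -> int:
        # Find the representative of i's set, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)  # shared k-mer: same partition
            else:
                first_seen[kmer] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


if __name__ == "__main__":
    reads = ["ACGTACGTTG", "CGTTGCAATC",   # overlap on CGTTG
             "GGGCCCTTTA", "CCTTTAAAGG"]   # overlap on CCTTT
    for partition in partition_reads(reads, 5):
        print(partition)
```

Each resulting partition could then be assembled independently, which is the "divide and conquer" step.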
25. Partitioning separates reads by genome.
Strain variants co-partition.
When computationally spiking HMP mock data with one E. coli
genome (left) or multiple E. coli strains (right), majority of partitions
contain reads from only a single genome (blue) vs multi-genome
partitions (green).
Partitions containing spiked data are indicated with a *.
Adina Howe
26. Conclusions re strain
variation/chimerism (previous slide)
• When spiking in intentionally complex
mixtures, only a small fraction of partitions
are chimeric.
• This means that only a small fraction of
contigs could be chimeric.
• Strain variants will almost certainly assemble
together.
• Can separate on abundance.
See Sharon et al., 2013, PMID 22936250, for Banfield work on this.
31. Our experience
• Our metagenome assemblies compare well with
others, but we have little in the way of ground
truth with which to evaluate.
• Scaffold assembly is tricky; we believe in contig
assembly for metagenomes, but not scaffolding.
• See arXiv paper, “Assembling large, complex
metagenomes”, for our suggested pipeline and
statistics & references.
35. Overconfident predictions
• We can assemble virtually anything but soil ;).
– Genomes, transcriptomes, MDA, mixtures, etc.
– Repeat resolution will be fundamentally limited by
sequencing technology (insert size; sampling depth)
• Strain variation confuses assembly, but does not
prevent useful results.
– Diginorm is a systematic strategy to enable assembly.
– Banfield has shown how to deconvolve strains at
differential abundance.
– Kostas K. results suggest that there will be a species gap
sufficient to prevent contig misassembly.
– Even genes “chimeric” between strains are useful.
36. Reasons why you shouldn’t believe me
1) Strain variation – when we get deeper in soil, we
should see more (?). Not sure what will
happen, and we do not (yet) have proven
approaches.
2) We, by definition, are not yet seeing anything
that doesn’t assemble.
3) We have not tackled scaffolding much. Serious
investigation of scaffolding will be necessary for
any good genome assembly, and scaffolding is
a weak point.
38. Assembling in the cloud
• Most metagenomes require 50-150 GB of RAM.
• Many people don’t have access to computers of
that size.
• Amazon Web Services (aws.amazon.com) will
happily rent you such computers for $1-2/hr.
• I will post instructions and sample data sets for
using Amazon today at ged.msu.edu/angus/.
39. Current research
• Optimizing our programs => faster.
• Building an evaluation framework for
metagenome assemblers.
• Error correction!
40. De novo metagenome error correction
makes reads more mappable.
Jason Pell, unpub.
41. Concluding thoughts
• Achieving one or more assemblies is fairly
straightforward.
• Evaluating them is challenging, however, and
is where you should be thinking hardest about
assembly.
• There are relatively few pipelines available for
analyzing assembled metagenomic data.
MG-RAST does support this; others?
Editor's notes
Bad habit…
Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to e.g. marine free spawning animals.