2013 hmp-assembly-webinar

C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
ctb@msu.edu
HMP – Metagenome assembly

Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Jordan Fish
• Chris Welcher
Collaborators
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI
Funding
USDA NIFA; NSF IOS; BEACON.

Open, online science
All of the software and approaches I’m talking about
today are available:
Assembling large, complex metagenomes
arxiv.org/abs/1212.2832
khmer software:
github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown

Illumina! De Bruijn graphs!
• Today I’ll be talking about Illumina data
sets, and de Bruijn graph assembly (k-mer
assembly).
• This is because my research has largely
focused on scaling to large data sets (soil
metagenomics!) and Illumina is the real
scaling challenge.

Assembler heuristics
• In order to build assemblies, each assembler
makes choices – uses heuristics – to reach a
conclusion.
• These heuristics may not be appropriate for your
sample!
– High polymorphism?
– Mixed population vs clonal?
– Genomic vs metagenomic vs mRNA
– Low coverage drives differences in assembly.

Evaluating assembly
Figure: Reads (noisy observations of some genome) → Assembler (a Big Black Box) → Predicted genome.
Evaluating correctness of metagenomes is still undiscovered country.

Shotgun sequencing
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10: draw a line straight down from the top
through all of the reads and count the reads it crosses.
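
The coverage definition above can be sketched in a few lines of Python. This is an illustration only: the function names and the example numbers (500 reads of 100 bases on a 5000-base genome) are assumptions, not values from the talk.

```python
def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage = total sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def per_base_coverage(read_starts, read_length, genome_size):
    """Count how many reads overlap each genome position.

    This is the "draw a line straight down" picture: depth[pos] is the
    number of reads crossed by a vertical line at position pos.
    """
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


# Hypothetical numbers chosen to give the ~10x coverage of the slide.
example_coverage = average_coverage(num_reads=500, read_length=100, genome_size=5000)
example_depth = per_base_coverage(read_starts=[0, 2, 4], read_length=5, genome_size=10)
```

Real coverage varies along the genome because reads are sampled at random; the average is only a summary of the per-base depth.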

Reducing to k-mers overlaps
Note that k-mer abundance is not properly represented here! Each
blue k-mer will be present around 10 times.

Errors create new k-mers
Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
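
To see why one base error yields about k new k-mers, here is a small sketch. The toy sequence, the value k=4, and the helper names are assumptions chosen for illustration; an error near the end of a read overlaps fewer k-mers, hence the "~".

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmers_from_error(seq: str, pos: int, new_base: str, k: int) -> set:
    """k-mers that exist only because of a substitution at position pos."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    return kmers(mutated, k) - kmers(seq, k)


true_seq = "ACGTTGCAAGGCTTAC"  # toy "genome" fragment (hypothetical)

# An error in the middle of the read is covered by k different k-mers.
middle_error = novel_kmers_from_error(true_seq, pos=8, new_base="T", k=4)

# An error at the very first base is covered by only one k-mer.
edge_error = novel_kmers_from_error(true_seq, pos=0, new_base="T", k=4)
```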

So, k-mer abundance plots are
mixtures of true and false k-mers.

Counting k-mers - histograms
Low-abundance peak (errors)

Counting k-mers - histograms
High-abundance peak
(true k-mers)
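
A k-mer abundance histogram like the ones on these slides can be computed as below. This is a minimal sketch using an exact in-memory counter; it does not merge reverse-complement k-mers, and the toy reads are made up for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance -> number of distinct k-mers with that abundance."""
    return Counter(counts.values())


# Ten error-free copies of a read plus one copy with a single base error.
toy_reads = ["ACGTAC"] * 10 + ["ACGTTC"]
toy_counts = count_kmers(toy_reads, k=4)
toy_histogram = abundance_histogram(toy_counts)
```

In this toy data the k-mers created by the error sit at abundance 1 (the low-abundance peak), while the true k-mers sit near the coverage of 10 (the high-abundance peak).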

Approach: Digital normalization
(a computational version of library normalization)
Suppose genomes A and B are present at a 10:1 ratio
(a dilution factor of 10). To get 10x coverage of B,
you need 100x coverage of A!
Overkill!!
This 100x will consume disk space and, because of
errors, memory.
We can discard it for you…
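
The arithmetic behind the 100x figure, as a small sketch. It assumes that all genomes are the same size, so that coverage is proportional to relative abundance; the function name is hypothetical.

```python
def coverage_when_rarest_reaches(target_coverage, relative_abundance):
    """Coverage of every genome once the rarest one reaches target_coverage.

    Assumption: genomes are of equal size, so coverage scales linearly
    with relative abundance in the sample.
    """
    rarest = min(relative_abundance.values())
    return {
        name: target_coverage * abundance / rarest
        for name, abundance in relative_abundance.items()
    }


# A is 10 times more abundant than B; we want 10x coverage of B.
needed = coverage_when_rarest_reaches(10, {"A": 10, "B": 1})
```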

Digital normalization approach
Diginorm, a digital analog to cDNA library normalization:
• Is reference free;
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
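
The properties listed above can be illustrated with a minimal sketch of median-based normalization: a read is kept only if the median count of its k-mers, among reads kept so far, is below a cutoff. This is not the khmer implementation. It uses an exact Python counter rather than a memory-efficient counting structure, skips reads shorter than k, and the parameter values are assumptions for illustration.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=20, cutoff=20):
    """Single-pass, reference-free coverage normalization (sketch)."""
    counts = Counter()  # k-mer counts from reads kept so far
    kept = []
    for read in reads:  # each read is examined exactly once
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k: skipped in this sketch
        # Median k-mer count is a proxy for how well covered this
        # region already is; a few erroneous k-mers do not move it much.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
        # Otherwise the read is redundant and is discarded without
        # adding its k-mers (and its errors) to the table.
    return kept


# Toy data: one sequence sampled 50 times, another sampled once.
abundant = "ACGTACGGTCA"
rare = "TTGACCATGGA"
kept_reads = digital_normalization([abundant] * 50 + [rare], k=4, cutoff=5)
```

In the toy run, the abundant sequence is trimmed down to the cutoff while the single low-coverage read is retained.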

Coverage before digital normalization:
(MD amplified)

Coverage after digital normalization:
Normalizes coverage
Discards redundancy
Eliminates majority of
errors
Scales assembly dramatically.
Assembly is 98% identical.

In our experience…
• Digital normalization produces “good”
metagenome assemblies.
• Smooths out abundance variation, strain
variation.
• Reduces computational requirements for
assembly.
• It also kinda makes sense :)

Additional Approach for
Metagenomes: Data partitioning
(a computational version of cell sorting)
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
“Divide and conquer”
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
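
Partitioning by connectivity can be sketched with a union-find structure: reads that share a k-mer are placed in the same bin, and bins are merged transitively. This is a simplification for illustration, not the method of Pell et al.: it links reads only through exactly shared k-mers, stores everything in an ordinary dictionary rather than a memory-efficient graph representation, and all names are hypothetical.

```python
def partition_reads(reads, k):
    """Group reads into connected components based on shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the representative of i's component (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


# Two toy "genomes", two overlapping reads from each.
toy_partitions = partition_reads(
    ["ACGTACGG", "TACGGTCA", "TTTTGGGG", "TGGGGCCC"], k=4
)
```

Each resulting partition can then be assembled independently, which is the "divide and conquer" step.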

Partitioning separates reads by genome.
Strain variants co-partition.
When computationally spiking HMP mock data with one E. coli
genome (left) or multiple E. coli strains (right), the majority of partitions
contain reads from only a single genome (blue) rather than from multiple
genomes (green).
Partitions containing spiked data are indicated with a *.
Adina Howe

Conclusions re strain
variation/chimerism (previous slide)
• When spiking in intentionally complex
mixtures, only a small fraction of partitions
are chimeric.
• This means that only a small fraction of
contigs could be chimeric.
• Strain variants will almost certainly assemble
together.
• Can separate on abundance.
See Sharon et al., 2013, PMID 22936250, for Banfield work on this.

Looking at k-mer histograms…

Partitioning picks out different genomes

Error correction “fixes” k-mers
Jason Pell

Our experience
• Our metagenome assemblies compare well with
others, but we have little in the way of ground
truth with which to evaluate.
• Scaffold assembly is tricky; we believe in contig
assembly for metagenomes, but not scaffolding.
• See arXiv paper, “Assembling large, complex
metagenomes”, for our suggested pipeline and
statistics & references.

Metagenomic assemblies are highly variable
Adina Howe et al., arXiv 1212.0159

High coverage is needed.
Low coverage is the dominant problem blocking assembly of
your soil metagenome.

Strain variation (soil)
Figure: top two allele frequencies vs. position within contig.
Of the 5000 most abundant contigs, only 1 has a
polymorphism rate > 5%.
Can measure by read mapping.
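
One way to measure this from mapped reads is sketched below. The talk does not define "polymorphism rate", so the definition used here is an assumption: the fraction of contig positions where the second most frequent base reaches a minimum frequency. The input format (one dict of base counts per position) and the threshold are also hypothetical.

```python
def top_two_allele_frequencies(base_counts):
    """Frequencies of the two most common bases at one position."""
    total = sum(base_counts.values())
    if total == 0:
        return (0.0, 0.0)  # no reads mapped here
    ordered = sorted(base_counts.values(), reverse=True) + [0]
    return (ordered[0] / total, ordered[1] / total)


def polymorphism_rate(pileup, min_minor_frequency=0.1):
    """Fraction of positions whose second allele is at least
    min_minor_frequency (an assumed definition, see lead-in)."""
    if not pileup:
        return 0.0
    polymorphic = sum(
        1
        for counts in pileup
        if top_two_allele_frequencies(counts)[1] >= min_minor_frequency
    )
    return polymorphic / len(pileup)


# Toy pileup for a 4-base contig: base counts from mapped reads.
toy_pileup = [{"A": 10}, {"A": 8, "G": 2}, {"C": 10}, {"T": 19, "C": 1}]
toy_rate = polymorphism_rate(toy_pileup)
```

The threshold separates likely strain variants from isolated sequencing errors, which tend to appear at low frequency.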

Overconfident predictions
• We can assemble virtually anything but soil ;).
– Genomes, transcriptomes, MDA, mixtures, etc.
– Repeat resolution will be fundamentally limited by
sequencing technology (insert size; sampling depth)
• Strain variation confuses assembly, but does not
prevent useful results.
– Diginorm is a systematic strategy to enable assembly.
– Banfield has shown how to deconvolve strains at
differential abundance.
– Kostas K. results suggest that there will be a species gap
sufficient to prevent contig misassembly.
– Even genes “chimeric” between strains are useful.

Reasons why you shouldn’t believe me
1) Strain variation – when we get deeper in soil, we
should see more (?). Not sure what will
happen, and we do not (yet) have proven
approaches.
2) We, by definition, are not yet seeing anything
that doesn’t assemble.
3) We have not tackled scaffolding much. Serious
investigation of scaffolding will be necessary for
any good genome assembly, and scaffolding is
a weak point.

Metagenome assemblers
In addition to khmer prefiltering,
• SPAdes
• IDBA-UD
• MetaVelvet
• Ray Meta

Assembling in the cloud
• Most metagenomes require 50-150 GB of RAM.
• Many people don’t have access to computers of
that size.
• Amazon Web Services (aws.amazon.com) will
happily rent you such computers for $1-2/hr.
• I will post instructions and sample data sets for
using Amazon today at ged.msu.edu/angus/.

Current research
• Optimizing our programs => faster.
• Building an evaluation framework for
metagenome assemblers.
• Error correction!

De novo metagenome error correction
makes reads more mappable.
Jason Pell, unpub.

Concluding thoughts
• Achieving one or more assemblies is fairly
straightforward.
• Evaluating them is challenging, however, and is
where you should be thinking hardest about
assembly.
• There are relatively few pipelines available for
analyzing assembled metagenomic data.
MG-RAST does support this; others?
