This document summarizes a presentation given by Adina Howe at the ASMWorkshop in May 2013 on visualizing complexity in metagenomics data. It lists her collaborators at Michigan State University and Argonne National Laboratory. It discusses challenges in sequencing metagenomics samples like errors, diversity, and low abundance of sequences. It describes techniques like digital normalization and partitioning that can help scale assembly of large metagenomics datasets. It addresses questions around memory requirements, evaluating assemblies, and studying microbial communities and environments.
ASM 2013 Metagenomic Assembly Workshop Slides
1. Adina Howe
Michigan State University, Adjunct
Argonne National Laboratory, Postdoc
ASMWorkshop, May 2013
Visual Complexity
http://www.flickr.com/photos/maisonbisson
2. Titus Brown
Jim Tiedje
Jason Pell
Qingpeng Zhang
Jordan Fish
Eric McDonald
Chris Welcher
Aaron Garoutte
Jiarong Guo
Janet Jansson
Susannah Tringe
MSU Lab: Collaborators:
3. I will upload this on slideshare (adinachuanghowe)
Khmer documentation
github.com/ged-lab/khmer/
https://khmer.readthedocs.org/en/latest/guide.html
Manuscripts
Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
http://www.pnas.org/content/early/2012/07/25/1121464109
A reference-free algorithm for computational normalization of shotgun sequencing data
http://arxiv.org/abs/1203.4802
Assembling large, complex metagenomes
http://arxiv.org/abs/1212.2832
4. High Abundance
Low Abundance
In the environment (Our goal)
In our hands
A few gotchas of sequencing:
Errors / Artifacts (confusion)
Diversity / Complexity (scale)
5. High Abundance
Low Abundance
In the environment (Our goal)
In our hands
1. Digital normalization (lossy compression)
2. Partitioning
3. Enabling use of current, previously unusable assembly tools
6. Reduces data for analysis
Longer sequences (increased accuracy of annotation)
Gene order
Does not rely on known references, access to unknowns
Creates new references
Lots of assembly tools available
But…
7. But…
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
High memory requirements
Depends on good (~10x) sequencing coverage
8. “Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.
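A minimal sketch of the coverage definition above, assuming reads are given as hypothetical (start, length) placements on a genome of known length. The function names are illustrative only and are not part of khmer.

```python
def per_base_coverage(genome_length, placements):
    """Return a list with the number of reads overlapping each base.

    placements: iterable of (start, read_length) pairs, 0-based starts
    (a hypothetical input format chosen for this sketch).
    """
    depth = [0] * genome_length
    for start, read_length in placements:
        # Clip reads that would run past the end of the genome.
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return depth


def mean_coverage(genome_length, placements):
    """Average number of reads overlapping each base of the genome."""
    depth = per_base_coverage(genome_length, placements)
    return sum(depth) / genome_length


# Toy example: a 20 bp genome, with one 5 bp read starting at every
# position where a whole read fits.
toy_placements = [(start, 5) for start in range(0, 16)]
toy_depth = per_base_coverage(20, toy_placements)
toy_mean = mean_coverage(20, toy_placements)
```

Drawing "a line straight down" through the reads at one position corresponds to reading a single entry of `toy_depth`; the mean over all positions is the coverage quoted for the whole genome.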
9. Note that k-mer abundance is not properly represented here! Each
blue k-mer will be present around 10 times.
10. Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
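A small sketch of why a single base error generates roughly k new k-mers: every k-mer window that covers the erroneous position changes. The sequence, error position, and helper names below are made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmers_from_error(true_seq, pos, wrong_base, k):
    """k-mers that appear only because of a substitution at position pos."""
    read = true_seq[:pos] + wrong_base + true_seq[pos + 1:]
    return kmers(read, k) - kmers(true_seq, k)


# Hypothetical 20 bp "true" sequence with a C->A substitution in the middle.
true_seq = "ACGTTGCAAGCTTAGGCTAC"
novel = novel_kmers_from_error(true_seq, pos=10, wrong_base="A", k=4)
```

Here the error sits far enough from either end that all k windows covering it are affected; an error near the end of a read touches fewer windows, which is why the slide says "~k".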
15. Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!
Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
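The arithmetic behind the dilution example, as a short sketch. The function names are hypothetical; the only inputs are the abundance ratio (10:1) and the target coverage (10x) stated above.

```python
def abundant_coverage(target_rare_coverage, abundance_ratio):
    """Coverage the abundant genome reaches once the rare genome reaches
    target_rare_coverage, given abundant:rare = abundance_ratio:1."""
    return target_rare_coverage * abundance_ratio


def redundant_fraction(target_rare_coverage, abundance_ratio):
    """Fraction of the abundant genome's coverage beyond the target."""
    total = abundant_coverage(target_rare_coverage, abundance_ratio)
    return (total - target_rare_coverage) / total


coverage_of_a = abundant_coverage(10, 10)   # A sequenced to this depth
excess_of_a = redundant_fraction(10, 10)    # share of A's coverage above 10x
```

With these numbers, nine tenths of the coverage of A is beyond the 10x target, which is the data that can be discarded.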
22. Diginorm, a digital analog to cDNA library normalization:
Is reference free;
Is single pass: looks at each read only once;
Does not “collect” the majority of errors;
Keeps all low-coverage reads;
Smooths out coverage of regions.
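A minimal sketch of the single-pass idea listed above: estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. This sketch assumes an exact Python dict for counting, ignores reverse complements, and uses illustrative values for `k` and `cutoff`; khmer itself uses a memory-efficient probabilistic counting structure, so this is not its implementation.

```python
from statistics import median


def read_kmers(seq, k):
    """All k-mers of a read, in order (duplicates kept)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass over reads; keep a read only if its estimated coverage
    (median count of its k-mers among reads kept so far) is below cutoff."""
    counts = {}   # exact k-mer counts (an assumption of this sketch)
    kept = []
    for read in reads:
        kms = read_kmers(read, k)
        if not kms:
            continue  # read shorter than k, nothing to estimate
        estimated_coverage = median(counts.get(km, 0) for km in kms)
        if estimated_coverage < cutoff:
            kept.append(read)
            # Only kept reads update the counts, so k-mers from discarded
            # high-coverage reads (including their errors) are not collected.
            for km in kms:
                counts[km] = counts.get(km, 0) + 1
    return kept


# Toy data: one abundant sequence seen 10 times, one rare sequence seen once.
abundant = "ACGTTGCAAGCTTAGGCTAC"
rare = "GGATCCATGGTA"
kept_reads = digital_normalization([abundant] * 10 + [rare], k=4, cutoff=3)
```

In the toy run, the abundant read is kept until its estimated coverage reaches the cutoff and is discarded after that, while the rare read is always kept.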
23. Digital normalization produces “good” metagenome assemblies.
Smooths out abundance variation, strain variation.
Reduces computational requirements for assembly.
It also kinda makes sense :)
24. Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
“Divide and conquer”
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
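A rough sketch of binning by connectivity, assuming two reads are "connected" when they share at least one exact k-mer, and using a union-find structure to merge connected reads into bins. This is a simplification for illustration (no reverse complements, exact dict storage); it is not the probabilistic de Bruijn graph approach of Pell et al.

```python
def partition_reads(reads, k=20):
    """Group reads into bins of transitively connected reads, where two
    reads are connected if they share an exact k-mer."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the representative of i, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Toy data: two overlapping reads from each of two unrelated sequences.
toy_reads = ["ACGTTGCAAGCT", "CAAGCTTAGGCTAC", "GGATCCATG", "CCATGGTA"]
toy_bins = partition_reads(toy_reads, k=4)
```

Each resulting bin can then be assembled separately, which is the "divide and conquer" step; the memory cost of this sketch is dominated by `first_seen`, which is the part a memory-efficient implementation would replace.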
28. Low coverage is the dominant problem blocking assembly of
your soil metagenome.
29. In order to build assemblies, each assembler
makes choices – uses heuristics – to reach a
conclusion.
These heuristics may not be appropriate for your
sample!
High polymorphism?
Mixed population vs clonal?
Genomic vs metagenomic vs mRNA
Low coverage drives differences in assembly.
31. We can assemble virtually anything but soil ;).
Genomes, transcriptomes, MDA, mixtures, etc.
Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).
Strain variation confuses assembly, but does not prevent useful results.
Diginorm is a systematic strategy to enable assembly.
Banfield has shown how to deconvolve strains at differential abundance.
Kostas K.'s results suggest that there will be a species gap sufficient to prevent contig misassembly.
32. Most metagenomes require 50-150 GB of RAM.
Many people don’t have access to computers of that size.
Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.
http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html
33. Optimizing our programs => faster.
Building an evaluation framework for
metagenome assemblers.
Error correction!
34. Achieving one or more assemblies is fairly straightforward.
An assembly is a hypothesis. Evaluating assemblies is challenging, however, and is where you should be thinking hardest.
There are relatively few pipelines available for analyzing assembled metagenomic data.
36. How do we study complexity? Interactions? Diversity?
Communities? Evolution? Our environment?
Visual Complexity
http://www.flickr.com/photos/maisonbisson
• Major efforts of data collection
• Open mind for discoveries
• Willingness to adjust to change
• Multiple efforts
• Well-designed experiments
Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes
37. We receive Gb of sequences
Generally, my data is…
Split by barcodes
Untrimmed
Adapters are present
Two paired end fastq files
Underestimation of computational requirements:
Quality control steps usually require 2-3 times the amount of hard drive space
Similarity comparison against known databases is impractical (soil metagenome ~50 years to BLAST)
Home Alone Scream
My first slide graphic that I’m scared may date me.
38. Two ways to reduce the onslaught:
Cluster into known observances (annotate, bin)
Assembly
Some mix of the above
39. Ten of you upload 1 HiSeq flowcell into MG-RAST
40. Illumina short reads from soil metagenome (~100 bp)
454 short reads from soil metagenome (~368 bp)
Assembled contigs (Illumina) from soil metagenome (~491 bp)
Read length will increase… computational requirements? Assembly is a great way to reduce data.