2012 talk to CSE department at U. Arizona
1. Streaming lossy compression of biological sequence
data using probabilistic data structures
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
August 2012
ctb@msu.edu
4. We practice open science!
“Be the change you want”
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ('titus brown blog')
Twitter: @ctitusbrown
Grants on Lab Web site:
http://ged.msu.edu/interests.html
Preprints: on arXiv, q-bio ('diginorm arxiv')
6. Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
8. Sequencers also produce
errors…
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
9. Shotgun sequencing & assembly
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
10. Assembly – no subdivision!
Assembly is inherently an all by all process. There
is no good way to subdivide the reads without
potentially missing a key connection
11. Assembly – no subdivision!
Assembly is inherently an all by all process. There
is no good way to subdivide the reads without
potentially missing a key connection
I am, of course, lying. There were no good ways…
12. Four main challenges for de novo
sequencing.
Repeats.
Low coverage.
Errors.
These introduce breaks in the construction of contigs.
Variation in coverage – transcriptomes and metagenomes, as well as amplified genomic DNA.
This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.
13. Repeats
Overlaps don't place sequences uniquely when there are repeats present.
UMD assembly primer (cbcb.umd.edu)
15. Coverage
"1x" doesn't mean every DNA sequence is read once.
It means that, if sampling were systematic, it would be.
Sampling isn't systematic, it's random!
17. Actual coverage varies widely from the average.
Low coverage introduces unavoidable breaks.
18. Two basic assembly approaches
Overlap/layout/consensus
De Bruijn or k-mer graphs
The former is used for long reads, esp all Sanger-
based assemblies. The latter is used because of
memory efficiency.
20. K-mer graph
Break reads (of any length) down into multiple
overlapping words of fixed length k.
ATGGACCAGATGACAC (k=12) =>
ATGGACCAGATG
TGGACCAGATGA
GGACCAGATGAC
GACCAGATGACA
ACCAGATGACAC
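The decomposition above can be sketched in a few lines of Python (a generic helper for illustration, not the actual khmer implementation):

```python
def kmers(seq, k):
    """Yield every overlapping window of length k in the sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# The slide's example: a 16 bp sequence yields five 12-mers.
print(list(kmers("ATGGACCAGATGACAC", 12)))
```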
21. K-mer graphs - overlaps
J.R. Miller et al. / Genomics (2010)
22. K-mer graph (k=14)
Each node represents a 14-mer;
Links between each node are 13-mer overlaps
23. K-mer graph (k=14)
Branches in the graph represent partially overlapping sequences.
29. The scale of the problem is stunning.
I estimate a worldwide capacity for DNA sequencing of 15 petabases/yr (it's probably larger).
Individual labs can generate ~100 Gbp in ~1 week for
$10k.
This sequencing is at a boutique level:
Sequencing formats are semi-standard.
Basic analysis approaches are ~80% cookbook.
Every biological prep, problem, and analysis is different.
Traditionally, biologists receive no training in
computation. (And computational people receive no
training in biology :)
…and our computational infrastructure is optimizing
for high performance computing, not high throughput.
30. My problems are also very
annoying…
(From Monday seminar) Est ~50 Tbp to
comprehensively sample the microbial
composition of a gram of soil.
Currently we have approximately 2 Tbp spread
across 9 soil samples.
Need 3 TB RAM on single chassis to do
assembly of 300 Gbp.
…estimate 500 TB RAM for 50 Tbp of sequence.
That just won't do.
31. Theoretical => applied solutions.
Theoretical advances in data structures and algorithms =>
practically useful & usable implementations, at scale =>
demonstrated effectiveness on real data.
32. Three parts to our solution.
1. Adaptation of a suite of probabilistic data
structures for representing set membership and
counting (Bloom filters and CountMin Sketch).
2. An online streaming approach to lossy
compression.
3. Compressible de Bruijn graph representation.
33. 1. CountMin Sketch
To add element: increment associated counter at all hash locales
To get count: retrieve minimum counter across all hash locales
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
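A minimal CountMin sketch along these lines (illustrative only: the width, depth, and md5-based hashing are arbitrary choices for the sketch, not khmer's actual parameters):

```python
import hashlib

class CountMinSketch:
    """d counter arrays of width w, each indexed by an independent hash."""
    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One salted hash per table; md5 stands in for d independent hashes.
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, key):
        # To add an element: increment the counter at every hash locale.
        for i, idx in enumerate(self._indexes(key)):
            self.tables[i][idx] += 1

    def count(self, key):
        # To get a count: take the minimum counter across all hash locales.
        return min(self.tables[i][idx]
                   for i, idx in enumerate(self._indexes(key)))

cms = CountMinSketch(width=1000, depth=4)
for kmer in ["ATGGACCAGATG"] * 5 + ["TGGACCAGATGA"]:
    cms.add(kmer)
print(cms.count("ATGGACCAGATG"))  # at least 5; CountMin never undercounts
```

Taking the minimum is what bounds the error: collisions can only inflate a counter, so the smallest counter is the best (and a one-sided) estimate.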
35. …and does not introduce significant
miscounts on NGS data sets.
36. 2. Online, streaming, lossy compression. (NOVEL)
Much of next-gen sequencing is redundant.
37. Uneven coverage => even more redundancy (NOVEL)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!
Overkill!!
This 100x will consume disk space and, because of errors, memory.
40. Digital normalization algorithm
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read
Note: single pass; fixed memory.
41. The median k-mer count in a "sentence" is a good estimator of redundancy within the graph.
This gives us a reference-free measure of coverage.
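Putting the last two slides together, here is a toy, exact-counting version of digital normalization; a plain dict stands in for the CountMin sketch, and K and CUTOFF are arbitrary illustrative values:

```python
from collections import defaultdict
from statistics import median

K = 12
CUTOFF = 3
kmer_counts = defaultdict(int)

def estimated_coverage(read):
    """Median count of the read's k-mers: a reference-free coverage estimate."""
    return median(kmer_counts[read[i:i + K]]
                  for i in range(len(read) - K + 1))

def normalize(reads):
    kept = []
    for read in reads:                        # single pass over the data
        if estimated_coverage(read) < CUTOFF:
            for i in range(len(read) - K + 1):
                kmer_counts[read[i:i + K]] += 1
            kept.append(read)
        # else: discard -- this region of the graph is already covered
    return kept

# Ten identical reads: only the first CUTOFF copies survive.
print(len(normalize(["ATGGACCAGATGACAC"] * 10)))  # 3
```

Because reads are only counted when kept, memory grows with the underlying genome's k-mer content rather than with the dataset size.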
43. Contig assembly now scales with underlying genome size
Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
Memory efficiency is improved by use of the CountMin Sketch.
44. 3. Compressible de Bruijn graphs (NOVEL)
Each node represents a 14-mer;
Links between each node are 13-mer overlaps
45. Can store implicit de Bruijn graphs in a Bloom filter
[Diagram: the k-mers of AGTCGGCATGAC (k=6) – AGTCGG, GTCGGC, TCGGCA, CGGCAT, GGCATG, GCATGA, CATGAC – are inserted into a Bloom filter; edges are recovered implicitly by querying each one-base extension (…A, …C, …G, …T) for membership.]
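The idea can be sketched with a plain Python set standing in for the Bloom filter (same membership-query interface, but exact rather than probabilistic):

```python
def store_graph(seq, table, k=6):
    """Insert every k-mer of the sequence into the membership table."""
    for i in range(len(seq) - k + 1):
        table.add(seq[i:i + k])

def neighbors(kmer, table):
    """Recover edges implicitly: query all four one-base extensions."""
    return [kmer[1:] + base for base in "ACGT"
            if kmer[1:] + base in table]

table = set()  # a real implementation uses a Bloom filter here
store_graph("AGTCGGCATGAC", table)
print(neighbors("AGTCGG", table))  # ['GTCGGC']
```

No edges are stored at all: the graph structure is reconstructed on the fly from membership queries, which is what makes the representation so compact (at the cost of occasional false-positive edges).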
50. Equivalent to bond percolation problem; percolation
threshold independent of k (?)
51. This data structure is strikingly
efficient for storing sparse k-mer
graphs.
"Exact" is the best possible information-theoretic storage.
52. We implemented graph partitioning on top of this probabilistic de Bruijn graph.
Split reads into "bins" belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
53. Partitioning scales assembly for a
subset of problems.
Can be done in ~10x less memory than assembly.
Partition at low k and assemble exactly at any higher
k (DBG).
Partitions can then be assembled independently
Multiple processors -> scaling
Multiple k, coverage -> improved assembly
Multiple assembly packages (tailored to high
variation, etc.)
Can eliminate small partitions/contigs in the
partitioning phase.
An incredibly convenient approach enabling divide &
conquer approaches across the board.
54. Technical challenges met (and defeated)
Exhaustive in-memory traversal of graphs
containing 5-15 billion nodes.
Sequencing technology introduces false
connections in graph (Howe et al., in prep.)
Implementation lets us scale ~20x over other
approaches.
56. Our approaches yield a variety of strategies…
[Diagram: metagenomic data -> partitioning -> several independent assemblies; shotgun data -> digital normalization -> assembly.]
57. Concluding thoughts, thus far
Our approaches provide significant and
substantial practical and theoretical leverage to
one of the most challenging current problems in
computational biology: assembly.
They also improve quality of analysis, in some
cases.
They provide a path to the future:
Many-core compatible; distributable?
Decreased memory footprint => cloud computing
can be used for many analyses.
They are in use: ~dozens of labs are using digital normalization.
58. Future research
Many directions in the works! (see posted grant
props)
Theoretical groundwork for normalization
approach.
Graph search & alignment algorithms.
Error detection & correction.
Resequencing analysis.
Online (“infinite”) assembly.
60. Running HMMs over de Bruijn graphs (=> cross validation)
hmmgs: Assemble based on good-scoring HMM paths through the graph.
Independent of other assemblers; very sensitive, specific.
95% of hmmgs rplB domains are present in our partitioned assemblies.
Jordan Fish, Qiong Wang, and Jim Cole (RDP)
61. Side note: error correction is the
biggest “data” problem left in
sequencing.
Both for mapping & assembly.
62. Streaming error correction.
[Diagram: First pass, over all reads – does the read come from a high-coverage locus? If yes, error-correct low-abundance k-mers in the read; if no, add the read to the graph and save it for later. Second pass, over the saved reads only – does the read come from a now high-coverage locus? If yes, error-correct low-abundance k-mers in the read; if no, leave it unchanged.]
We can do error trimming of genomic, MDA, transcriptomic, and metagenomic data in < 2 passes, fixed memory.
We have just submitted a proposal to adapt Euler- or Quake-like error correction (e.g. spectral alignment).
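A toy version of the two-pass scheme (exact dict counts instead of a streaming counting structure; `correct` is a hypothetical stand-in, since the actual correction step is the subject of the proposal):

```python
from collections import defaultdict
from statistics import median

def two_pass_correct(reads, k=4, cutoff=3):
    counts = defaultdict(int)

    def coverage(read):
        # Median k-mer count, as in digital normalization.
        return median(counts[read[i:i + k]]
                      for i in range(len(read) - k + 1))

    def add_to_graph(read):
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1

    def correct(read):
        return read  # placeholder for real low-abundance k-mer correction

    saved, done = [], []
    for read in reads:                  # first pass: all reads
        if coverage(read) >= cutoff:    # high-coverage locus? correct now
            done.append(correct(read))
        else:                           # otherwise bank it for the second pass
            add_to_graph(read)
            saved.append(read)
    for read in saved:                  # second pass: saved reads only
        if coverage(read) >= cutoff:    # now high-coverage? correct
            done.append(correct(read))
        else:
            done.append(read)           # leave unchanged
    return done, len(saved)

done, n_saved = two_pass_correct(["ATGGACCAGATG"] * 5)
print(n_saved)  # 3 -- only the early, low-coverage reads are revisited
```

The key property is that only reads seen before their locus reached high coverage ever need a second look, so the second pass touches a small fraction of the data.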
Speaker notes
High coverage is essential.
Note, no tolerance for indels.
Note that any such measure will do.
Goal is to do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
Completely different style of assembler; useful for cross validation.