2012 talk to CSE department at U. Arizona
1. Streaming lossy compression of biological sequence
data using probabilistic data structures
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
August 2012
ctb@msu.edu
4. We practice open science!
“Be the change you want”
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ('titus brown blog')
Twitter: @ctitusbrown
Grants on Lab Web site:
http://ged.msu.edu/interests.html
Preprints: on arXiv, q-bio ('diginorm arxiv')
6. Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
8. Sequencers also produce
errors…
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
9. Shotgun sequencing & assembly
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
10. Assembly – no subdivision!
Assembly is inherently an all by all process. There
is no good way to subdivide the reads without
potentially missing a key connection
11. Assembly – no subdivision!
Assembly is inherently an all by all process. There
is no good way to subdivide the reads without
potentially missing a key connection
I am, of course, lying. There were no good ways…
12. Four main challenges for de novo
sequencing.
Repeats.
Low coverage.
Errors.
These introduce breaks in the construction of contigs.
Variation in coverage – transcriptomes and metagenomes, as well as amplified genomic DNA.
This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.
13. Repeats
Overlaps don't place sequences uniquely when there are repeats present.
UMD assembly primer (cbcb.umd.edu)
15. Coverage
"1x" doesn't mean every DNA sequence is read once.
It means that, if sampling were systematic, it would be.
Sampling isn't systematic, it's random!
17. Actual coverage varies widely from the average.
Low coverage introduces unavoidable breaks.
18. Two basic assembly approaches
Overlap/layout/consensus
De Bruijn or k-mer graphs
The former is used for long reads, esp all Sanger-
based assemblies. The latter is used because of
memory efficiency.
20. K-mer graph
Break reads (of any length) down into multiple
overlapping words of fixed length k.
ATGGACCAGATGACAC (k=12) =>
ATGGACCAGATG
TGGACCAGATGA
GGACCAGATGAC
GACCAGATGACA
ACCAGATGACAC
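The decomposition above can be sketched in a few lines of Python (a generic helper for illustration, not the actual khmer implementation):

```python
def kmers(seq, k):
    """Yield every overlapping window of length k in the sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# The slide's example: a 16 bp sequence yields five 12-mers.
print(list(kmers("ATGGACCAGATGACAC", 12)))
```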
21. K-mer graphs - overlaps
J.R. Miller et al. / Genomics (2010)
22. K-mer graph (k=14)
Each node represents a 14-mer;
Links between each node are 13-mer overlaps
23. K-mer graph (k=14)
Branches in the graph represent partially overlapping sequences.
29. The scale of the problem is stunning.
I estimate a worldwide capacity for DNA sequencing of 15 petabases/yr (it's probably larger).
Individual labs can generate ~100 Gbp in ~1 week for
$10k.
This sequencing is at a boutique level:
Sequencing formats are semi-standard.
Basic analysis approaches are ~80% cookbook.
Every biological prep, problem, and analysis is different.
Traditionally, biologists receive no training in
computation. (And computational people receive no
training in biology :)
…and our computational infrastructure is optimizing
for high performance computing, not high throughput.
30. My problems are also very
annoying…
(From Monday seminar) Est ~50 Tbp to
comprehensively sample the microbial
composition of a gram of soil.
Currently we have approximately 2 Tbp spread
across 9 soil samples.
Need 3 TB RAM on single chassis to do
assembly of 300 Gbp.
…estimate 500 TB RAM for 50 Tbp of sequence.
That just won't do.
31. Theoretical => applied solutions.
Theoretical advances in data structures and algorithms =>
practically useful & usable implementations, at scale =>
demonstrated effectiveness on real data.
32. Three parts to our solution.
1. Adaptation of a suite of probabilistic data
structures for representing set membership and
counting (Bloom filters and CountMin Sketch).
2. An online streaming approach to lossy
compression.
3. Compressible de Bruijn graph representation.
33. 1. CountMin Sketch
To add element: increment associated counter at all hash locales
To get count: retrieve minimum counter across all hash locales
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
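A minimal CountMin sketch along these lines (illustrative only: the width, depth, and md5-based hashing are arbitrary choices for the sketch, not khmer's actual parameters):

```python
import hashlib

class CountMinSketch:
    """d counter arrays of width w, each indexed by an independent hash."""
    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One salted hash per table; md5 stands in for d independent hashes.
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, key):
        # To add an element: increment the counter at every hash locale.
        for i, idx in enumerate(self._indexes(key)):
            self.tables[i][idx] += 1

    def count(self, key):
        # To get a count: take the minimum counter across all hash locales.
        return min(self.tables[i][idx]
                   for i, idx in enumerate(self._indexes(key)))

cms = CountMinSketch(width=1000, depth=4)
for kmer in ["ATGGACCAGATG"] * 5 + ["TGGACCAGATGA"]:
    cms.add(kmer)
print(cms.count("ATGGACCAGATG"))  # at least 5; CountMin never undercounts
```

Taking the minimum is what bounds the error: collisions can only inflate a counter, so the smallest counter is the best (and a one-sided) estimate.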
35. …and does not introduce significant
miscounts on NGS data sets.
36. 2. Online, streaming, lossy compression. (NOVEL)
Much of next-gen sequencing is redundant.
37. Uneven coverage => even more redundancy (NOVEL)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!
Overkill!!
This 100x will consume disk space and, because of errors, memory.
40. Digital normalization algorithm
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read
Note: single pass; fixed memory.
41. The median k-mer count in a "sentence" is a good estimator of redundancy within the graph.
This gives us a reference-free measure of coverage.
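Putting the last two slides together, here is a toy, exact-counting version of digital normalization; a plain dict stands in for the CountMin sketch, and K and CUTOFF are arbitrary illustrative values:

```python
from collections import defaultdict
from statistics import median

K = 12
CUTOFF = 3
kmer_counts = defaultdict(int)

def estimated_coverage(read):
    """Median count of the read's k-mers: a reference-free coverage estimate."""
    return median(kmer_counts[read[i:i + K]]
                  for i in range(len(read) - K + 1))

def normalize(reads):
    kept = []
    for read in reads:                        # single pass over the data
        if estimated_coverage(read) < CUTOFF:
            for i in range(len(read) - K + 1):
                kmer_counts[read[i:i + K]] += 1
            kept.append(read)
        # else: discard -- this region of the graph is already covered
    return kept

# Ten identical reads: only the first CUTOFF copies survive.
print(len(normalize(["ATGGACCAGATGACAC"] * 10)))  # 3
```

Because reads are only counted when kept, memory grows with the underlying genome's k-mer content rather than with the dataset size.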
43. Contig assembly now scales with underlying genome size
Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
Memory efficiency is improved by use of the CountMin Sketch.
44. 3. Compressible de Bruijn graphs (NOVEL)
Each node represents a 14-mer;
Links between each node are 13-mer overlaps
45. Can store implicit de Bruijn graphs in a Bloom filter
[Diagram: the k-mers of AGTCGGCATGAC (k=6) – AGTCGG, GTCGGC, TCGGCA, CGGCAT, GGCATG, GCATGA, CATGAC – are inserted into a Bloom filter; edges are recovered implicitly by querying each one-base extension (…A, …C, …G, …T) for membership.]
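The idea can be sketched with a plain Python set standing in for the Bloom filter (same membership-query interface, but exact rather than probabilistic):

```python
def store_graph(seq, table, k=6):
    """Insert every k-mer of the sequence into the membership table."""
    for i in range(len(seq) - k + 1):
        table.add(seq[i:i + k])

def neighbors(kmer, table):
    """Recover edges implicitly: query all four one-base extensions."""
    return [kmer[1:] + base for base in "ACGT"
            if kmer[1:] + base in table]

table = set()  # a real implementation uses a Bloom filter here
store_graph("AGTCGGCATGAC", table)
print(neighbors("AGTCGG", table))  # ['GTCGGC']
```

No edges are stored at all: the graph structure is reconstructed on the fly from membership queries, which is what makes the representation so compact (at the cost of occasional false-positive edges).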
50. Equivalent to bond percolation problem; percolation
threshold independent of k (?)
51. This data structure is strikingly
efficient for storing sparse k-mer
graphs.
"Exact" is the best possible information-theoretic storage.
52. We implemented graph partitioning on top of this probabilistic de Bruijn graph.
Split reads into "bins" belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
53. Partitioning scales assembly for a
subset of problems.
Can be done in ~10x less memory than assembly.
Partition at low k and assemble exactly at any higher
k (DBG).
Partitions can then be assembled independently
Multiple processors -> scaling
Multiple k, coverage -> improved assembly
Multiple assembly packages (tailored to high
variation, etc.)
Can eliminate small partitions/contigs in the
partitioning phase.
An incredibly convenient approach enabling divide &
conquer approaches across the board.
54. Technical challenges met (and defeated)
Exhaustive in-memory traversal of graphs
containing 5-15 billion nodes.
Sequencing technology introduces false
connections in graph (Howe et al., in prep.)
Implementation lets us scale ~20x over other
approaches.
56. Our approaches yield a variety of strategies…
[Diagram: metagenomic data -> partitioning -> several independent assemblies; shotgun data -> digital normalization -> assembly.]
57. Concluding thoughts, thus far
Our approaches provide significant and
substantial practical and theoretical leverage to
one of the most challenging current problems in
computational biology: assembly.
They also improve quality of analysis, in some
cases.
They provide a path to the future:
Many-core compatible; distributable?
Decreased memory footprint => cloud computing
can be used for many analyses.
They are in use: ~dozens of labs are using digital normalization.
58. Future research
Many directions in the works! (see posted grant
props)
Theoretical groundwork for normalization
approach.
Graph search & alignment algorithms.
Error detection & correction.
Resequencing analysis.
Online (“infinite”) assembly.
60. Running HMMs over de Bruijn graphs (=> cross validation)
hmmgs: Assemble based on good-scoring HMM paths through the graph.
Independent of other assemblers; very sensitive, specific.
95% of hmmgs rplB domains are present in our partitioned assemblies.
Jordan Fish, Qiong Wang, and Jim Cole (RDP)
61. Side note: error correction is the
biggest “data” problem left in
sequencing.
Both for mapping & assembly.
62. Streaming error correction.
[Diagram: First pass, over all reads – does the read come from a high-coverage locus? If yes, error-correct low-abundance k-mers in the read; if no, add the read to the graph and save it for later. Second pass, over the saved reads only – does the read come from a now high-coverage locus? If yes, error-correct low-abundance k-mers in the read; if no, leave it unchanged.]
We can do error trimming of genomic, MDA, transcriptomic, and metagenomic data in < 2 passes, fixed memory.
We have just submitted a proposal to adapt Euler- or Quake-like error correction (e.g. spectral alignment).
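A toy version of the two-pass scheme (exact dict counts instead of a streaming counting structure; `correct` is a hypothetical stand-in, since the actual correction step is the subject of the proposal):

```python
from collections import defaultdict
from statistics import median

def two_pass_correct(reads, k=4, cutoff=3):
    counts = defaultdict(int)

    def coverage(read):
        # Median k-mer count, as in digital normalization.
        return median(counts[read[i:i + k]]
                      for i in range(len(read) - k + 1))

    def add_to_graph(read):
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1

    def correct(read):
        return read  # placeholder for real low-abundance k-mer correction

    saved, done = [], []
    for read in reads:                  # first pass: all reads
        if coverage(read) >= cutoff:    # high-coverage locus? correct now
            done.append(correct(read))
        else:                           # otherwise bank it for the second pass
            add_to_graph(read)
            saved.append(read)
    for read in saved:                  # second pass: saved reads only
        if coverage(read) >= cutoff:    # now high-coverage? correct
            done.append(correct(read))
        else:
            done.append(read)           # leave unchanged
    return done, len(saved)

done, n_saved = two_pass_correct(["ATGGACCAGATG"] * 5)
print(n_saved)  # 3 -- only the early, low-coverage reads are revisited
```

The key property is that only reads seen before their locus reached high coverage ever need a second look, so the second pass touches a small fraction of the data.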
Speaker notes
High coverage is essential.
Note, no tolerance for indels.
Note that any such measure will do.
Goal is to do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
Completely different style of assembler; useful for cross validation.