Streaming lossy compression of biological sequence
      data using probabilistic data structures

                  C. Titus Brown
                Assistant Professor
              CSE, MMG, BEACON
             Michigan State University
                   August 2012
                  ctb@msu.edu
Acknowledgements
Lab members involved
   Adina Howe (w/Tiedje)
   Jason Pell
   Arend Hintze
   Rosangela Canino-Koning
   Qingpeng Zhang
   Elijah Lowe
   Likit Preeyanon
   Jiarong Guo
   Tim Brom
   Kanchan Pavangadkar
   Eric McDonald
Collaborators
   Jim Tiedje, MSU
   Billie Swalla, UW
   Janet Jansson, LBNL
   Susannah Tringe, JGI
Funding
   USDA NIFA; NSF IOS; BEACON.
We practice open science!
        “Be the change you want”

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ("titus brown blog")
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
 Preprints: on arXiv, q-bio:
  "diginorm arxiv"
Shotgun metagenomics
 Collect samples;


 Extract DNA;


 Feed into sequencer;


 Computationally analyze.




                      Wikipedia: Environmental shotgun sequencing.p
Assembly
        It was the best of times, it was the wor
          , it was the worst of times, it was the
          isdom, it was the age of foolishness
        mes, it was the age of wisdom, it was th



It was the best of times, it was the worst of times, it was
     the age of wisdom, it was the age of foolishness

          …but for lots and lots of fragments!
Assemble based on word overlaps:




Repeats cause problems:
Sequencers also produce
errors…
         It was the Gest of times, it was the wor
            , it was the worst of timZs, it was the
            isdom, it was the age of foolisXness
           , it was the worVt of times, it was the
         mes, it was Ahe age of wisdom, it was th
          It was the best of times, it Gas the wor
         mes, it was the age of witdom, it was th
             isdom, it was tIe age of foolishness



It was the best of times, it was the worst of times, it was the
         age of wisdom, it was the age of foolishness
Shotgun sequencing & assembly
  Randomly fragment & sequence from DNA;
       reassemble computationally.




                     UMD assembly primer (cbcb.umd.edu)
Assembly – no subdivision!
Assembly is inherently an all by all process. There
   is no good way to subdivide the reads without
        potentially missing a key connection
         I am, of course, lying. There were no good ways…
Four main challenges for de novo
sequencing.
 Repeats.
 Low coverage.
 Errors.

               These introduce breaks in the
                  construction of contigs.

 Variation in coverage – transcriptomes and
  metagenomes, as well as amplified genomic DNA.

    This challenges the assembler to distinguish between
  erroneous connections (e.g. repeats) and real connections.
Repeats
 Overlaps don't place sequences uniquely when
 there are repeats present.




                              UMD assembly primer (cbcb.umd.edu)
Coverage
Easy calculation:

(# reads x avg read length) / genome size

So, for haploid human genome:

30m reads x 100 bp = 3 bn bases, i.e. ~1x coverage of a ~3 Gbp genome
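The calculation above can be written out as a small sketch. The function name and the round 3 Gbp figure for the haploid human genome are assumptions made for illustration, not values taken from the slides.

```python
def coverage(num_reads, avg_read_length, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return num_reads * avg_read_length / genome_size


# Assumed round figure for the haploid human genome (~3 Gbp).
HUMAN_HAPLOID_BP = 3_000_000_000

# 30 million reads of 100 bp each give 3 billion bases in total.
print(coverage(30_000_000, 100, HUMAN_HAPLOID_BP))  # 1.0
```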
Coverage
 “1x” doesn't mean every DNA sequence is read
  once.
 It means that, if sampling were systematic, it
  would be.
 Sampling isn't systematic, it's random!
Actual coverage varies widely from
the average.
Low coverage introduces unavoidable breaks.
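To make the point about random sampling concrete, here is a minimal simulation sketch. It assumes reads start at uniformly random positions on a toy genome (the sizes and the function name are illustrative choices). Under that assumption, per-base depth is approximately Poisson distributed, so at 1x average coverage roughly e^-1 (about 37%) of bases are expected to be missed entirely.

```python
import math
import random


def simulate_uncovered_fraction(genome_size, read_length, coverage, seed=0):
    """Place reads uniformly at random and return the fraction of bases never covered."""
    rng = random.Random(seed)
    num_reads = int(coverage * genome_size / read_length)
    depth = [0] * genome_size
    for _ in range(num_reads):
        start = rng.randrange(genome_size - read_length + 1)
        for pos in range(start, start + read_length):
            depth[pos] += 1
    return sum(1 for d in depth if d == 0) / genome_size


# Toy sizes chosen only to keep the simulation fast.
observed = simulate_uncovered_fraction(genome_size=200_000, read_length=50, coverage=1.0)
expected = math.exp(-1.0)  # Poisson probability of zero reads at mean depth 1
print(f"observed uncovered fraction: {observed:.3f}, Poisson expectation: {expected:.3f}")
```

Every uncovered base is a gap that no assembler can bridge, which is why low average coverage leads to unavoidable breaks.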
Two basic assembly approaches
 Overlap/layout/consensus
 De Bruijn or k-mer graphs




 The former is used for long reads, especially all
  Sanger-based assemblies. The latter is used because
             of its memory efficiency.
Overlap/layout/consensus
Essentially,
1. Calculate all overlaps (n^2)
2. Cluster based on overlap.
3. Do a multiple sequence alignment




                          UMD assembly primer (cbcb.umd.edu)
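Step 1 of overlap/layout/consensus can be sketched as follows. This is a naive illustration of the all-against-all comparison, assuming exact suffix/prefix matches only; real overlappers tolerate errors and use indexing to avoid the full quadratic scan. The function names and the minimum-overlap parameter are hypothetical.

```python
from itertools import permutations


def overlap_length(a, b, min_length):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_length."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_overlaps(reads, min_length):
    """Compare every ordered pair of reads: this is the n^2 step."""
    overlaps = {}
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_length)
        if length:
            overlaps[(a, b)] = length
    return overlaps


fragments = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
]
for (a, b), length in all_overlaps(fragments, min_length=5).items():
    print(f"{a!r} -> {b!r}: overlap of {length} characters")
```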
K-mer graph
  Break reads (of any length) down into multiple
        overlapping words of fixed length k.

ATGGACCAGATGACAC (k=12) =>

ATGGACCAGATG
 TGGACCAGATGA
  GGACCAGATGAC
   GACCAGATGACA
    ACCAGATGACAC
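The decomposition above takes one line of code. The helper name `kmers` is an illustrative choice.

```python
def kmers(sequence, k):
    """Return every overlapping word of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# A 16-base read yields 16 - 12 + 1 = 5 words of length 12.
for offset, word in enumerate(kmers("ATGGACCAGATGACAC", 12)):
    print(" " * offset + word)
```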
K-mer graphs - overlaps




                   J.R. Miller et al. / Genomics (2010)
K-mer graph (k=14)




         Each node represents a 14-mer;
    Links between each node are 13-mer overlaps
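A k-mer graph of this kind can be sketched with a plain dictionary: each key is a k-mer (node), and each value is the set of k-mers that follow it in some read, sharing a (k-1)-base overlap. This is an illustrative in-memory version with hypothetical names and a small k for readability; it ignores reverse complements and is not how a memory-efficient assembler stores the graph.

```python
def build_kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = {}
    for read in reads:
        # Register every k-mer as a node, including ones with no successor.
        for i in range(len(read) - k + 1):
            graph.setdefault(read[i:i + k], set())
        # Consecutive k-mers in a read overlap by k - 1 bases: add an edge.
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


# Two toy reads that share the 4-mer GGAC, so their paths join into one.
graph = build_kmer_graph(["ATGGAC", "GGACCA"], k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```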
K-mer graph (k=14)




 Branches in the graph represent partially overlapping sequences.
K-mer graph (k=14)




     Single nucleotide variations cause long branches
K-mer graph (k=14)




    Single nucleotide variations cause long branches;
                They don't rejoin quickly.
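The length of these branches follows from counting: a single substituted base sits inside k consecutive k-mers, so it can create up to k new nodes before the path rejoins. The sketch below checks this on a made-up reference sequence. The sequence is an assumption for illustration, and it deliberately contains no T so that every k-mer containing the substituted T is guaranteed to be new.

```python
def kmer_set(sequence, k):
    """Set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


K = 14
# Made-up reference with no T (see note above); 41 bases long.
reference = "ACGGACCAGAGCACAGGCAACGAGCCAGACGGAACCGAGCA"
position = len(reference) // 2
variant = reference[:position] + "T" + reference[position + 1:]

# Nodes present only in the variant form the branch in the graph.
branch_nodes = kmer_set(variant, K) - kmer_set(reference, K)
print(len(branch_nodes))  # 14, i.e. one new node per k-mer window covering the change
```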
K-mer graphs – choosing paths




For decisions about which paths to follow, biology-based
          heuristics come into play as well.
The computational conundrum


              More data => better.

and

 More data => computationally more challenging.
Reads vs edges (memory) in de Bruijn graphs




           Conway T C , Bromage A J Bioinformatics 2011;27:479-486


© The Author 2011. Published by Oxford University Press. All rights reserved. For
 Permissions, please email: journals.permissions@oup.com
The scale of the problem is stunning.
 I estimate a worldwide capacity for DNA sequencing
  of 15 petabases/yr (it's probably larger).
 Individual labs can generate ~100 Gbp in ~1 week for
  $10k.
 This sequencing is at a boutique level:
   Sequencing formats are semi-standard.
   Basic analysis approaches are ~80% cookbook.
   Every biological prep, problem, and analysis is different.
 Traditionally, biologists receive no training in
  computation. (And computational people receive no
  training in biology :)
 …and our computational infrastructure is optimizing
  for high performance computing, not high throughput.
My problems are also very
annoying…
 (From Monday seminar) Estimated ~50 Tbp needed to
  comprehensively sample the microbial
  composition of a gram of soil.
 Currently we have approximately 2 Tbp spread
  across 9 soil samples.

 Need 3 TB RAM on single chassis to do
  assembly of 300 Gbp.
 …estimate 500 TB RAM for 50 Tbp of sequence.


               That just won't do.
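The 500 TB figure is a linear extrapolation from the 300 Gbp data point. A short sketch of the arithmetic, assuming memory grows in proportion to the amount of sequence (the function name is illustrative):

```python
def extrapolate_ram_tb(known_ram_tb, known_gbp, target_gbp):
    """Scale a known memory requirement linearly with the amount of sequence."""
    return known_ram_tb * target_gbp / known_gbp


# 3 TB for 300 Gbp is about 10 bytes of RAM per base; 50 Tbp = 50,000 Gbp.
print(extrapolate_ram_tb(known_ram_tb=3, known_gbp=300, target_gbp=50_000))  # 500.0
```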
Theoretical => applied solutions.



Theoretical advances in data structures and algorithms
   => Practically useful & usable implementations, at scale.
   => Demonstrated effectiveness on real data.
Three parts to our solution.
1.   Adaptation of a suite of probabilistic data
     structures for representing set membership and
     counting (Bloom filters and CountMin Sketch).

2.   An online streaming approach to lossy
     compression.

3.   Compressible de Bruijn graph representation.
1. CountMin Sketch
    To add element: increment associated counter at all hash locales
    To get count: retrieve minimum counter across all hash locales




                       http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
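The two operations described above (increment at all hash locations to add, take the minimum to count) can be sketched as below. This is a generic CountMin Sketch, not the lab's implementation: the class name, the table sizes, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter: may overestimate, never underestimates."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _locations(self, item):
        # One independent-looking hash per row, obtained by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        """Increment the associated counter at every hash location."""
        for row, col in self._locations(item):
            self.tables[row][col] += 1

    def count(self, item):
        """Retrieve the minimum counter across all hash locations."""
        return min(self.tables[row][col] for row, col in self._locations(item))


sketch = CountMinSketch(width=1000, depth=4)
for kmer in ["ATGGACCAGATG", "ATGGACCAGATG", "ATGGACCAGATG", "TGGACCAGATGA"]:
    sketch.add(kmer)
print(sketch.count("ATGGACCAGATG"))  # at least 3; collisions can only inflate it
```

Memory use is fixed at width x depth counters regardless of how many distinct k-mers are seen; the price is that collisions can inflate a count.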
Our approach is very memory
efficient…
…and does not introduce significant
miscounts on NGS data sets.
2. Online, streaming, lossy compression.   (NOVEL)
      Much of next-gen sequencing is redundant.
Uneven coverage => even more redundancy   (NOVEL)


                         Suppose you have a
                      dilution factor of A (10) to
                      B(1). To get 10x of B you
                        need to get 100x of A!
                                Overkill!!

                       This 100x will consume
                      disk space and, because
                         of errors, memory.
Can we preferentially retain reads that contain “true edges”?




           Conway T C , Bromage A J Bioinformatics 2011;27:479-486


Downsample based on de Bruijn
graph structure; this can be derived
online.
Digital normalization algorithm

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    # otherwise: discard the read

              Note, single pass; fixed memory.
The median k-mer count in a “sentence” is a
good estimator of redundancy within the graph.
This gives us a reference-free measure of coverage.
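Putting the algorithm and the median estimator together gives the following runnable sketch. It is an illustration under stated assumptions, not the lab's implementation: an exact `collections.Counter` stands in for the fixed-memory CountMin Sketch, and the function names, `k`, and `cutoff` values are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping words of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k, cutoff):
    """Single pass: keep a read only if its estimated coverage is still below cutoff."""
    counts = Counter()  # exact stand-in; a CountMin Sketch would bound memory
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k, nothing to estimate
        # Median k-mer count so far = reference-free coverage estimate.
        if median(counts[word] for word in words) < cutoff:
            counts.update(words)
            kept.append(read)
        # otherwise: discard the read without counting it
    return kept


# Ten copies of one read plus one novel read: redundancy is capped at the cutoff.
reads = ["ATGGACCAGATGACAC"] * 10 + ["TTTTCCCCGGGGAAAATTTT"]
kept = digital_normalization(reads, k=12, cutoff=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Using the median rather than the mean keeps a read with one erroneous (count-zero) stretch of k-mers from looking like low coverage.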
Digital normalization retains information, while
discarding data and errors
Contig assembly now scales with underlying genome
size




    Transcriptomes, microbial genomes incl
    MDA, and most metagenomes can be assembled
    in under 50 GB of RAM, with identical or
    improved results.

    Memory efficiency is improved by use of CountMin
    Sketch.
3. Compressible de Bruijn graphs   (NOVEL)




          Each node represents a 14-mer;
     Links between each node are 13-mer overlaps
Can store implicit de Bruijn graphs in
a Bloom filter
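  Read:  AGTCGGCATGAC

  6-mers stored in the Bloom filter:
  AGTCGG
   GTCGGC
    TCGGCA
     CGGCAT
      GGCATG
       GCATGA
        CATGAC

  Traversal: start at AGTCGG and query the Bloom filter for
  each possible next base: …C, …A, …T, …G, …A, …C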
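The idea of an implicit graph can be sketched as follows: only the k-mers go into the Bloom filter, and edges are never stored. To find the successors of a node, drop its first base, append each of the four possible bases, and ask the filter whether that k-mer is present. The class below is a generic Bloom filter written for illustration (salted BLAKE2b hashes and one byte per bit for readability); the names and sizes are assumptions, and reverse complements are ignored.

```python
import hashlib


class BloomFilter:
    """Set membership with false positives but no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def successors(bloom, kmer):
    """Neighbors are implicit: test the four possible one-base extensions."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


K = 6
read = "AGTCGGCATGAC"
bloom = BloomFilter(num_bits=10_000, num_hashes=3)
for i in range(len(read) - K + 1):
    bloom.add(read[i:i + K])

# GTCGGC is always reported; a false positive could add a spurious extra neighbor.
print(successors(bloom, "AGTCGG"))
```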
False positives introduce false
nodes/edges.
              When does this start to distort the graph?
Average component size remains low
through 18% FPR.
Graph diameter remains constant
through 18% FPR.
Global graph structure is retained past
18% FPR


[Figure: graph structure at false positive rates of 1%, 5%, 10%, and 15%.]
Equivalent to bond percolation problem; percolation
threshold independent of k (?)
This data structure is strikingly
efficient for storing sparse k-mer
graphs.




       “Exact” is for best possible information-theoretical storage.
We implemented graph partitioning
     on top of this probabilistic de Bruijn
     graph.


Split reads into “bins”
 belonging to
 different source
 species.
Can do this based
 almost entirely on
 connectivity of
 sequences.
Partitioning scales assembly for a
subset of problems.
 Can be done in ~10x less memory than assembly.
 Partition at low k and assemble exactly at any higher
  k (DBG).
 Partitions can then be assembled independently
   Multiple processors -> scaling
   Multiple k, coverage -> improved assembly
   Multiple assembly packages (tailored to high
    variation, etc.)

 Can eliminate small partitions/contigs in the
  partitioning phase.
 An incredibly convenient approach enabling divide &
  conquer approaches across the board.
Technical challenges met (and defeated)
 Exhaustive in-memory traversal of graphs
 containing 5-15 billion nodes.

 Sequencing technology introduces false
 connections in graph (Howe et al., in prep.)

 Implementation lets us scale ~20x over other
 approaches.
Minia assembler
(minia.genouest.org)




                        Chikhi thesis presentation
Our approaches yield a variety of
strategies…
   Metagenomic data -> Partitioning -> multiple independent Assembly runs

   Shotgun data -> Digital normalization -> Shotgun data -> Assembly
Concluding thoughts, thus far
 Our approaches provide significant and
  substantial practical and theoretical leverage to
  one of the most challenging current problems in
  computational biology: assembly.
 They also improve quality of analysis, in some
  cases.
 They provide a path to the future:
   Many-core compatible; distributable?
   Decreased memory footprint => cloud computing
   can be used for many analyses.
 They are in use: ~dozens of labs are using digital
 normalization.
Future research
Many directions in the works! (see posted grant
proposals)

 Theoretical groundwork for normalization
    approach.
   Graph search & alignment algorithms.
   Error detection & correction.
   Resequencing analysis.
   Online (“infinite”) assembly.
Streaming Twitter analysis.
Running HMMs over de Bruijn graphs
 (=> cross validation)


                                           hmmgs: Assemble
                                            based on good-scoring
                                            HMM paths through the
                                            graph.
                                           Independent of other
                                            assemblers; very
                                            sensitive, specific.
                                           95% of hmmgs rplB
                                            domains are present in
                                            our partitioned
                                            assemblies.
Jordan Fish, Qiong Wang, and Jim Cole (RDP)
Side note: error correction is the
biggest “data” problem left in
sequencing.




        Both for mapping & assembly.
Streaming error correction.
First pass (all reads):
  Does read come from a high-coverage locus?
    Yes! -> Error-correct low-abundance k-mers in read.
    No!  -> Add read to graph and save for later.

Second pass (only saved reads):
  Does read come from a now high-coverage locus?
    Yes! -> Error-correct low-abundance k-mers in read.
    No!  -> Leave unchanged.



             We can do error trimming of
genomic, MDA, transcriptomic, metagenomic data in < 2
                passes, fixed memory.
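The two-pass scheme can be sketched as below. This is an illustration under several assumptions rather than the lab's code: it trims a read at its first low-abundance k-mer instead of correcting it, it uses an exact `Counter` and an in-memory list of saved reads (so it is not fixed-memory as written), and all names and thresholds are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def is_high_coverage(read, counts, k, coverage_cutoff):
    words = kmers(read, k)
    return bool(words) and median(counts[w] for w in words) >= coverage_cutoff


def trim_low_abundance(read, counts, k, min_count):
    """Truncate the read just before the first k-mer seen fewer than min_count times."""
    for i, word in enumerate(kmers(read, k)):
        if counts[word] < min_count:
            return read[:i + k - 1]
    return read


def streaming_trim(reads, k, coverage_cutoff, min_count):
    counts = Counter()  # stand-in for a fixed-memory counting structure
    output, saved = [], []
    # First pass over all reads.
    for read in reads:
        if is_high_coverage(read, counts, k, coverage_cutoff):
            output.append(trim_low_abundance(read, counts, k, min_count))
        else:
            counts.update(kmers(read, k))  # add read to graph
            saved.append(read)             # and save for later
    # Second pass over the saved reads only.
    for read in saved:
        if is_high_coverage(read, counts, k, coverage_cutoff):
            output.append(trim_low_abundance(read, counts, k, min_count))
        else:
            output.append(read)  # still low coverage: leave unchanged
    return output


clean = "ATGGACCAGATGACAC"
with_error = "ATGGACCAGATGACAT"  # last base differs from the clean read
result = streaming_trim([clean] * 5 + [with_error], k=12, coverage_cutoff=3, min_count=2)
print(result)
```

Only the reads that arrive before a locus reaches the coverage cutoff are revisited, which is why the total work stays below two full passes over the data.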
  We have just submitted a proposal to adapt Euler or
  Quake-like error correction (e.g. spectral alignment
2012 talk to CSE department at U. Arizona

Maximizing Board Effectiveness 2024 Webinar.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

2012 talk to CSE department at U. Arizona

  • 1. Streaming lossy compression of biological sequence data using probabilistic data structures C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University August 2012 ctb@msu.edu
  • 2. Acknowledgements. Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald. Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI. Funding: USDA NIFA; NSF IOS; BEACON.
  • 3.
  • 4. We practice open science! “Be the change you want” Everything discussed here: Code: github.com/ged-lab/ ; BSD license. Blog: http://ivory.idyll.org/blog ('titus brown blog'). Twitter: @ctitusbrown. Grants on Lab Web site: http://ged.msu.edu/interests.html. Preprints: on arXiv, q-bio: 'diginorm arxiv'.
  • 5. Shotgun metagenomics. Collect samples; extract DNA; feed into sequencer; computationally analyze. Wikipedia: Environmental shotgun sequencing.p
  • 6. Assembly It was the best of times, it was the wor , it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness …but for lots and lots of fragments!
  • 7. Assemble based on word overlaps: Repeats cause problems:
  • 8. Sequencers also produce errors… It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
  • 9. Shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  • 10. Assembly – no subdivision! Assembly is inherently an all by all process. There is no good way to subdivide the reads without potentially missing a key connection
  • 11. Assembly – no subdivision! Assembly is inherently an all by all process. There is no good way to subdivide the reads without potentially missing a key connection I am, of course, lying. There were no good ways…
  • 12. Four main challenges for de novo sequencing. Repeats. Low coverage. Errors. These introduce breaks in the construction of contigs. Variation in coverage – transcriptomes and metagenomes, as well as amplified genomic. This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.
  • 13. Repeats. Overlaps don't place sequences uniquely when there are repeats present. UMD assembly primer (cbcb.umd.edu)
  • 14. Coverage. Easy calculation: (# reads x avg read length) / genome size. So, for haploid human genome: 30m reads x 100 bp = 3 bn bp, i.e. about 1x coverage of a ~3 Gbp genome.
  • 15. Coverage. “1x” doesn't mean every DNA sequence is read once. It means that, if sampling were systematic, it would be. Sampling isn't systematic, it's random!
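
The coverage arithmetic on slides 14 and 15 can be sketched in a few lines. This is an illustrative sketch, not code from the talk: the function names are made up here, the 3 Gbp genome size is a round-number assumption, and the "fraction missed" estimate assumes reads land uniformly at random so that per-base depth is roughly Poisson distributed.

```python
import math

def average_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """(# reads x avg read length) / genome size, as on slide 14."""
    return n_reads * read_length / genome_size

def expected_fraction_missed(coverage: float) -> float:
    """Under a Poisson model of random sampling, P(depth == 0) = exp(-coverage)."""
    return math.exp(-coverage)

# Assumed round numbers: 30 million reads of 100 bp, ~3 Gbp haploid genome.
cov = average_coverage(30_000_000, 100, 3_000_000_000)
print(f"average coverage: {cov:.1f}x")
print(f"expected fraction of bases never sequenced: {expected_fraction_missed(cov):.3f}")
```

Under these assumptions, "1x" average coverage still leaves roughly a third of the genome unsampled, which is why random sampling needs far more than 1x to avoid breaks.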
  • 16. Actual coverage varies widely from the average.
  • 17. Actual coverage varies widely from the average. Low coverage introduces unavoidable breaks.
  • 18. Two basic assembly approaches: Overlap/layout/consensus; De Bruijn or k-mer graphs. The former is used for long reads, esp all Sanger-based assemblies. The latter is used because of memory efficiency.
  • 19. Overlap/layout/consensus Essentially, 1. Calculate all overlaps (n^2) 2. Cluster based on overlap. 3. Do a multiple sequence alignment UMD assembly primer (cbcb.umd.edu)
  • 20. K-mer graph Break reads (of any length) down into multiple overlapping words of fixed length k. ATGGACCAGATGACAC (k=12) => ATGGACCAGATG TGGACCAGATGA GGACCAGATGAC GACCAGATGACA ACCAGATGACAC
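
The decomposition on slide 20 can be reproduced with a short sketch. The function name `kmers` is chosen here for illustration; the sequence and k come from the slide.

```python
from typing import List

def kmers(seq: str, k: int) -> List[str]:
    """Return every overlapping word of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

words = kmers("ATGGACCAGATGACAC", 12)
for word in words:
    print(word)

# Consecutive k-mers overlap by k-1 bases; these overlaps become graph edges.
for left, right in zip(words, words[1:]):
    assert left[1:] == right[:-1]
```

A read of length L yields L - k + 1 words, so the 16-base read above gives the five 12-mers listed on the slide.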
  • 21. K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)
  • 22. K-mer graph (k=14) Each node represents a 14-mer; Links between each node are 13-mer overlaps
  • 23. K-mer graph (k=14) Branches in the graph represent partially overlapping sequences.
  • 24. K-mer graph (k=14) Single nucleotide variations cause long branches
  • 25. K-mer graph (k=14) Single nucleotide variations cause long branches; They don't rejoin quickly.
  • 26. K-mer graphs – choosing paths For decisions about which paths etc, biology-based heuristics come into play as well.
  • 27. The computational conundrum More data => better. and More data => computationally more challenging.
  • 28. Reads vs edges (memory) in de Bruijn graphs Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
  • 29. The scale of the problem is stunning. I estimate a worldwide capacity for DNA sequencing of 15 petabases/yr (it's probably larger). Individual labs can generate ~100 Gbp in ~1 week for $10k. This sequencing is at a boutique level: Sequencing formats are semi-standard. Basic analysis approaches are ~80% cookbook. Every biological prep, problem, and analysis is different. Traditionally, biologists receive no training in computation. (And computational people receive no training in biology :) …and our computational infrastructure is optimizing for high performance computing, not high throughput.
  • 30. My problems are also very annoying… (From Monday seminar) Est ~50 Tbp to comprehensively sample the microbial composition of a gram of soil. Currently we have approximately 2 Tbp spread across 9 soil samples. Need 3 TB RAM on single chassis to do assembly of 300 Gbp. …estimate 500 TB RAM for 50 Tbp of sequence. That just won't do.
  • 31. Theoretical => applied solutions. Theoretical advances in data structures and algorithms; practically useful & usable implementations, at scale; demonstrated effectiveness on real data.
  • 32. Three parts to our solution. 1. Adaptation of a suite of probabilistic data structures for representing set membership and counting (Bloom filters and CountMin Sketch). 2. An online streaming approach to lossy compression. 3. Compressible de Bruijn graph representation.
  • 33. 1. CountMin Sketch. To add element: increment associated counter at all hash locales. To get count: retrieve minimum counter across all hash locales. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
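
The add and get-count rules on slide 33 can be sketched as follows. This is a minimal illustration, not the lab's implementation: the class name, the `width`/`depth` parameters, and the use of salted BLAKE2b as the hash family are all assumptions made for the example.

```python
import hashlib

class CountMinSketch:
    """Fixed-memory approximate counter: `depth` rows of `width` counters each."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _locations(self, item: str):
        # One hash location per row; the row index is used as the hash salt.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item: str) -> None:
        # To add: increment the counter at every hash location.
        for row, col in self._locations(item):
            self.tables[row][col] += 1

    def count(self, item: str) -> int:
        # To query: take the minimum counter across all hash locations.
        return min(self.tables[row][col] for row, col in self._locations(item))

sketch = CountMinSketch(width=1000, depth=4)
for kmer in ["ATGG", "ATGG", "ATGG", "CCAG"]:
    sketch.add(kmer)
print(sketch.count("ATGG"), sketch.count("CCAG"), sketch.count("TTTT"))
```

Collisions can only inflate a counter, so the reported count is never below the true count; taking the minimum across rows limits how far above it can be.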
  • 34. Our approach is very memory efficient…
  • 35. …and does not introduce significant miscounts on NGS data sets.
  • 36. 2. Online, streaming, lossy (NOVEL) compression. Much of next-gen sequencing is redundant.
  • 37. Uneven coverage => even more (NOVEL) redundancy Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
  • 38. Can we preferentially retain reads that contain “true edges”? Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
  • 39. Downsample based on de Bruijn graph structure; this can be derived online.
  • 40. Digital normalization algorithm
        for read in dataset:
            if estimated_coverage(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                pass  # discard read
        Note: single pass; fixed memory.
  • 41. The median k-mer count in a “sentence” is a good estimator of redundancy within the graph. This gives us a reference-free measure of coverage.
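
A runnable sketch of slides 40 and 41 together, under stated assumptions: the names `estimated_coverage` and `normalize`, and the toy values `K = 4` and `CUTOFF = 2`, are chosen for illustration only. An exact `Counter` stands in for the k-mer counting structure, so this version does not have the fixed-memory property claimed on the slide; swapping in a CountMin Sketch would be needed for that.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List

K = 4        # toy k-mer size (assumption; real data would use a much larger k)
CUTOFF = 2   # toy coverage cutoff (assumption)

def kmers(seq: str, k: int = K) -> List[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def estimated_coverage(read: str, counts: Counter) -> float:
    """Median count of the read's k-mers among the reads kept so far."""
    words = kmers(read)
    if not words:          # read shorter than k: treat as unseen
        return 0
    return median(counts[w] for w in words)

def normalize(reads: Iterable[str], cutoff: int = CUTOFF) -> List[str]:
    counts: Counter = Counter()
    kept = []
    for read in reads:                      # single pass over the stream
        if estimated_coverage(read, counts) < cutoff:
            counts.update(kmers(read))      # only kept reads update the counts
            kept.append(read)
        # otherwise the read is redundant and is discarded
    return kept

reads = ["ATGGACCAGA"] * 5 + ["TTTTGGGGCC"]
print(normalize(reads))
```

In this toy run the five identical reads collapse to two (the cutoff), while the read from a so-far-unseen region is kept, which is the intended behavior: redundancy is discarded and novelty is retained.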
  • 42. Digital normalization retains information, while discarding data and errors
  • 43. Contig assembly now scales with underlying genome size. Transcriptomes, microbial genomes incl MDA, and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results. Memory efficiency is improved by use of CountMin Sketch.
  • 44. (NOVEL) 3. Compressible de Bruijn graphs Each node represents a 14-mer; Links between each node are 13-mer overlaps
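  • 45. Can store implicit de Bruijn graphs in a Bloom filter. [Figure: the sequence AGTCGGCATGAC is broken into 6-mers (AGTCGG, GTCGGC, TCGGCA, CGGCAT, GGCATG, GCATGA, CATGAC), which are stored in a Bloom filter, shown with candidate one-base extensions (…A, …C, …G, …T).]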
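
The idea on slide 45 can be sketched as follows: only the k-mers (nodes) are stored, and edges are implied by asking the filter which one-base extensions are present. This is an illustration under assumptions, not the lab's code: the class, the bit and hash counts, and the salted BLAKE2b hashing are made up for the example, and reverse complements are ignored for simplicity.

```python
import hashlib
from typing import List

class BloomFilter:
    """Set-membership filter: no false negatives, tunable false-positive rate."""

    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str):
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def successors(kmer: str, bloom: BloomFilter) -> List[str]:
    """Implicit edges: try each one-base extension and keep those in the filter."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

k = 6
seq = "AGTCGGCATGAC"
bloom = BloomFilter(n_bits=8192, n_hashes=3)
for i in range(len(seq) - k + 1):
    bloom.add(seq[i:i + k])

print(successors("AGTCGG", bloom))
```

A false positive in the filter shows up as a spurious neighbor in `successors`, which is the distortion the following slides quantify.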
  • 46. False positives introduce false nodes/edges. When does this start to distort the graph?
  • 47. Average component size remains low through 18% FPR.
  • 48. Graph diameter remains constant through 18% FPR.
  • 49. Global graph structure is retained past 18% FPR. [Figure panels: 1%, 5%, 10%, 15% FPR]
  • 50. Equivalent to bond percolation problem; percolation threshold independent of k (?)
  • 51. This data structure is strikingly efficient for storing sparse k-mer graphs. “Exact” is for best possible information-theoretical storage.
  • 52. We implemented graph partitioning on top of this probabilistic de Bruijn graph. Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences.
  • 53. Partitioning scales assembly for a subset of problems. Can be done in ~10x less memory than assembly. Partition at low k and assemble exactly at any higher k (DBG). Partitions can then be assembled independently: Multiple processors -> scaling; Multiple k, coverage -> improved assembly; Multiple assembly packages (tailored to high variation, etc.). Can eliminate small partitions/contigs in the partitioning phase. An incredibly convenient approach enabling divide & conquer approaches across the board.
  • 54. Technical challenges met (and defeated). Exhaustive in-memory traversal of graphs containing 5-15 billion nodes. Sequencing technology introduces false connections in graph (Howe et al., in prep.) Implementation lets us scale ~20x over other approaches.
  • 55. Minia assembler (minia.genouest.org) Chikhi thesis presentation
  • 56. Our approaches yield a variety of strategies… [Workflow diagram with nodes: shotgun data, digital normalization, metagenomic data, partitioning, and several assembly steps.]
  • 57. Concluding thoughts, thus far. Our approaches provide significant and substantial practical and theoretical leverage to one of the most challenging current problems in computational biology: assembly. They also improve quality of analysis, in some cases. They provide a path to the future: Many-core compatible; distributable? Decreased memory footprint => cloud computing can be used for many analyses. They are in use, ~dozens of labs using digital normalization.
  • 58. Future research. Many directions in the works! (see posted grant props) Theoretical groundwork for normalization approach. Graph search & alignment algorithms. Error detection & correction. Resequencing analysis. Online (“infinite”) assembly.
  • 60. Running HMMs over de Bruijn graphs (=> cross validation). hmmgs: Assemble based on good-scoring HMM paths through the graph. Independent of other assemblers; very sensitive, specific. 95% of hmmgs rplB domains are present in our partitioned assemblies. Jordan Fish, Qiong Wang, and Jim Cole (RDP)
  • 61. Side note: error correction is the biggest “data” problem left in sequencing. Both for mapping & assembly.
  • 62. Streaming error correction. [Flowchart] First pass (all reads): Does read come from a high-coverage locus? Yes: error-correct low-abundance k-mers in read. No: add read to graph and save for later. Second pass (only saved reads): Does read come from a now high-coverage locus? Yes: error-correct low-abundance k-mers in read. No: leave unchanged. We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2 passes, fixed memory. We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment

Editor's notes

  1. High coverage is essential.
  2. High coverage is essential.
  3. Note, no tolerance for indels
  4. Note that any such measure will do.
  5. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  6. Completely different style of assembler; useful for cross validation.