Streaming lossy compression of biological sequence
      data using probabilistic data structures

                  C. Titus Brown
                Assistant Professor
              CSE, MMG, BEACON
             Michigan State University
                   August 2012
                  ctb@msu.edu
Acknowledgements
Lab members involved
   Adina Howe (w/Tiedje)
   Jason Pell
   Arend Hintze
   Rosangela Canino-Koning
   Qingpeng Zhang
   Elijah Lowe
   Likit Preeyanon
   Jiarong Guo
   Tim Brom
   Kanchan Pavangadkar
   Eric McDonald
Collaborators
   Jim Tiedje, MSU
   Billie Swalla, UW
   Janet Jansson, LBNL
   Susannah Tringe, JGI
Funding
   USDA NIFA; NSF IOS; BEACON.
We practice open science!
        “Be the change you want”

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ("titus brown blog")
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
 Preprints: on arXiv, q-bio:
  "diginorm arxiv"
Shotgun metagenomics
 Collect samples;


 Extract DNA;


 Feed into sequencer;


 Computationally analyze.




                      Wikipedia: Environmental shotgun sequencing.p
Assembly
        It was the best of times, it was the wor
          , it was the worst of times, it was the
          isdom, it was the age of foolishness
        mes, it was the age of wisdom, it was th



It was the best of times, it was the worst of times, it was
     the age of wisdom, it was the age of foolishness

but for lots and lots of fragments!
Assemble based on word overlaps:




Repeats cause problems:
Sequencers also produce
errors

         It was the Gest of times, it was the wor
            , it was the worst of timZs, it was the
            isdom, it was the age of foolisXness
           , it was the worVt of times, it was the
         mes, it was Ahe age of wisdom, it was th
          It was the best of times, it Gas the wor
         mes, it was the age of witdom, it was th
             isdom, it was tIe age of foolishness



It was the best of times, it was the worst of times, it was the
         age of wisdom, it was the age of foolishness
Shotgun sequencing & assembly
  Randomly fragment & sequence from DNA;
       reassemble computationally.




                     UMD assembly primer (cbcb.umd.edu)
Assembly – no subdivision!
Assembly is inherently an all by all process. There
   is no good way to subdivide the reads without
        potentially missing a key connection
         I am, of course, lying. There were no good ways

Four main challenges for de novo
sequencing.
 Repeats.
 Low coverage.
 Errors.

               These introduce breaks in the
                  construction of contigs.

 Variation in coverage – transcriptomes and
  metagenomes, as well as amplified genomic.

    This challenges the assembler to distinguish between
  erroneous connections (e.g. repeats) and real connections.
Repeats
 Overlaps don't place sequences uniquely when
 there are repeats present.




                              UMD assembly primer (cbcb.umd.edu)
Coverage
Easy calculation:

(# reads x avg read length) / genome size

So, for haploid human genome:

30m reads x 100 bp = 3 bn bp sequenced, i.e. ~1x coverage of a ~3 Gbp genome
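
As a minimal sketch of the calculation above (the function name and the rounded 3 Gbp genome size are illustrative assumptions, not taken from the slides):

```python
def average_coverage(num_reads: int, avg_read_length: int, genome_size: int) -> float:
    """Average coverage = total sequenced bases / genome size."""
    return (num_reads * avg_read_length) / genome_size


# Slide example: 30 million reads of 100 bp against a haploid human genome,
# rounded here to 3 Gbp.
HUMAN_HAPLOID_BP = 3_000_000_000
coverage = average_coverage(30_000_000, 100, HUMAN_HAPLOID_BP)
print(f"{coverage:.1f}x")
```

By the same formula, 30x average coverage of that genome would take roughly 900 million 100 bp reads.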
Coverage
 “1x” doesn't mean every DNA sequence is read
  once.
 It means that, if sampling were systematic, it
  would be.
 Sampling isn't systematic, it's random!
Actual coverage varies widely from
the average.
Low coverage introduces unavoidable breaks.
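
A small simulation can illustrate the point. This is only a sketch under stated assumptions: read start positions are drawn uniformly at random on a linear toy genome, and the sizes are chosen so that the average coverage is exactly 1x. Under the usual Poisson approximation one would expect roughly e^-1 (about 37%) of bases to be missed entirely at 1x.

```python
import random


def simulate_depth(genome_size, read_length, num_reads, seed=0):
    """Per-base depth when read start positions are drawn uniformly at random."""
    rng = random.Random(seed)
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_size + 1)
    for _ in range(num_reads):
        start = rng.randrange(genome_size - read_length + 1)
        diff[start] += 1
        diff[start + read_length] -= 1
    depth, running = [], 0
    for delta in diff[:genome_size]:
        running += delta
        depth.append(running)
    return depth


# Toy numbers (assumed for illustration): 1,000 reads x 100 bp on 100 kbp = 1x.
depth = simulate_depth(genome_size=100_000, read_length=100, num_reads=1_000)
mean_depth = sum(depth) / len(depth)
uncovered = depth.count(0) / len(depth)
print(f"mean depth: {mean_depth:.2f}x")
print(f"bases never sequenced: {uncovered:.1%}")
print(f"deepest base: {max(depth)}x")
```

Every uncovered stretch is a place where contigs must break, no matter how good the assembler is.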
Two basic assembly approaches
 Overlap/layout/consensus
 De Bruijn or k-mer graphs




 The former is used for long reads, esp all Sanger-
  based assemblies. The latter is used because of
                 memory efficiency.
Overlap/layout/consensus
Essentially,
1. Calculate all overlaps (n^2)
2. Cluster based on overlap.
3. Do a multiple sequence alignment




                          UMD assembly primer (cbcb.umd.edu)
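
Step 1 can be sketched as a naive all-against-all suffix/prefix comparison. This is an illustration of why the step is quadratic in the number of reads, not how production overlap assemblers do it; the function names and the `min_overlap` parameter are assumptions made for the example.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_overlaps(reads, min_overlap=3):
    """Compare every ordered pair of reads: n * (n - 1) comparisons."""
    overlaps = {}
    for a, b in permutations(reads, 2):
        n = overlap_length(a, b, min_overlap)
        if n:
            overlaps[(a, b)] = n
    return overlaps


reads = ["ATGGAC", "GACCAG", "CAGATG"]
for (a, b), n in all_overlaps(reads).items():
    print(f"{a} -> {b}: {n} bp overlap")
```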
K-mer graph
  Break reads (of any length) down into multiple
        overlapping words of fixed length k.

ATGGACCAGATGACAC (k=12) =>

ATGGACCAGATG
 TGGACCAGATGA
  GGACCAGATGAC
   GACCAGATGACA
    ACCAGATGACAC
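
The decomposition above can be reproduced with a few lines of Python (a minimal sketch; the helper name `kmers` is an assumption made for the example):

```python
def kmers(sequence, k):
    """All overlapping words of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Same read and k as on the slide; indent each k-mer by its offset.
for offset, kmer in enumerate(kmers("ATGGACCAGATGACAC", 12)):
    print(" " * offset + kmer)
```

A read of length L yields L - k + 1 k-mers, so the 16 bp read gives 5 words at k=12.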
K-mer graphs - overlaps




                   J.R. Miller et al. / Genomics (2010)
K-mer graph (k=14)




         Each node represents a 14-mer;
    Links between each node are 13-mer overlaps
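
A minimal sketch of such a graph, assuming a toy k of 5 instead of 14 and ignoring reverse complements (both simplifications for the example, as are the function names). Nodes are k-mers, and two nodes are linked when the (k-1)-suffix of one equals the (k-1)-prefix of the other. The two example reads differ by a single base, which produces one branch point.

```python
from collections import defaultdict


def build_kmer_graph(reads, k):
    """Map each k-mer to the k-mers that follow it with a (k-1)-base overlap."""
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
    by_prefix = defaultdict(set)
    for kmer in nodes:
        by_prefix[kmer[:-1]].add(kmer)
    return {kmer: sorted(by_prefix.get(kmer[1:], ())) for kmer in nodes}


def branch_points(graph):
    """Nodes with more than one successor."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


reads = [
    "ATGGACCAGATGACAC",
    "ATGGACCTGATGACAC",  # same sequence with one base changed (A -> T)
]
graph = build_kmer_graph(reads, k=5)
for node in branch_points(graph):
    print(node, "->", graph[node])
```

In this toy case each side of the branch is k nodes long before the two paths meet again, which is why a single-base difference shows up as a comparatively long detour in the graph.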
K-mer graph (k=14)




 Branches in the graph represent partially overlapping sequences.
K-mer graph (k=14)




     Single nucleotide variations cause long branches
                They don't rejoin quickly.
K-mer graphs – choosing paths




For decisions about which paths etc, biology-based
          heuristics come into play as well.
The computational conundrum


              More data => better.

and

 More data => computationally more challenging.
Reads vs edges (memory) in de Bruijn graphs




           Conway T C , Bromage A J Bioinformatics 2011;27:479-486


© The Author 2011. Published by Oxford University Press. All rights reserved. For
 Permissions, please email: journals.permissions@oup.com
The scale of the problem is stunning.
 I estimate a worldwide capacity for DNA sequencing
  of 15 petabases/yr (it's probably larger).
 Individual labs can generate ~100 Gbp in ~1 week for
  $10k.
 This sequencing is at a boutique level:
   Sequencing formats are semi-standard.
   Basic analysis approaches are ~80% cookbook.
   Every biological prep, problem, and analysis is different.
 Traditionally, biologists receive no training in
  computation. (And computational people receive no
  training in biology :)

and our computational infrastructure is optimizing
  for high performance computing, not high throughput.
My problems are also very
annoying

 (From Monday seminar) Est ~50 Tbp to
  comprehensively sample the microbial
  composition of a gram of soil.
 Currently we have approximately 2 Tbp spread
  across 9 soil samples.

 Need 3 TB RAM on single chassis to do
  assembly of 300 Gbp.

 => estimate 500 TB RAM for 50 Tbp of sequence.
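
The estimate follows from a straight linear extrapolation, sketched below. Linear scaling of memory with sequence volume is the assumption implied by the numbers on the slide; the function name is made up for the example.

```python
def extrapolate_ram_tb(known_ram_tb, known_gbp, target_gbp):
    """Scale a known memory requirement linearly with the amount of sequence."""
    return known_ram_tb * target_gbp / known_gbp


# 3 TB of RAM for 300 Gbp, extrapolated to 50 Tbp (= 50,000 Gbp).
print(extrapolate_ram_tb(known_ram_tb=3, known_gbp=300, target_gbp=50_000), "TB")
```

That works out to about 10 bytes of RAM per sequenced base.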


               That just won't do.
Theoretical => applied solutions.



Theoretical advances in data structures and algorithms
  => Practically useful & usable implementations, at scale.
  => Demonstrated effectiveness on real data.
Three parts to our solution.
1.   Adaptation of a suite of probabilistic data
     structures for representing set membership and
     counting (Bloom filters and CountMin Sketch).

2.   An online streaming approach to lossy
     compression.

3.   Compressible de Bruijn graph representation.
1. CountMin Sketch
    To add element: increment associated counter at all hash locales
    To get count: retrieve minimum counter across all hash locales




                       http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
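
The two operations above can be sketched in a few lines. This is an illustrative toy, not the lab's implementation: the class name, the table sizes, and the use of salted BLAKE2b as the hash family are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """`depth` rows of `width` counters, one hash function per row."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _locations(self, item):
        # One hash location per row; the row index is used as the hash salt.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        """Increment the associated counter at every hash location."""
        for row, col in self._locations(item):
            self.tables[row][col] += 1

    def count(self, item):
        """Return the minimum counter across all hash locations."""
        return min(self.tables[row][col] for row, col in self._locations(item))


cms = CountMinSketch(width=1000, depth=4)
for _ in range(3):
    cms.add("ATGGACCAGATG")
cms.add("TGGACCAGATGA")
print(cms.count("ATGGACCAGATG"), cms.count("TGGACCAGATGA"))
```

Collisions can only inflate a counter, so the reported count is never below the true count; taking the minimum across rows keeps the overestimate small. Memory is fixed at width x depth counters regardless of how many distinct k-mers are seen.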
Our approach is very memory
efficient


and does not introduce significant
miscounts on NGS data sets.
2. Online, streaming, lossy compression. (NOVEL)
      Much of next-gen sequencing is redundant.
Uneven coverage => even more redundancy (NOVEL)


                         Suppose you have a
                      dilution factor of A (10) to
                      B(1). To get 10x of B you
                        need to get 100x of A!
                                Overkill!!

                       This 100x will consume
                      disk space and, because
                         of errors, memory.
Can we preferentially retain reads that contain “true
                                    edges”?




           Conway T C , Bromage A J Bioinformatics 2011;27:479-486


Downsample based on de Bruijn
graph structure; this can be derived
online.
Digital normalization algorithm

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)   # only retained reads update the counts
        save(read)
    else:
        pass                       # discard read: already covered to CUTOFF

              Note, single pass; fixed memory.
The median k-mer count in a “sentence” is a
good estimator of redundancy within the graph.
                                   This gives us a
                                   reference-free
                                     measure of
                                      coverage.
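
Putting the algorithm and the median estimator together gives the following runnable sketch. It is a toy under stated assumptions: an exact `Counter` stands in for the CountMin Sketch, k and the cutoff are tiny illustrative values, and the function names are invented for the example.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=4, cutoff=3):
    """Yield reads whose estimated coverage is still below `cutoff`.

    Single pass over the input. `counts` is an exact Counter here; swapping in
    a fixed-size counting sketch would bound memory.
    """
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        # Median k-mer count so far = reference-free coverage estimate.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read
        # Otherwise the read is discarded and the counts are left untouched.


# A 10x-covered sequence and a 2x-covered one, as repeated identical reads.
reads = ["ATGGACCAGATG"] * 10 + ["TTTTCCCCGGGG"] * 2
kept = list(digital_normalization(reads))
print(f"kept {len(kept)} of {len(reads)} reads")
```

The abundant sequence is capped at the cutoff while the rare one is kept in full, which is the behaviour the uneven-coverage slides call for.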
Digital normalization retains information, while
discarding data and errors
Contig assembly now scales with underlying genome
size




    Transcriptomes, microbial genomes incl
    MDA, and most metagenomes can be assembled
    in under 50 GB of RAM, with identical or
    improved results.

    Memory efficiency is improved by use of CountMin
    Sketch.
3. Compressible de Bruijn graphs (NOVEL)




          Each node represents a 14-mer;
     Links between each node are 13-mer overlaps
Can store implicit de Bruijn graphs in
a Bloom filter
  AGTCGGCATGAC  (read; its 6-mers are stored in the Bloom filter)
  AGTCGG
   GTCGGC
    TCGGCA
     CGGCAT
      GGCATG
       GCATGA
        CATGAC

  Traversal from the Bloom filter: AGTCGG -> C -> A -> T -> G -> A -> C
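
The idea can be sketched as follows: only k-mer membership is stored, and edges are never stored at all but rediscovered by asking the filter about each of the four possible next k-mers. The Bloom filter class, its sizes, and the salted BLAKE2b hashing are assumptions made for this example rather than the lab's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def successors(bloom, kmer):
    """Implicit edges: try all four one-base extensions of the k-mer."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


def walk(bloom, start, max_steps=1000):
    """Follow unambiguous edges from `start`, spelling out the path."""
    path, node, seen = start, start, {start}
    for _ in range(max_steps):
        nxt = [n for n in successors(bloom, node) if n not in seen]
        if len(nxt) != 1:
            break  # dead end or branch
        node = nxt[0]
        seen.add(node)
        path += node[-1]
    return path


read, k = "AGTCGGCATGAC", 6
bloom = BloomFilter(num_bits=100_000, num_hashes=3)
for i in range(len(read) - k + 1):
    bloom.add(read[i:i + k])
print(walk(bloom, read[:k]))
```

A Bloom filter has no false negatives, so every real k-mer and edge is found; a false positive shows up as a spurious node or branch, which is the distortion the following slides quantify.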
False positives introduce false
nodes/edges.
              When does this start to distort the graph?
Average component size remains low
through 18% FPR.
Graph diameter remains constant
through 18% FPR.
Global graph structure is retained past
18% FPR


[Figure panels: graph structure at 1%, 5%, 10%, and 15% false positive rate.]
Equivalent to bond percolation problem; percolation
threshold independent of k (?)
This data structure is strikingly
efficient for storing sparse k-mer
graphs.




       “Exact” is for best possible information-theoretical storage.
We implemented graph partitioning
     on top of this probabilistic de Bruijn
     graph.


Split reads into “bins”
 belonging to
 different source
 species.
Can do this based
 almost entirely on
 connectivity of
 sequences.
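
Binning by connectivity can be sketched with a union-find structure. As a simplification for this example, two reads are placed in the same bin when they share at least one k-mer, and an exact dictionary stands in for the probabilistic graph; the function name and the toy reads are likewise assumptions.

```python
def partition_reads(reads, k):
    """Group reads into bins of transitively connected reads."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


reads = [
    "ATGGACCAGATG", "CCAGATGACACT",   # overlap each other
    "TTTTCCCCGGGG", "CCCCGGGGAAAA",   # overlap each other, not the first two
]
for number, members in enumerate(partition_reads(reads, k=6), start=1):
    print(f"bin {number}: {members}")
```

Each bin can then be handed to an assembler on its own, which is what makes the divide and conquer strategy on the next slide possible.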
Partitioning scales assembly for a
subset of problems.
 Can be done in ~10x less memory than assembly.
 Partition at low k and assemble exactly at any higher
  k (DBG).
 Partitions can then be assembled independently
   Multiple processors -> scaling
   Multiple k, coverage -> improved assembly
   Multiple assembly packages (tailored to high
    variation, etc.)

 Can eliminate small partitions/contigs in the
  partitioning phase.
 An incredibly convenient approach enabling divide &
  conquer approaches across the board.
Technical challenges met (and defeated)
 Exhaustive in-memory traversal of graphs
 containing 5-15 billion nodes.

 Sequencing technology introduces false
 connections in graph (Howe et al., in prep.)

 Implementation lets us scale ~20x over other
 approaches.
Minia assembler
(minia.genouest.org)




                        Chikhi thesis presentation
Our approaches yield a variety of
strategies

   Metagenomic data -> Partitioning -> Assembly (one assembly per partition)

   Shotgun data -> Digital normalization -> Shotgun data -> Assembly
Concluding thoughts, thus far
 Our approaches provide significant and
  substantial practical and theoretical leverage to
  one of the most challenging current problems in
  computational biology: assembly.
 They also improve quality of analysis, in some
  cases.
 They provide a path to the future:
   Many-core compatible; distributable?
   Decreased memory footprint => cloud computing
   can be used for many analyses.
 They are in use, ~dozens of labs using digital
 normalization.
Future research
Many directions in the works! (see posted grant
props)

 Theoretical groundwork for normalization
    approach.
   Graph search & alignment algorithms.
   Error detection & correction.
   Resequencing analysis.
   Online (“infinite”) assembly.
Streaming Twitter analysis.
Running HMMs over de Bruijn graphs
 (=> cross validation)


                                           hmmgs: Assemble
                                            based on good-scoring
                                            HMM paths through the
                                            graph.
                                           Independent of other
                                            assemblers; very
                                            sensitive, specific.
                                           95% of hmmgs rplB
                                            domains are present in
                                            our partitioned
                                            assemblies.
Jordan Fish, Qiong Wang, and Jim Cole (RDP)
Side note: error correction is the
biggest “data” problem left in
sequencing.




        Both for mapping & assembly.
Streaming error correction.
First pass (all reads):
   Does read come from a high-coverage locus?
     Yes! Error-correct low-abundance k-mers in read.
     No!  Add read to graph and save for later.

Second pass (only saved reads):
   Does read come from a now high-coverage locus?
     Yes! Error-correct low-abundance k-mers in read.
     No!  Leave unchanged.


             We can do error trimming of
genomic, MDA, transcriptomic, metagenomic data in < 2
                passes, fixed memory.
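
The two-pass flow can be sketched as below. This is a toy under several assumptions: it trims a read at its first low-abundance k-mer rather than correcting it, uses an exact `Counter` instead of a fixed-memory sketch, and uses the median k-mer count as the coverage test; names and thresholds are invented for the example.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def trim_at_low_abundance(read, counts, k, min_abundance):
    """Cut the read just before the first base covered only by rare k-mers."""
    for i, kmer in enumerate(kmers(read, k)):
        if counts[kmer] < min_abundance:
            return read[:i + k - 1]
    return read


def streaming_trim(reads, k=4, coverage_cutoff=3, min_abundance=2):
    counts, saved, output = Counter(), [], []

    def high_coverage(read):
        read_kmers = kmers(read, k)
        return bool(read_kmers) and \
            median(counts[km] for km in read_kmers) >= coverage_cutoff

    # First pass: trim reads from loci that are already well covered;
    # otherwise add the read to the counts and save it for later.
    for read in reads:
        if high_coverage(read):
            output.append(trim_at_low_abundance(read, counts, k, min_abundance))
        else:
            counts.update(kmers(read, k))
            saved.append(read)

    # Second pass: only the saved reads, against the now-complete counts.
    for read in saved:
        if high_coverage(read):
            output.append(trim_at_low_abundance(read, counts, k, min_abundance))
        else:
            output.append(read)  # still low coverage: leave unchanged
    return output


good = "ATGGACCAGATGACAC"
bad = "ATGGACCAGATGACTC"  # one substituted base near the 3' end
result = streaming_trim([good] * 5 + [bad])
print(result)
```

Only the reads saved in the first pass are revisited, which is why the total work stays below two full passes over the data.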
  We have just submitted a proposal to adapt Euler or
  Quake-like error correction (e.g. spectral alignment
2012 talk to CSE department at U. Arizona

2012 talk to CSE department at U. Arizona

  • 1. Streaming lossy compression of biological sequence data using probabilistic data structures C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University August 2012 ctb@msu.edu
  • 2. Acknowledgements Lab members involved Collaborators  Adina Howe (w/Tiedje)  Jim Tiedje, MSU  Jason Pell  Arend Hintze  Billie Swalla, UW  Rosangela Canino-  Janet Jansson, LBNL Koning  Qingpeng Zhang  Susannah Tringe, JGI  Elijah Lowe  Likit Preeyanon Funding  Jiarong Guo  Tim Brom USDA NIFA; NSF IOS;  Kanchan Pavangadkar BEACON.  Eric McDonald
  • 3.
  • 4. We practice open science! “Be the change you want” Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog („titus brown blog‟)  Twitter: @ctitusbrown  Grants on Lab Web site: http://ged.msu.edu/interests.html  Preprints: on arXiv, q-bio: „diginorm arxiv‟
  • 5. Shotgun metagenomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. Wikipedia: Environmental shotgun sequencing.p
  • 6. Assembly It was the best of times, it was the wor , it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness 
but for lots and lots of fragments!
  • 7. Assemble based on word overlaps: Repeats cause problems:
  • 8. Sequencers also produce errors
 It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
  • 9. Shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  • 10. Assembly – no subdivision! Assembly is inherently an all by all process. There is no good way to subdivide the reads without potentially missing a key connection
  • 11. Assembly – no subdivision! Assembly is inherently an all by all process. There is no good way to subdivide the reads without potentially missing a key connection I am, of course, lying. There were no good ways

  • 12. Four main challenges for de novo sequencing: (1) Repeats. (2) Low coverage. (3) Errors. These introduce breaks in the construction of contigs. (4) Variation in coverage (transcriptomes and metagenomes, as well as amplified genomic). This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.
  • 13. Repeats: Overlaps don't place sequences uniquely when there are repeats present. UMD assembly primer (cbcb.umd.edu)
  • 14. Coverage. Easy calculation: (# reads x avg read length) / genome size. So, for a haploid human genome (~3 Gbp): 30m reads x 100 bp = 3 bn bases, i.e. about 1x coverage.
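The coverage formula on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name is made up for illustration, and the genome size is rounded to 3 Gbp (an assumption; the slide gives only the 3 bn total).

```python
def average_coverage(num_reads: int, avg_read_length: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return num_reads * avg_read_length / genome_size

# Assumed round figure for a haploid human genome (~3 Gbp).
HUMAN_GENOME_BP = 3_000_000_000

coverage = average_coverage(30_000_000, 100, HUMAN_GENOME_BP)
print(f"{coverage:.1f}x")  # 1.0x
```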
  • 15. Coverage: “1x” doesn't mean every DNA sequence is read once. It means that, if sampling were systematic, it would be. Sampling isn't systematic, it's random!
  • 16. Actual coverage varies widely from the average.
  • 17. Actual coverage varies widely from the average. Low coverage introduces unavoidable breaks.
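One way to see why random sampling leaves gaps: under the usual idealization that read start positions are uniformly random (an assumption brought in here, not stated on the slides), per-base depth is approximately Poisson with mean equal to the average coverage c, so the expected fraction of bases never sequenced is e^(-c). A rough sketch:

```python
import math

def fraction_uncovered(mean_coverage: float) -> float:
    """Expected fraction of bases with zero reads, assuming Poisson-distributed depth."""
    return math.exp(-mean_coverage)

for c in (1, 5, 10, 30):
    print(f"{c:>2}x average coverage: {fraction_uncovered(c):.2e} of bases unsequenced")
```

Under this model, roughly 37% of bases are missed entirely at 1x, which is one reason real projects sequence far deeper than the average they nominally need.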
  • 18. Two basic assembly approaches: overlap/layout/consensus, and De Bruijn or k-mer graphs. The former is used for long reads, esp. all Sanger-based assemblies. The latter is used because of memory efficiency.
  • 19. Overlap/layout/consensus Essentially, 1. Calculate all overlaps (n^2) 2. Cluster based on overlap. 3. Do a multiple sequence alignment UMD assembly primer (cbcb.umd.edu)
  • 20. K-mer graph Break reads (of any length) down into multiple overlapping words of fixed length k. ATGGACCAGATGACAC (k=12) => ATGGACCAGATG TGGACCAGATGA GGACCAGATGAC GACCAGATGACA ACCAGATGACAC
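The decomposition on this slide is a sliding window of width k. A minimal sketch (the function name `kmers` is chosen here for illustration):

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping word of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

for word in kmers("ATGGACCAGATGACAC", 12):
    print(word)
# ATGGACCAGATG
# TGGACCAGATGA
# GGACCAGATGAC
# GACCAGATGACA
# ACCAGATGACAC
```

A read of length L yields L - k + 1 k-mers, so the 16-base example gives 5 words at k=12.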
  • 21. K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)
  • 22. K-mer graph (k=14) Each node represents a 14-mer; Links between each node are 13-mer overlaps
  • 23. K-mer graph (k=14) Branches in the graph represent partially overlapping sequences.
  • 24. K-mer graph (k=14) Single nucleotide variations cause long branches
  • 25. K-mer graph (k=14) Single nucleotide variations cause long branches; they don't rejoin quickly.
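The "long branch" effect can be reproduced on a toy example: one substituted base changes every k-mer window that covers it, so an interior single-nucleotide variant adds up to k new nodes before the path rejoins. The sketch below uses a made-up 40-base reference containing only A, C and G, so that a substituted T cannot coincide with any existing k-mer; the reference and the function names are illustrative assumptions.

```python
def kmer_set(sequence: str, k: int) -> set[str]:
    """All distinct k-mers (graph nodes) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def snv_branch_nodes(reference: str, position: int, new_base: str, k: int) -> set[str]:
    """Nodes that exist only because of a single-base substitution."""
    variant = reference[:position] + new_base + reference[position + 1:]
    return kmer_set(variant, k) - kmer_set(reference, k)

# Hypothetical 40-base reference using only A, C and G, so that a
# substituted T cannot recreate any k-mer already in the reference.
reference = "ACGGCAAGCC" + "AGACGCAGGA" + "CCGAAGCGAC" + "AGGCCAAGAC"

branch = snv_branch_nodes(reference, position=20, new_base="T", k=14)
print(len(branch))  # 14: one new node for every 14-mer window covering the SNV
```

The same reasoning applies to sequencing errors, which is why erroneous reads inflate the number of graph nodes and hence memory use.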
  • 26. K-mer graphs – choosing paths For decisions about which paths etc, biology-based heuristics come into play as well.
  • 27. The computational conundrum More data => better. and More data => computationally more challenging.
  • 28. Reads vs edges (memory) in de Bruijn graphs Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
  • 29. The scale of the problem is stunning.
      - I estimate a worldwide capacity for DNA sequencing of 15 petabases/yr (it's probably larger).
      - Individual labs can generate ~100 Gbp in ~1 week for $10k.
      - This sequencing is at a boutique level:
          - Sequencing formats are semi-standard.
          - Basic analysis approaches are ~80% cookbook.
          - Every biological prep, problem, and analysis is different.
      - Traditionally, biologists receive no training in computation. (And computational people receive no training in biology :)
      - ...and our computational infrastructure is optimizing for high performance computing, not high throughput.
  • 30. My problems are also very annoying. (From Monday seminar) Est ~50 Tbp to comprehensively sample the microbial composition of a gram of soil. Currently we have approximately 2 Tbp spread across 9 soil samples. Need 3 TB RAM on single chassis to do assembly of 300 Gbp. => estimate 500 TB RAM for 50 Tbp of sequence. That just won't do.
  • 31. Theoretical => applied solutions. Theoretical advances in data structures and algorithms; practically useful & usable implementations, at scale; demonstrated effectiveness on real data.
  • 32. Three parts to our solution. 1. Adaptation of a suite of probabilistic data structures for representing set membership and counting (Bloom filters and CountMin Sketch). 2. An online streaming approach to lossy compression. 3. Compressible de Bruijn graph representation.
  • 33. 1. CountMin Sketch. To add element: increment associated counter at all hash locales. To get count: retrieve minimum counter across all hash locales. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
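The two operations described on this slide fit in a small class. This is a minimal sketch, not the lab's implementation: the salted SHA-256 hashing, the default `width` and `depth`, and all names are assumptions chosen for clarity rather than speed.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: `depth` hash tables, each with `width` counters."""

    def __init__(self, width: int = 1000, depth: int = 4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _locations(self, item: str):
        # One (row, column) per table; salting with the row index is an
        # illustrative stand-in for a family of independent hash functions.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str) -> None:
        """Increment the associated counter in every table."""
        for row, col in self._locations(item):
            self.tables[row][col] += 1

    def count(self, item: str) -> int:
        """Return the minimum counter across tables (never an undercount)."""
        return min(self.tables[row][col] for row, col in self._locations(item))

sketch = CountMinSketch()
for _ in range(3):
    sketch.add("ATGGACCAGATG")
print(sketch.count("ATGGACCAGATG"))  # 3
```

Collisions can only inflate a counter, so taking the minimum gives the estimate least affected by them; memory is fixed by `width * depth` regardless of how many distinct k-mers are seen.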
  • 34. Our approach is very memory efficient

  • 35. ...and does not introduce significant miscounts on NGS data sets.
  • 36. 2. Online, streaming, lossy compression (NOVEL). Much of next-gen sequencing is redundant.
  • 37. Uneven coverage => even more redundancy (NOVEL). Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
  • 38. Can we preferentially retain reads that contain “true edges”? Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
  • 39. Downsample based on de Bruijn graph structure; this can be derived online.
  • 40. Digital normalization algorithm
        for read in dataset:
            if estimated_coverage(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                # discard read
        Note: single pass; fixed memory.
  • 41. The median k-mer count in a “sentence” is a good estimator of redundancy within the graph. This gives us a reference-free measure of coverage.
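Putting the previous two slides together, the algorithm can be sketched as a generator that estimates each read's coverage from the median count of its k-mers. This is a simplified illustration under stated assumptions: it uses an exact `Counter` where the talk uses a CountMin Sketch (so memory here is not fixed), and the `k` and `cutoff` values in the usage lines are toy values, not recommended settings.

```python
from collections import Counter
from statistics import median

def kmers(sequence, k):
    """Overlapping words of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def digital_normalization(reads, k, cutoff):
    """Yield reads whose estimated coverage is still below `cutoff` (single pass)."""
    counts = Counter()  # exact stand-in for the CountMin Sketch
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: no coverage estimate possible
        # Median k-mer count so far = reference-free coverage estimate.
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)
            yield read
        # else: discard the read without updating the counts

reads = ["ATGGACCAGATGACAC"] * 10
kept = list(digital_normalization(reads, k=4, cutoff=3))
print(len(kept))  # 3
```

Using the median rather than the mean keeps a few erroneous (count-1) k-mers in a read from dragging the estimate down, so high-coverage reads with errors are still discarded.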
  • 42. Digital normalization retains information, while discarding data and errors
  • 43. Contig assembly now scales with underlying genome size. Transcriptomes, microbial genomes incl MDA, and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results. Memory efficiency is improved by use of CountMin Sketch.
  • 44. 3. Compressible de Bruijn graphs (NOVEL). Each node represents a 14-mer; links between each node are 13-mer overlaps.
  • 45. Can store implicit de Bruijn graphs in a Bloom filter. [Figure: AGTCGGCATGAC decomposed into the 6-mers AGTCGG, GTCGGC, TCGGCA, CGGCAT, GGCATG, GCATGA, CATGAC, linked by the added bases C, A, T, G, A, C and stored in a Bloom filter.]
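The idea on this slide is that only the nodes (k-mers) are stored; edges are implied, because a k-mer's possible neighbors can be generated and tested for membership. The sketch below is a toy illustration under assumptions: the filter size, the salted SHA-256 hashing, and the names `BloomFilter` and `successors` are not taken from the talk's code.

```python
import hashlib

class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, size: int = 10_000, num_hashes: int = 4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, for readability

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

def successors(kmer: str, graph: BloomFilter) -> list[str]:
    """Implicit edges: try all four one-base extensions and keep those present."""
    return [kmer[1:] + base for base in "ACGT" if (kmer[1:] + base) in graph]

k = 6
sequence = "AGTCGGCATGAC"
graph = BloomFilter()
for i in range(len(sequence) - k + 1):
    graph.add(sequence[i:i + k])

# Walks one step along the path; a false positive could add a spurious neighbor.
print(successors("AGTCGG", graph))
```

A false positive in the filter shows up as a false node or edge in the graph, which is the distortion examined on the following slides.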
  • 46. False positives introduce false nodes/edges. When does this start to distort the graph?
  • 47. Average component size remains low through 18% FPR.
  • 48. Graph diameter remains constant through 18% FPR.
  • 49. Global graph structure is retained past 18% FPR. [Figure panels: 1%, 5%, 10%, 15% FPR]
  • 50. Equivalent to bond percolation problem; percolation threshold independent of k (?)
  • 51. This data structure is strikingly efficient for storing sparse k-mer graphs. “Exact” is for best possible information-theoretical storage.
  • 52. We implemented graph partitioning on top of this probabilistic de Bruijn graph. Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences.
  • 53. Partitioning scales assembly for a subset of problems.
      - Can be done in ~10x less memory than assembly.
      - Partition at low k and assemble exactly at any higher k (DBG).
      - Partitions can then be assembled independently:
          - Multiple processors -> scaling
          - Multiple k, coverage -> improved assembly
          - Multiple assembly packages (tailored to high variation, etc.)
      - Can eliminate small partitions/contigs in the partitioning phase.
      - An incredibly convenient approach enabling divide & conquer approaches across the board.
  • 54. Technical challenges met (and defeated): Exhaustive in-memory traversal of graphs containing 5-15 billion nodes. Sequencing technology introduces false connections in graph (Howe et al., in prep.). Implementation lets us scale ~20x over other approaches.
  • 55. Minia assembler (minia.genouest.org). Chikhi thesis presentation
  • 56. Our approaches yield a variety of strategies. [Figure: workflow diagram with boxes labeled Metagenomic data, Partitioning, Shotgun data, Digital normalization, and Assembly.]
  • 57. Concluding thoughts, thus far
      - Our approaches provide significant and substantial practical and theoretical leverage to one of the most challenging current problems in computational biology: assembly.
      - They also improve quality of analysis, in some cases.
      - They provide a path to the future:
          - Many-core compatible; distributable?
          - Decreased memory footprint => cloud computing can be used for many analyses.
      - They are in use, ~dozens of labs using digital normalization.
  • 58. Future research. Many directions in the works! (see posted grant props): theoretical groundwork for normalization approach; graph search & alignment algorithms; error detection & correction; resequencing analysis; online (“infinite”) assembly.
  • 60. Running HMMs over de Bruijn graphs (=> cross validation). hmmgs: Assemble based on good-scoring HMM paths through the graph. Independent of other assemblers; very sensitive, specific. 95% of hmmgs rplB domains are present in our partitioned assemblies. Jordan Fish, Qiong Wang, and Jim Cole (RDP)
  • 61. Side note: error correction is the biggest “data” problem left in sequencing. Both for mapping & assembly.
  • 62. Streaming error correction. First pass (all reads): Does read come from a high-coverage locus? Yes: error-correct low-abundance k-mers in read. No: add read to graph and save for later. Second pass (only saved reads): Does read come from a now high-coverage locus? Yes: error-correct low-abundance k-mers in read. No: leave unchanged. We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2 passes, fixed memory. We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment

Editor's notes

  1. High coverage is essential.
  2. High coverage is essential.
  3. Note, no tolerance for indels
  4. Note that any such measure will do.
  5. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  6. Completely different style of assembler; useful for cross validation.