Digital normalization and some consequences.
(Talk at TGAC, November 4, 2013)
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Nov 2013
ctb@msu.edu
Acknowledgements

Lab members involved:
 Adina Howe (w/ Tiedje)
 Jason Pell
 Arend Hintze
 Rosangela Canino-Koning
 Qingpeng Zhang
 Elijah Lowe
 Likit Preeyanon
 Jiarong Guo
 Tim Brom
 Kanchan Pavangadkar
 Eric McDonald
 Chris Welcher
 Michael Crusoe

Collaborators:
 Jim Tiedje, MSU
 Erich Schwarz, Caltech / Cornell
 Paul Sternberg, Caltech
 Robin Gasser, U. Melbourne
 Weiming Li

Funding: USDA NIFA; NSF IOS; NIH; BEACON.
We practice open science!
"Be the change you want"
Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ('titus brown blog')
 Twitter: @ctitusbrown
 Grants on lab web site: http://ged.msu.edu/interests.html
 Preprints: on arXiv, q-bio ('diginorm arxiv')
Outline
1. Digital normalization basics
2. Diginorm as streaming lossy compression of NGS data…
3. …surprisingly useful.
4. Three new directions:
   1. Reference-free data set investigation
   2. Streaming algorithms
   3. Open protocols
Philosophy: hypothesis generation is important.
 We need better methods to investigate and analyze large sequencing data sets.
 To be most useful, these methods should be fast & computationally efficient, because:
    data gathering rates are already quite high;
    this allows iteration.
 Better methods for good computational hypothesis generation are critical to moving forward.
High-throughput sequencing
 I mostly work on ecologically and evolutionarily interesting organisms.
 This includes non-model transcriptomes and environmental metagenomes.
 Volume of data is a huge problem because of the diversity of these samples, and because assembly must be applied to them.
Why are big data sets difficult?
 Need to resolve errors: the more coverage there is, the more errors there are.
 Memory usage ~ "real" variation + number of errors
 Number of errors ~ size of data set

There is quite a bit of life left to sequence & assemble.
http://pacelab.colorado.edu/
Shotgun sequencing and coverage
"Coverage" is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.

Random sampling => deep sampling needed
Typically 10-100x is needed for robust recovery (300 Gbp for human).
Mixed populations.
 Approximately 20-40x coverage is required to assemble the majority of a bacterial genome from short reads; 100x is required for a "good" assembly.
 To sample a mixed population thoroughly, you need to sample 100x of the lowest-abundance species present.
 For example, for E. coli at a 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000, or 500 Gbp of sequence! (See the quick check below.)
 …actually getting this much sequence is fairly easy, but it is then hard to assemble on a reasonable computer.
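The 500 Gbp figure follows directly from the numbers on the slide; here is a quick sanity check in Python (the ~5 Mbp genome size and 1/1000 dilution are the example values stated above):

    # Sequencing needed to reach 100x coverage of a 1/1000-abundance, 5 Mbp genome.
    genome_size_bp = 5_000_000      # ~5 Mbp, E. coli-sized (value from the slide)
    target_coverage = 100           # coverage needed for a "good" assembly
    dilution = 1 / 1000             # relative abundance of the target species

    total_bp = genome_size_bp * target_coverage / dilution
    print(f"{total_bp / 1e9:.0f} Gbp of shotgun sequence")   # -> 500 Gbp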
Approach: Digital normalization
(a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
The high-coverage reads in sample A are unnecessary for assembly and, in fact, distract.
Digital normalization (the idea is illustrated over a series of figure-only slides).
How can this possibly work!?
All you really need is a way to estimate the coverage of a read in a data set without an assembly:

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            save(read)

(This read coverage estimator does need to be error-tolerant.)

The median k-mer count in a read is a good estimator of coverage. This gives us a reference-free measure of coverage.
Digital normalization algorithm

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass   # discard read

Note: single pass; fixed memory.
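Putting the last two slides together, here is a minimal, self-contained sketch of the algorithm in plain Python. It is illustrative only: it uses an exact dictionary for k-mer counts for clarity, whereas the real implementation in khmer uses a fixed-memory CountMin Sketch (see below) and handles details such as reverse complements, paired reads, and quality values.

    from statistics import median

    K = 20
    CUTOFF = 20
    kmer_counts = {}   # exact counts for illustration; khmer uses a CountMin Sketch

    def kmers(seq, k=K):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def estimated_coverage(read):
        # Median per-k-mer count is robust to the few error k-mers in a read.
        counts = [kmer_counts.get(km, 0) for km in kmers(read)]
        return median(counts) if counts else 0

    def update_kmer_counts(read):
        for km in kmers(read):
            kmer_counts[km] = kmer_counts.get(km, 0) + 1

    def normalize(reads):
        """Single streaming pass: keep reads whose estimated coverage is still low."""
        for read in reads:
            if estimated_coverage(read) < CUTOFF:
                update_kmer_counts(read)
                yield read
            # else: discard the read

Because counts are only updated for reads that are kept, high-coverage regions quickly stop accepting new reads while low-coverage regions continue to accumulate them.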
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
 is streaming and single pass: looks at each read only once;
 does not "collect" the majority of errors;
 keeps all low-coverage reads;
 smooths out coverage of regions.

Key: the underlying assembly graph structure is retained.
Diginorm as a filter
 Diginorm is a pre-filter: it loads in reads & emits (some) of them.
 You can then assemble the reads however you wish.

Pipeline: Reads → Read filter/trim → Digital normalization to C=20 → Error trim with k-mers → Digital normalization to C=5 → Assemble with your favorite assembler → Calculate abundances of contigs.
Contig assembly now scales with underlying genome size
 Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
 Memory efficiency is improved by use of a CountMin Sketch.
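For readers unfamiliar with the data structure, here is a minimal CountMin Sketch in plain Python. It is a generic sketch of the idea, not khmer's actual counting code (which is implemented in C++ and tuned for DNA k-mers); the width, depth, and hashing scheme below are arbitrary illustrative choices.

    import hashlib

    class CountMinSketch:
        """Fixed-memory approximate counter: never undercounts, may overcount."""
        def __init__(self, width=100000, depth=4):
            self.width = width
            self.depth = depth
            self.tables = [[0] * width for _ in range(depth)]

        def _indexes(self, item):
            # One hash-derived index per row; a per-row salt makes rows independent.
            for row in range(self.depth):
                h = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
                yield row, int(h, 16) % self.width

        def add(self, item):
            for row, idx in self._indexes(item):
                self.tables[row][idx] += 1

        def count(self, item):
            # Collisions only inflate counts, so the minimum across rows is closest.
            return min(self.tables[row][idx] for row, idx in self._indexes(item))

In the diginorm sketch above, this class could stand in for the exact dictionary: add() in place of incrementing the dict, count() in place of kmer_counts.get().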
Digital normalization retains information, while discarding data and errors.
Lossy compression (illustrated over several slides using JPEG image compression: http://en.wikipedia.org/wiki/JPEG)
Raw data (~10-100 GB) → Compression (~2 GB) → Analysis → "Information" (~1 GB) → Database & integration (combining "information" from many such data sets).

Lossy compression can substantially reduce data size while retaining the information needed for later (re)analysis.
Some diginorm examples:
1. Assembly of the H. contortus parasitic nematode genome, a "high polymorphism / variable coverage" problem. (Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a "big assembly" problem. (in prep)
3. Assembly of two Midwest soil metagenomes, Iowa corn and Iowa prairie – the "impossible" assembly problem.
Diginorm works well.
 Significantly decreases memory requirements, esp. for metagenome and transcriptome assemblies.
 Memory required for assembly now scales with richness rather than diversity.
 Works on the same underlying principle as assembly, so assembly results can be nearly identical.
Diginorm works well.
 Improves some (many?) assemblies, especially for:
    repeat-rich data;
    highly polymorphic samples;
    data with significant sequencing bias.
Diginorm works well.
 Nearly perfect lossy compression from an information-theoretic perspective:
    discards 95% or more of the data for genomes;
    loses < 0.02% of the information.
Drawbacks of diginorm
 Some assemblers do not perform well downstream of diginorm:
    altered coverage statistics;
    removal of repeats.
 No well-developed theory.
 …not yet published (but the paper is available as a preprint, with ~10 citations).
Diginorm is in wide (?) use
 Dozens to hundreds of labs using it.
 Seven research publications (at least) using it already.
 A diginorm-derived algorithm, in silico normalization, is now a standard part of the Trinity mRNAseq pipeline.
Whither goest our research?
1. Pre-assembly analysis of shotgun data.
2. Moving more sequence analysis onto a streaming, reference-free basis.
3. Computing in the cloud.
1. Pre-assembly analysis of shotgun data
Rationale:
 Assembly is a "big black box" – data goes in, contigs come out, ???
 In cases where assembly goes wrong, or does not yield hoped-for results, we need methods to diagnose potential problems.
Perpetual Spouter hot spring
(Yellowstone)

Eric Boyd, Montana State U.
Data gathering =? Assembly
 Estimated low-complexity hot spring (~3-6 species).
 25m MiSeq reads (2x250), but no good assembly. Why?
 Several possible reasons:
    bad data;
    significant strain variation;
    low coverage;
    ??
Information saturation curve ("collector's curve") suggests more information needed.
Note: saturation to C=20.
Read coverage spectrum: many reads with low coverage.
Cumulative read coverage: 60% of data < 20x coverage.
Cumulative read coverage: some very high coverage data.
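A read coverage spectrum like the one summarized here is easy to compute once you have per-read coverage estimates (for example, the median k-mer counts from the earlier sketch). A minimal, hedged example, assuming the per-read coverages are already in a list:

    from collections import Counter

    def coverage_spectrum(read_coverages, max_cov=200):
        """Histogram of per-read coverage estimates plus the cumulative fraction."""
        hist = Counter(min(int(c), max_cov) for c in read_coverages)
        total = sum(hist.values())
        cumulative, running = {}, 0
        for cov in range(max_cov + 1):
            running += hist.get(cov, 0)
            cumulative[cov] = running / total
        return hist, cumulative

    # Toy usage: what fraction of reads sits below 20x estimated coverage?
    hist, cumulative = coverage_spectrum([3, 5, 18, 25, 40, 7, 150, 12])
    print(f"{cumulative[19]:.0%} of reads < 20x")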
Hot spring data conclusions
(Figures repeated: many reads with low coverage; some very high coverage data.)
 Need ~5 times more sequencing: assemblers do not work well with reads < 20x coverage.
 But! The data is there, just at low coverage.
 Many sequence reads are from small, high-coverage genomes (probably phage); this "dilutes" the sequencing.
Directions for reference-free work:
 Richness estimation!
    MM5 deep carbon: 60 Mbp
    Great Prairie soil: 12 Gbp
    Amazon Rain Forest Microbial Observatory: 26 Gbp
 "How much more sequencing do I need to see X?"
 Correlation with 16S.

(Qingpeng Zhang)
2. Streaming/efficient reference-free analysis
 Streaming, online algorithms look at the data only ~once. (This is in contrast to most algorithms, which are "offline": they require that all data be loaded completely before analysis begins.)
 Diginorm is streaming and online…
 Conceptually, many aspects of sequence analysis can be moved into streaming mode.
=> Extraordinary potential for computational efficiency.
Example: calculating read error rates by position within a read
 Shotgun data is randomly sampled;
 any variation in mismatches against the reference by position is therefore likely due to errors or bias.

Approach: Reads → Assemble → Map reads to assembly → Calculate position-specific mismatches.

Reads from Shakya et al., pmid 2338786
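The position-specific mismatch calculation itself is simple once reads are mapped; here is a minimal sketch that assumes ungapped alignments supplied as (read, reference segment) pairs of equal length (a real pipeline would parse mapper output, e.g. a BAM file, instead):

    def positional_mismatch_rates(alignments, read_len=100):
        """Fraction of mapped reads with a mismatch at each position in the read."""
        mismatches = [0] * read_len
        totals = [0] * read_len
        for read, ref in alignments:
            for pos, (base, ref_base) in enumerate(zip(read, ref)):
                totals[pos] += 1
                if base != ref_base:
                    mismatches[pos] += 1
        return [m / t if t else 0.0 for m, t in zip(mismatches, totals)]

    # Toy usage: one perfect read, one read with an error at its last position.
    rates = positional_mismatch_rates([("ACGT", "ACGT"), ("ACGA", "ACGT")], read_len=4)
    print(rates)   # [0.0, 0.0, 0.0, 0.5]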
Diginorm can detect graph saturation
Reads from Shakya et al., pmid 2338786
Reference-free error profile analysis
1. Requires no prior information!
2. Immediate feedback on sequencing quality (for cores & users).
3. Fast, lightweight (~100 MB, ~2 minutes).
4. Works for any shotgun sample (genomic, metagenomic, transcriptomic).
5. Not affected by polymorphisms.

Reference-free error profile analysis (continued)
7. …if we know where the errors are, we can trim them (a minimal sketch follows below).
8. …if we know where the errors are, we can correct them.
9. …if we look at differences by graph position instead of by read position, we can call variants.

=> Streaming, online variant calling.
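Item 7 is the easiest to sketch: with k-mer counts from the data set itself, a read can be truncated at the first k-mer that falls below an abundance cutoff. This is only an illustrative sketch of the idea; khmer's actual trimming tools differ in detail and handle variable-coverage data more carefully. The k size and cutoff below are arbitrary example values.

    def trim_at_low_abundance(read, kmer_counts, k=20, cutoff=2):
        """Return the read prefix that ends before the first low-abundance k-mer."""
        for i in range(len(read) - k + 1):
            if kmer_counts.get(read[i:i+k], 0) < cutoff:
                # Earlier k-mers were solid, so only the final base of this
                # k-mer is suspect; keep everything before it.
                return read[:i + k - 1]
        return read

In practice the counts would come from the same fixed-memory counting structure used for normalization.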
Streaming online reference-free variant calling
Single pass, reference-free, tunable, streaming online variant calling.
Coverage is adjusted to retain signal.
Directions for streaming graph analysis
 Generate error profiles for shotgun reads;
 Variable-coverage error trimming;
 Streaming low-memory error correction for genomes, metagenomes, and transcriptomes;
 Strain variant detection & resolution;
 Streaming variant analysis.

(Jordan Fish & Jason Pell)
3. Computing in the cloud
 Rental or "cloud" computers enable expenditures on computing resources only on demand.
 Everyone is generating data, but few have the expertise or computational infrastructure to analyze it.
 Assembly has traditionally been "expensive", but diginorm makes it cheap…
khmer-protocols
 Close-to-release effort to provide standard "cheap" assembly options in the cloud.
 Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
 Open, versioned, forkable, citable.

Pipeline: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression.
Concluding thoughts
 Diginorm is a practically useful technique for enabling more/better assembly.
 However, it also offers a number of opportunities to put sequence analysis on a streaming basis.
 The underlying basis is really simple, but with (IMO) profound implications: streaming, low memory.
Other interests!
 "Better Science through Superior Software"
 Open science / data / source
 Training!
    Software Carpentry: "zero-entry" and advanced workshops
 Reproducible research
 IPython Notebook!!!!!
IPython Notebook: data + code


Editor's notes

1. Add CS; remove Rose
2. Note that any such measure will do.
3. Goal is to do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
4. @@ do at lower cov?
5. Add CS; remove Rose