Like the Dog that Caught the Bus:
Sequencing, Big Data, and Biology
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Jan 2014
ctb@msu.edu
20 years in…
• Started working in Dr. Koonin's group in 1993;
• First publication was submitted almost exactly 20 years ago!
Analogy: we seek an understanding
of humanity via our libraries.

http://eofdreams.com/library.html;
But, our only observation tool is
shredding a mixture of all of the
books & digitizing the shreds.

http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
Points:
• Lots of fragments needed! (Deep sampling.)
• Having read and understood some books will help quite a bit. (Prior knowledge.)
• Rare books will be harder to reconstruct than common books.
• Errors in the OCR process matter quite a bit.
• The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books.
• A categorization system would be an invaluable but not infallible guide to book topics.
• Understanding the language would help you validate & understand the books.
Biological analog: shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.

“Sequence it all and let the bioinformaticians sort it out”
Wikipedia: Environmental shotgun
sequencing.png
Investigating soil microbial communities
• 95% or more of soil microbes cannot be cultured in lab.
• Very little transport in soil and sediment => slow mixing rates.
• Estimates of immense diversity:
  • Billions of microbial cells per gram of soil.
  • Million+ microbial species per gram of soil (Gans et al., 2005)
  • One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory)
“By 'soil' we understand (Vil'yams, 1931) a loose
surface layer of earth capable of yielding plant
crops. In the physical sense the soil represents a
complex disperse system consisting of three
phases: solid, liquid, and gaseous.”

Microbes live in & on:
• Surfaces of aggregate particles;
• Pores within microaggregates;

N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
Questions to address
• Role of soil microbes in nutrient cycling:
  • How does agricultural soil differ from native soil?
  • How do soil microbial communities respond to climate perturbation?
• Genome-level questions:
  • What kind of strain-level heterogeneity is present in the population?
  • What are the phage and viral populations & dynamics?
  • What species are where, and how much is shared between different geographical locations?
Must use culture-independent and metagenomic approaches
• Many reasons why you can't or don't want to culture:
  • Syntrophic relationships
  • Niche-specificity or unknown physiology
  • Dormant microbes
  • Abundance within communities
• If you want to get at underlying function, 16S analysis alone is not sufficient.

Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.
Computational reconstruction of
(meta)genomic content.

http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
Points:
• Lots of fragments needed! (Deep sampling.)
• Having read and understood some books will help quite a bit. (Reference genomes.)
• Rare books will be harder to reconstruct than common books.
• Errors in the OCR process matter quite a bit. (Sequencing error.)
• The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. (We don’t understand most microbial function.)
• A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.)
• Understanding the language would help you validate & understand the books.
Great Prairie Grand Challenge: sampling locations

2008
A “Grand Challenge” dataset (DOE/JGI)
Total: 1,846 Gbp soil metagenome
[Figure: bar chart of basepairs of sequencing (Gbp) per sample, sequenced on GAII and HiSeq. Samples: Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass. Reference levels: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp.]
Why do we need so much data?!
• 20-40x coverage is necessary; 100x is ~sufficient.
• Mixed population sampling => sensitivity driven by lowest abundance.
• For example, for E. coli at 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000, or 500 Gbp of sequence! (For soil, estimate is 50 Tbp)
• Sequencing is straightforward; data analysis is not.

“$1000 genome with $1m analysis”
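
The E. coli figure on this slide is plain arithmetic, and a short sketch makes the scaling explicit. The function and parameter names below are illustrative, not from the talk, and the model is idealized (uniform sampling, no errors or bias).

```python
def required_sequencing_bp(genome_size_bp, target_coverage, dilution_factor):
    """Bases to sequence so that a genome diluted 1/dilution_factor in a
    mixed sample still reaches target_coverage (idealized model)."""
    return genome_size_bp * target_coverage * dilution_factor


# 5 Mbp genome, 100x coverage, present at 1/1000 of the sample.
ecoli_bp = required_sequencing_bp(5_000_000, 100, 1000)
print(f"{ecoli_bp / 1e9:.0f} Gbp")
```

The required depth grows linearly with the dilution factor, which is why the rarest organism of interest sets the sequencing budget.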
Great Prairie Grand Challenge goals
• How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest data sets thus far.)
• What can we learn about soil from looking at the reconstructed metagenome? (See list of questions)
(For complex ecological and evolutionary systems, we're just starting to get past the first question. More on that later.)
So, we want to go from raw data:
@SRR606249.17/1   (name)
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1   (quality score)
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
…to “assembled” original sequence.

UMD assembly primer (cbcb.umd.edu)
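
The raw data shown above is in FASTQ format: four lines per read (name, bases, separator, per-base quality). A minimal parsing sketch follows. It assumes Phred+33 quality encoding, which the slide does not state, and the helper name `read_fastq` is made up for illustration.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality_scores) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip()
        # Assumption: Phred+33 encoding (ASCII code minus 33).
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


example = StringIO(
    "@SRR606249.17/1\n"
    "GAGTATGTTCTCATAGAGGTTGGTANNNNT\n"
    "+\n"
    "B@BDDFFFHHHHHJIJJJJGHIJHJ####1\n"
)
name, sequence, scores = next(read_fastq(example))
print(name, len(sequence), scores[25])
```

Under that encoding the uncalled `N` bases carry the lowest scores in the read, so an assembler or filter can discount them.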
De Bruijn graphs – assemble on
overlaps

J.R. Miller et al. / Genomics (2010)
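
To make "assemble on overlaps" concrete, here is a toy De Bruijn graph built from three overlapping reads. This is a sketch of the general idea only: the reads are invented, and real assemblers also handle reverse complements, errors, and branching, which this ignores.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow unambiguous edges from start, spelling out a contig."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:  # stop on cycles
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]  # invented overlapping reads
graph = de_bruijn_graph(reads, k=4)
print(walk(graph, "ATG"))
```

The walk recovers the sequence the three reads were cut from without ever comparing reads to each other directly; overlaps are implicit in the shared k-mers.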
Two problems: (1) variation/error

Single nucleotide variations cause long branches; they don't rejoin quickly.
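
One way to see why the branches are long: a single substitution changes every k-mer that spans it, so up to k new graph nodes appear before the two paths can rejoin. The sequences below are invented for illustration.

```python
def kmer_set(sequence, k):
    """All distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


reference = "ACGTTGCAAGGCTTAACCGGATCGTAGCTA"  # invented 30 bp sequence
variant = reference[:15] + "G" + reference[16:]  # one substitution (A -> G)

k = 8
novel = kmer_set(variant, k) - kmer_set(reference, k)
print(len(novel))  # number of k-mers seen only in the variant
```

Here the single changed base yields k novel k-mers, so with a typical assembly k in the tens, each variant or error adds tens of nodes to the graph.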
Two problems: (2) No graph locality.
Assembly is inherently an all-by-all process. There is no good way to subdivide the reads without potentially missing a key connection.
Assembly graphs scale with data size, not information.

Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
please email: journals.permissions@oup.com
Why do k-mer assemblers scale
badly?
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
Practical memory measurements

Velvet measurements (Adina Howe)
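
A toy simulation illustrates the scaling argument above. It is only a sketch under stated assumptions: a random 2 kbp "genome", uniform substitution errors at 1%, and distinct k-mers used as a stand-in for graph memory. None of these numbers come from the talk.

```python
import random


def distinct_kmers(reads, k):
    """Number of distinct k-mers across all reads (proxy for graph size)."""
    return len({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})


def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads uniformly and apply random substitution errors."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))
for n_reads in (400, 800, 1600):
    clean = simulate_reads(genome, n_reads, 100, 0.0, rng)
    noisy = simulate_reads(genome, n_reads, 100, 0.01, rng)
    print(n_reads, distinct_kmers(clean, 20), distinct_kmers(noisy, 20))
```

Error-free reads can never produce more distinct k-mers than the genome contains, while the count for noisy reads keeps climbing as more reads are added.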
The Problem
• We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware.
• No locality to the data in terms of graph structure.
• Since ~2008:
  • The field has engaged in lots of engineering optimization…
  • …but the data generation rate has consistently outstripped Moore's Law.
Our two solutions.
1. Subdivide data
2. Discard redundant data.
1. Data partitioning
(a computational version of cell sorting)
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
“Divide and conquer”
Memory-efficient implementation helps to scale assembly.

Pell et al., 2012, PNAS
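
Partitioning by connectivity can be sketched with a union-find structure: reads that share any k-mer land in the same bin. This exact, in-memory version is an illustration only and is not the memory-efficient implementation the slide refers to. The reads and the function name are invented.

```python
def partition_reads(reads, k):
    """Group reads into bins of reads connected by shared k-mers."""
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first read containing it
    for index, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                parent[find(index)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = index

    bins = {}
    for index, read in enumerate(reads):
        bins.setdefault(find(index), []).append(read)
    return list(bins.values())


reads = ["ATGGCGTGCA", "TTTACCCGGA", "GCGTGCAATT", "CCCGGATTTC"]
bins = partition_reads(reads, k=5)
print(bins)
```

Each bin can then be assembled independently, which is the "divide and conquer" step.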
Our two solutions.
1. Subdivide data (~20x scaling; 2 years to develop;
100x data increase)
2. Discard redundant data.
2. Approach: Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!
Diversity vs richness.

The high-coverage reads in sample A are unnecessary for assembly, and,
Shotgun sequencing and coverage

“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
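
Both views of coverage on this slide can be written down directly: the average over the genome, and the "line straight down" count at one position. The numbers and function names below are illustrative, not taken from the figure.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base."""
    return num_reads * read_length / genome_size


def depth_at(position, read_starts, read_length):
    """Number of reads covering one position (the vertical-line count)."""
    return sum(1 for start in read_starts
               if start <= position < start + read_length)


# 500,000 reads of 100 bp over a 5 Mbp genome.
print(average_coverage(500_000, 100, 5_000_000))

# Four reads of length 10 starting at these offsets; depth at position 6.
print(depth_at(6, [0, 2, 5, 7], 10))
```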
Most shotgun data is redundant.

You only need 5-10 reads at a locus to assemble or call (diploid) SNPs… but because sampling is random, and you need 5-10 reads at every locus, you
Digital normalization
Coverage estimation
If you can estimate the coverage of a read in a data set without a reference, this is straightforward:

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        save(read)

(Trick: the read coverage estimator needs to be error-tolerant.)
The median k-mer count in a read is a good approximate estimator of coverage.
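
Putting the loop and the median k-mer estimator together gives a small runnable sketch. It is a simplification under stated assumptions: an exact in-memory counter stands in for whatever memory-efficient counting structure a production tool would use, reverse complements are ignored, and the function names and default parameters are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    """All k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage is below the cutoff.

    Coverage is estimated as the median count, among reads kept so far,
    of the k-mers in the read. Counts are updated only for kept reads.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Twenty copies of the same read collapse to `cutoff` copies.
redundant = ["ACGTACGGTCAT"] * 20
print(len(normalize_by_median(redundant, k=8, cutoff=5)))
```

Using the median rather than the mean or minimum is what makes the estimate tolerant of a few erroneous k-mers: one sequencing error zeroes at most k counts in a read, which usually leaves the median unchanged.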
This gives us a reference-free measure of coverage.
Diginorm builds a De Bruijn graph & then downsamples based on observed coverage.
Corresponds exactly to underlying abstraction used for assembly; retains graph structure.
Digital normalization approach
• Is streaming and single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.

…raw data can be retained for later abundance estimation.
Contig assembly now scales with richness (information), not diversity (data).

Most samples can be assembled in < 50 GB of memory.
Diginorm is widely useful:
1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep)
3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al., 2013; pmid
Diginorm is “lossy compression”
• Nearly perfect from an information theoretic perspective:
  • Discards 95% or more of data for genomes.
  • Loses < 0.02% of information.
Prospective: sequencing tumor cells
• Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20x coverage: 60 Tbp of sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate data while retaining variant information.
Where are we taking this?
• Streaming online algorithms only look at data ~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
=> Streaming, online variant calling.

Single pass, reference free, tunable, streaming online variant calling.
Potentially quite clinically useful.
What about the assembly results for Iowa corn and prairie??

Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding
2.5 billion      4.5 million                19%                 5.3 million
3.5 billion      5.9 million                22%                 6.8 million

Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp

Adina Howe
Resulting contigs are low coverage.

Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
So, for soil:
• We really do need more data;
• But at least now we can assemble what we already have.
• Estimate required sequencing depth at 50 Tbp;
• Now also have 2-8 Tbp from Amazon Rain Forest Microbial Observatory.
• …still not saturated coverage, but getting closer.
But, diginorm approach turns out to be widely useful.
Biogeography: Iowa sample overlap?
Corn and prairie De Bruijn graphs have 51% overlap.

Corn

Prairie

Suggests that at greater depth, samples may have similar geno
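
The slide does not say how graph overlap was measured. One plausible reading, shown here purely as an assumption, is the fraction of k-mers shared between the two samples (Jaccard similarity of their k-mer sets). The reads and names are invented.

```python
def sample_kmers(reads, k):
    """Set of all k-mers observed in a sample's reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def kmer_overlap(reads_a, reads_b, k):
    """Shared k-mers divided by all k-mers seen in either sample."""
    a, b = sample_kmers(reads_a, k), sample_kmers(reads_b, k)
    return len(a & b) / len(a | b)


corn = ["ATGGCGTGCA"]     # invented stand-in for one sample
prairie = ["GCGTGCAATT"]  # invented stand-in for the other
print(f"{kmer_overlap(corn, prairie, k=5):.0%}")
```

Other definitions (for example, shared k-mers as a fraction of the smaller sample) would give different percentages for the same data, so the 51% figure should be read with the original method in mind.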
Concluding thoughts
• Empirically effective tools, in reasonably wide use.
• Diginorm provides streaming, online algorithmic basis for
  • Coverage downsampling/lossy compression
  • Error identification (sublinear)
  • Error correction
  • Variant calling?
• Enables analyses that would otherwise be hard or impossible.
• Most assembly doable in cloud or on commodity hardware;
The real challenge: understanding
• We have gotten distracted by shiny toys: sequencing!! Data!!
• Data is now plentiful! But:
  • We typically have no knowledge of what > 50% of an environmental metagenome “means”, functionally.
  • Most data is not openly available, so we cannot mine correlations across data sets.
  • Most computational science is not reproducible, so I can't reuse other people's tools or approaches.
Data intensive biology & hypothesis generation
• My interest in biological data is to enable better hypothesis generation.
My interests
• Open source ecosystem of analysis tools.
• Loosely coupled APIs for querying databases.
• Publishing reproducible and reusable analyses, openly.
• Education and training.

“Platform perspective”
Practical implications of diginorm
• Data is (essentially) free;
• For some problems, analysis is now cheaper than data gathering (i.e. essentially free);
• …plus, we can run most of our approaches in the cloud.
khmer-protocols
• Effort to provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
• Open, versioned, forkable, citable.

Pipeline: Read cleaning => Diginorm => Assembly => Annotation => RSEM differential expression
IPython Notebook: data + code => IPython Notebook
We practice open science!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog ('titus brown blog')
• Twitter: @ctitusbrown
• Grants on Lab Web site: http://ged.msu.edu/research.html
• Preprints: on arXiv, q-bio: 'diginorm arxiv'
Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Camille Scott
• Jordan Fish
• Michael Crusoe
• Leigh Sheneman

Collaborators
• Jim Tiedje, MSU
• Susannah Tringe and Janet Jansson (JGI, LBNL)
• Erich Schwarz, Caltech / Cornell
• Paul Sternberg, Caltech
• Robin Gasser, U. Melbourne
• Weiming Li, MSU
• Shana Goffredi, Occidental

Funding
USDA NIFA; NSF IOS; NIH; BEACON.

Weitere ähnliche Inhalte

Was ist angesagt?

2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-researchc.titus.brown
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotesc.titus.brown
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011c.titus.brown
 
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea maysParallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea maysjrossibarra
 
Genome size and adaptation in plants
Genome size and adaptation in plantsGenome size and adaptation in plants
Genome size and adaptation in plantsjrossibarra
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-datac.titus.brown
 
Open Tree of Life at Evolution 2014
Open Tree of Life at Evolution 2014Open Tree of Life at Evolution 2014
Open Tree of Life at Evolution 2014Karen Cranston
 
Adaptive evolution of genome size across altitudinal clines in maize
Adaptive evolution of genome size across altitudinal clines in maizeAdaptive evolution of genome size across altitudinal clines in maize
Adaptive evolution of genome size across altitudinal clines in maizejrossibarra
 
Revised Bio 1wfx Recombinant D N A
Revised  Bio 1wfx   Recombinant  D N ARevised  Bio 1wfx   Recombinant  D N A
Revised Bio 1wfx Recombinant D N AHans Lim
 
Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Denis C. Bauer
 
Graphs are Feeding the World
Graphs are Feeding the WorldGraphs are Feeding the World
Graphs are Feeding the WorldTim Williamson
 
Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Monica Munoz-Torres
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Keith Bradnam
 
Variant (SNPs/Indels) calling in DNA sequences, Part 2
Variant (SNPs/Indels) calling in DNA sequences, Part 2Variant (SNPs/Indels) calling in DNA sequences, Part 2
Variant (SNPs/Indels) calling in DNA sequences, Part 2Denis C. Bauer
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinarc.titus.brown
 
Next Generation Cancer Data Discovery, Access, and Integration Using Prizms a...
Next Generation Cancer Data Discovery, Access, and Integration Using Prizms a...Next Generation Cancer Data Discovery, Access, and Integration Using Prizms a...
Next Generation Cancer Data Discovery, Access, and Integration Using Prizms a...Jim McCusker
 
Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...Surya Saha
 

Was ist angesagt? (20)

2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
2014 davis-talk
2014 davis-talk2014 davis-talk
2014 davis-talk
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
 
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea maysParallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
 
Genome size and adaptation in plants
Genome size and adaptation in plantsGenome size and adaptation in plants
Genome size and adaptation in plants
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
 
Open Tree of Life at Evolution 2014
Open Tree of Life at Evolution 2014Open Tree of Life at Evolution 2014
Open Tree of Life at Evolution 2014
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
Adaptive evolution of genome size across altitudinal clines in maize
Adaptive evolution of genome size across altitudinal clines in maizeAdaptive evolution of genome size across altitudinal clines in maize
Adaptive evolution of genome size across altitudinal clines in maize
 
Revised Bio 1wfx Recombinant D N A
Revised  Bio 1wfx   Recombinant  D N ARevised  Bio 1wfx   Recombinant  D N A
Revised Bio 1wfx Recombinant D N A
 
Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1
 
Graphs are Feeding the World
Graphs are Feeding the WorldGraphs are Feeding the World
Graphs are Feeding the World
 
Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...
 
Variant (SNPs/Indels) calling in DNA sequences, Part 2
Variant (SNPs/Indels) calling in DNA sequences, Part 2Variant (SNPs/Indels) calling in DNA sequences, Part 2
Variant (SNPs/Indels) calling in DNA sequences, Part 2
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
David
DavidDavid
David
 
Next Generation Cancer Data Discovery, Access, and Integration Using Prizms a...
Next Generation Cancer Data Discovery, Access, and Integration Using Prizms a...Next Generation Cancer Data Discovery, Access, and Integration Using Prizms a...
Next Generation Cancer Data Discovery, Access, and Integration Using Prizms a...
 
Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...
 

Andere mochten auch

2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcastc.titus.brown
 
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
[Infographic Korea Edition] The CEO Reputation Premium - Weber ShandwickWeber Shandwick Korea
 
Engage 2013 - Webtrends Streams - Technical
Engage 2013 - Webtrends Streams - TechnicalEngage 2013 - Webtrends Streams - Technical
Engage 2013 - Webtrends Streams - TechnicalWebtrends
 
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...Gina Montgomery, V-TSP
 
The Art and Science of Pricing: Simple tools to align price with value (Rober...
The Art and Science of Pricing: Simple tools to align price with value (Rober...The Art and Science of Pricing: Simple tools to align price with value (Rober...
The Art and Science of Pricing: Simple tools to align price with value (Rober...ProductCamp Boston
 
ProductCamp Boston 2016 Opening Slides
ProductCamp Boston 2016 Opening SlidesProductCamp Boston 2016 Opening Slides
ProductCamp Boston 2016 Opening SlidesProductCamp Boston
 
Moments Matter - Technology Transforming Consumer Behavior
Moments Matter - Technology Transforming Consumer BehaviorMoments Matter - Technology Transforming Consumer Behavior
Moments Matter - Technology Transforming Consumer BehaviorKyle Lacy
 
Cost effective azure
Cost effective azureCost effective azure
Cost effective azureGal Kogman
 
Engage in effective collaboration with Azure AD B2B
Engage in effective collaboration with Azure AD B2BEngage in effective collaboration with Azure AD B2B
Engage in effective collaboration with Azure AD B2BAnco Stuij
 
SharePoint 2013 and the Consumerization of I.T.
SharePoint 2013 and the Consumerization of I.T.SharePoint 2013 and the Consumerization of I.T.
SharePoint 2013 and the Consumerization of I.T.Gina Montgomery, V-TSP
 

Andere mochten auch (13)

2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
 
Engage 2013 - Webtrends Streams - Technical
Engage 2013 - Webtrends Streams - TechnicalEngage 2013 - Webtrends Streams - Technical
Engage 2013 - Webtrends Streams - Technical
 
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
 
The Art and Science of Pricing: Simple tools to align price with value (Rober...
The Art and Science of Pricing: Simple tools to align price with value (Rober...The Art and Science of Pricing: Simple tools to align price with value (Rober...
The Art and Science of Pricing: Simple tools to align price with value (Rober...
 
Internal, External and Digital Presence of the CEO is becoming more and more ...
Internal, External and Digital Presence of the CEO is becoming more and more ...Internal, External and Digital Presence of the CEO is becoming more and more ...
Internal, External and Digital Presence of the CEO is becoming more and more ...
 
ProductCamp Boston 2016 Opening Slides
ProductCamp Boston 2016 Opening SlidesProductCamp Boston 2016 Opening Slides
ProductCamp Boston 2016 Opening Slides
 
Moments Matter - Technology Transforming Consumer Behavior
Moments Matter - Technology Transforming Consumer BehaviorMoments Matter - Technology Transforming Consumer Behavior
Moments Matter - Technology Transforming Consumer Behavior
 
actividad 1.4
actividad 1.4actividad 1.4
actividad 1.4
 
Cost effective azure
Cost effective azureCost effective azure
Cost effective azure
 
Engage in effective collaboration with Azure AD B2B
Engage in effective collaboration with Azure AD B2BEngage in effective collaboration with Azure AD B2B
Engage in effective collaboration with Azure AD B2B
 
John saraguro diapositiva
John saraguro diapositivaJohn saraguro diapositiva
John saraguro diapositiva
 
SharePoint 2013 and the Consumerization of I.T.
SharePoint 2013 and the Consumerization of I.T.SharePoint 2013 and the Consumerization of I.T.
SharePoint 2013 and the Consumerization of I.T.
 

Ähnlich wie 2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Data, and Biology"

2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grcc.titus.brown
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptxc.titus.brown
 
CRI - Teaching Through Research - John Jungck - BioQuest
CRI - Teaching Through Research - John Jungck - BioQuestCRI - Teaching Through Research - John Jungck - BioQuest
CRI - Teaching Through Research - John Jungck - BioQuestLeadershipProgram
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizonac.titus.brown
 
BEACON 101: Sequencing tech
BEACON 101: Sequencing techBEACON 101: Sequencing tech
BEACON 101: Sequencing techc.titus.brown
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynotec.titus.brown
 
Comprehensive Exam Slides 11/13/2013
Comprehensive Exam Slides 11/13/2013Comprehensive Exam Slides 11/13/2013
Comprehensive Exam Slides 11/13/2013Qingpeng "Q.P." Zhang
 
Carleton Biology talk : March 2014
Carleton Biology talk : March 2014Carleton Biology talk : March 2014
Carleton Biology talk : March 2014Karen Cranston
 
Dan Graur - Can the human genome be 100% functional?
Dan Graur - Can the human genome be 100% functional?Dan Graur - Can the human genome be 100% functional?
Dan Graur - Can the human genome be 100% functional?Andrei Afanasiev
 
Novel Computational Approaches to Investigate Microbial Diversity
Novel Computational Approaches to Investigate Microbial DiversityNovel Computational Approaches to Investigate Microbial Diversity
Novel Computational Approaches to Investigate Microbial DiversityQingpeng "Q.P." Zhang
 
ppgardner-lecture03-genomesize-complexity.pdf
ppgardner-lecture03-genomesize-complexity.pdfppgardner-lecture03-genomesize-complexity.pdf
ppgardner-lecture03-genomesize-complexity.pdfPaul Gardner
 
Theory and practice of graphical population analysis
Theory and practice of graphical population analysisTheory and practice of graphical population analysis
Theory and practice of graphical population analysisGenome Reference Consortium
 

Ähnlich wie 2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Data, and Biology" (20)

2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx
 
CRI - Teaching Through Research - John Jungck - BioQuest
CRI - Teaching Through Research - John Jungck - BioQuestCRI - Teaching Through Research - John Jungck - BioQuest
CRI - Teaching Through Research - John Jungck - BioQuest
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
 
BEACON 101: Sequencing tech
BEACON 101: Sequencing techBEACON 101: Sequencing tech
BEACON 101: Sequencing tech
 
Basics of Genome Assembly
Basics of Genome Assembly Basics of Genome Assembly
Basics of Genome Assembly
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
Big Data Field Museum
Big Data Field MuseumBig Data Field Museum
Big Data Field Museum
 
Comprehensive Exam Slides 11/13/2013
Comprehensive Exam Slides 11/13/2013Comprehensive Exam Slides 11/13/2013
Comprehensive Exam Slides 11/13/2013
 
Carleton Biology talk : March 2014
Carleton Biology talk : March 2014Carleton Biology talk : March 2014
Carleton Biology talk : March 2014
 
Dan Graur - Can the human genome be 100% functional?
Dan Graur - Can the human genome be 100% functional?Dan Graur - Can the human genome be 100% functional?
Dan Graur - Can the human genome be 100% functional?
 
Novel Computational Approaches to Investigate Microbial Diversity
Novel Computational Approaches to Investigate Microbial DiversityNovel Computational Approaches to Investigate Microbial Diversity
Novel Computational Approaches to Investigate Microbial Diversity
 
ppgardner-lecture03-genomesize-complexity.pdf
ppgardner-lecture03-genomesize-complexity.pdfppgardner-lecture03-genomesize-complexity.pdf
ppgardner-lecture03-genomesize-complexity.pdf
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
Theory and practice of graphical population analysis
Theory and practice of graphical population analysisTheory and practice of graphical population analysis
Theory and practice of graphical population analysis
 

2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Data, and Biology"

  • 1. Like the Dog that Caught the Bus: Sequencing, Big Data, and Biology C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University Jan 2014 ctb@msu.edu
  • 2. 20 years in…
      - Started working in Dr. Koonin's group in 1993;
      - First publication was submitted almost exactly 20 years ago!
  • 3. Like the Dog that Caught the Bus: Sequencing, Big Data, and Biology C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University Jan 2014 ctb@msu.edu
  • 4. Analogy: we seek an understanding of humanity via our libraries. http://eofdreams.com/library.html;
  • 5. But, our only observation tool is shredding a mixture of all of the books & digitizing the shreds. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • 6. Points:
      - Lots of fragments needed! (Deep sampling.)
      - Having read and understood some books will help quite a bit. (Prior knowledge.)
      - Rare books will be harder to reconstruct than common books.
      - Errors in the OCR process matter quite a bit.
      - The more different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books.
      - A categorization system would be an invaluable but not infallible guide to book topics.
      - Understanding the language would help you validate & understand the books.
  • 7. Biological analog: shotgun metagenomics
      - Collect samples;
      - Extract DNA;
      - Feed into sequencer;
      - Computationally analyze.
    “Sequence it all and let the bioinformaticians sort it out”
    Wikipedia: Environmental shotgun sequencing.png
  • 8. Investigating soil microbial communities
      - 95% or more of soil microbes cannot be cultured in lab.
      - Very little transport in soil and sediment => slow mixing rates.
      - Estimates of immense diversity:
          - Billions of microbial cells per gram of soil.
          - Million+ microbial species per gram of soil (Gans et al., 2005)
          - One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory)
  • 9. “By 'soil' we understand (Vil'yams, 1931) a loose surface layer of earth capable of yielding plant crops. In the physical sense the soil represents a complex disperse system consisting of three phases: solid, liquid, and gaseous.” Microbes live in & on:
      - Surfaces of aggregate particles;
      - Pores within microaggregates;
    N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS, http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
  • 10. Questions to address
      - Role of soil microbes in nutrient cycling:
          - How does agricultural soil differ from native soil?
          - How do soil microbial communities respond to climate perturbation?
      - Genome-level questions:
          - What kind of strain-level heterogeneity is present in the population?
          - What are the phage and viral populations & dynamics?
          - What species are where, and how much is shared between different geographical locations?
  • 11. Must use culture independent and metagenomic approaches
      - Many reasons why you can't or don't want to culture: syntrophic relationships; niche-specificity or unknown physiology; dormant microbes; abundance within communities.
      - If you want to get at underlying function, 16S analysis alone is not sufficient.
    Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.
  • 12. Shotgun metagenomics
      - Collect samples;
      - Extract DNA;
      - Feed into sequencer;
      - Computationally analyze.
    “Sequence it all and let the bioinformaticians sort it out”
    Wikipedia: Environmental shotgun sequencing.png
  • 13. Computational reconstruction of (meta)genomic content. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • 14. Points:
      - Lots of fragments needed! (Deep sampling.)
      - Having read and understood some books will help quite a bit. (Reference genomes.)
      - Rare books will be harder to reconstruct than common books.
      - Errors in the OCR process matter quite a bit. (Sequencing error.)
      - The more different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. (We don't understand most microbial function.)
      - A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.)
      - Understanding the language would help you validate
  • 15. Great Prairie Grand Challenge --SAMPLING LOCATIONS 2008
  • 16. A “Grand Challenge” dataset (DOE/JGI). Total: 1,846 Gbp soil metagenome.
    [Bar chart: basepairs of sequencing (Gbp) per sample, sequenced on GAII and HiSeq. Samples: Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass. Reference levels: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp.]
  • 17. Why do we need so much data?!
      - 20-40x coverage is necessary; 100x is ~sufficient.
      - Mixed population sampling => sensitivity driven by lowest abundance.
      - For example, for E. coli in 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000, or 500 Gbp of sequence! (For soil, estimate is 50 Tbp.)
      - Sequencing is straightforward; data analysis is not.
    “$1000 genome with $1m analysis”
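
The arithmetic behind the E. coli example on slide 17 can be checked in a few lines. This is a minimal sketch: the function name `required_sequencing_bp` and its parameters are illustrative assumptions, and it assumes reads are sampled in proportion to each genome's share of the DNA.

```python
def required_sequencing_bp(genome_size_bp, target_coverage, dilution_factor):
    """Total bases to sequence so that a genome present at 1/dilution_factor
    of the sample still reaches target_coverage on average."""
    return genome_size_bp * target_coverage * dilution_factor


# Slide 17 example: 5 Mbp genome, 100x coverage, 1/1000 dilution.
total_bp = required_sequencing_bp(5_000_000, 100, 1000)
print(total_bp // 10**9, "Gbp")
```

The rarest genome of interest sets the sequencing budget, which is why sensitivity is driven by the lowest abundance.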
  • 18. Great Prairie Grand Challenge goals
      - How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest data sets thus far.)
      - What can we learn about soil from looking at the reconstructed metagenome? (See list of questions.)
  • 19. Great Prairie Grand Challenge goals
      - How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest data sets thus far.)
      - What can we learn about soil from looking at the reconstructed metagenome? (See list of questions.)
    (For complex ecological and evolutionary systems, we're just starting to get past the first question. More on that later.)
  • 20. So, we want to go from raw data (in each record, the first line is the read name and the fourth line is the quality score):
      @SRR606249.17/1
      GAGTATGTTCTCATAGAGGTTGGTANNNNT
      +
      B@BDDFFFHHHHHJIJJJJGHIJHJ####1
      @SRR606249.17/2
      CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
      +
      CCCF#################22@GHIJJJ
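
To make the raw-data layout on slide 20 concrete, here is a small sketch that reads such records. It assumes plain four-line FASTQ records and the common Phred+33 quality encoding; neither is stated on the slide, and `read_fastq` is a hypothetical helper rather than a function from any tool in the talk.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        # Assumed Phred+33 encoding: quality = ASCII code minus 33.
        yield name[1:], seq, [ord(c) - 33 for c in qual]


example = io.StringIO(
    "@SRR606249.17/1\n"
    "GAGTATGTTCTCATAGAGGTTGGTANNNNT\n"
    "+\n"
    "B@BDDFFFHHHHHJIJJJJGHIJHJ####1\n"
)
records = list(read_fastq(example))
name, seq, quals = records[0]
```

Under that encoding, the `N` bases in this read line up with `#` characters, which decode to quality 2, the lowest values in the record.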
  • 21. …to “assembled” original sequence. UMD assembly primer (cbcb.umd.edu)
  • 22. De Bruijn graphs – assemble on overlaps J.R. Miller et al. / Genomics (2010)
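
A toy version of the idea on slide 22: break reads into k-mers, link each k-mer's (k-1)-base prefix to its suffix, and walk the unambiguous path. This is an illustrative sketch, not the algorithm of any particular assembler; the function names and the tiny reads are made up for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Two reads that overlap by four bases (GCGT).
graph = build_de_bruijn(["ATGGCGT", "GCGTACG"], k=4)
contig = walk(graph, "ATG")
```

The reads are never compared to each other directly; the overlap is found because both contribute the same k-mer to the graph.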
  • 23. Two problems: (1) variation/error. Single nucleotide variations cause long branches; they don't rejoin quickly.
  • 24. Two problems: (2) No graph locality. Assembly is inherently an all-by-all process. There is no good way to subdivide the reads without potentially missing a key connection.
  • 25. Assembly graphs scale with data size, not information. Conway T C, Bromage A J, Bioinformatics 2011;27:479-486
  • 26. Why do k-mer assemblers scale badly?
      - Memory usage ~ “real” variation + number of errors
      - Number of errors ~ size of data set
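
Slide 26's claim can be seen on a toy example: a single substitution error in a read creates up to k k-mers that do not occur in the true sequence, and every one of them costs memory in a k-mer graph. The sequences below are invented for illustration.

```python
def kmer_set(seq, k):
    """All distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 4
genome = "ATGGCGTACGTTAGC"
# Same sequence with one substitution error (A -> C at 0-based position 7).
read_with_error = "ATGGCGTCCGTTAGC"

true_kmers = kmer_set(genome, K)
novel_kmers = kmer_set(read_with_error, K) - true_kmers
print(sorted(novel_kmers))
```

Because the number of errors grows with the number of reads, the set of erroneous k-mers keeps growing with sequencing depth even after the real sequence has been fully sampled.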
  • 27. Practical memory measurements Velvet measurements (Adina Howe)
  • 28. The Problem
      - We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware.
      - No locality to the data in terms of graph structure.
      - Since ~2008:
          - The field has engaged in lots of engineering optimization…
          - …but the data generation rate has consistently outstripped Moore's Law.
  • 29. Our two solutions. 1. Subdivide data 2. Discard redundant data.
  • 30. 1. Data partitioning (a computational version of cell sorting) Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences. “Divide and conquer” Memory-efficient implementation helps to scale assembly. Pell et al., 2012, PNAS
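
The connectivity idea on slide 30 can be sketched with a union-find structure: reads that share any k-mer end up in the same bin. This is only a stand-in for illustration. It keeps an exact dictionary of k-mers, which is an assumption made for clarity and not the memory-efficient graph representation described in Pell et al., 2012; `partition_reads` is a hypothetical name.

```python
def partition_reads(reads, k):
    """Group reads into bins of transitively k-mer-connected reads."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the bin representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in first_owner:
                parent[find(idx)] = find(first_owner[kmer])
            else:
                first_owner[kmer] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Two made-up "species": the first two reads overlap, as do the last two.
bins = partition_reads(["ATGGCGT", "GCGTACG", "CCCCAAAA", "AAAATTTT"], k=4)
```

Each bin can then be assembled independently, which is the divide-and-conquer step the slide refers to.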
  • 31. Our two solutions. 1. Subdivide data (~20x scaling; 2 years to develop; 100x data increase) 2. Discard redundant data.
  • 32. 2. Approach: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Diversity vs richness. The high-coverage reads in sample A are unnecessary for assembly, and,
  • 33. Shotgun sequencing and coverage “Coverage” is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  • 34. Most shotgun data is redundant. You only need 5-10 reads at a locus to assemble or call (diploid) SNPs… but because sampling is random, and you need 5-10 reads at every locus, you
  • 41. Coverage estimation. If you can estimate the coverage of a read in a data set without a reference, this is straightforward:
      for read in dataset:
          if estimated_coverage(read) < CUTOFF:
              save(read)
    (Trick: the read coverage estimator needs to be error-tolerant.)
  • 42. The median k-mer count in a read is a good approximate estimator of coverage. This gives us a reference-free measure of coverage.
  • 43. Diginorm builds a De Bruijn graph & then downsamples based on observed coverage. Corresponds exactly to underlying abstraction used for assembly; retains graph structure.
  • 44. Digital normalization approach
      - Is streaming and single pass: looks at each read only once;
      - Does not “collect” the majority of errors;
      - Keeps all low-coverage reads;
      - Smooths out coverage of regions.
    …raw data can be retained for later abundance estimation.
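
Slides 41 to 44 can be put together in a short runnable sketch: estimate a read's coverage as the median count of its k-mers seen so far, keep the read only if that estimate is below a cutoff, and count the k-mers of kept reads only. The exact `Counter`, the small `k`, and the cutoff value are assumptions chosen for readability; this is not the khmer implementation.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All k-mers of seq, in order (with repeats)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=4, cutoff=5):
    """Single-pass filter: yield reads whose estimated coverage is below cutoff."""
    counts = Counter()  # assumption: exact counts, fine for a toy data set
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        # The median is tolerant of a few erroneous (low-count) k-mers.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read


abundant = "ATGGCGTACGTTAGC"
rare = "CCCCAAAATTTTGGGG"
kept = list(digital_normalization([abundant] * 50 + [rare]))
```

With these inputs, 5 of the 50 copies of the abundant read are kept and the single rare read is kept as well, which is the coverage smoothing behavior described on slide 44.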
  • 45. Contig assembly now scales with richness (information), not diversity (data). Most samples can be assembled in < 50 GB of memory.
  • 46. Diginorm is widely useful:
      1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341)
      2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep)
      3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al., 2013; pmid
  • 47. Diginorm is “lossy compression”
      - Nearly perfect from an information theoretic perspective:
          - Discards 95% or more of data for genomes.
          - Loses < 0.02% of information.
  • 48. Prospective: sequencing tumor cells
      - Goal: phylogenetically reconstruct causal “driver mutations” in the face of passenger mutations.
      - 1000 cells x 3 Gbp x 20x coverage: 60 Tbp of sequence.
      - Most of this data will be redundant and not useful.
      - Developing diginorm-based algorithms to eliminate data while retaining variant information.
  • 49. Where are we taking this?
      - Streaming online algorithms only look at data ~once.
      - Diginorm is streaming, online…
      - Conceptually, can move many aspects of sequence analysis into streaming mode.
    => Extraordinary potential for computational efficiency.
  • 50. => Streaming, online variant calling. Single pass, reference free, tunable, streaming online variant calling. Potentially quite clinically useful.
  • 51. What about the assembly results for Iowa corn and prairie??
      Total assembly | Total contigs (> 300 bp) | % reads assembled | Predicted protein coding
      2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill
      3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill
    Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. (Adina Howe)
  • 52. Resulting contigs are low coverage. Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
  • 53. So, for soil:
      - We really do need more data;
      - But at least now we can assemble what we already have.
      - Estimate required sequencing depth at 50 Tbp;
      - Now also have 2-8 Tbp from Amazon Rain Forest Microbial Observatory.
      - …still not saturated coverage, but getting closer.
    But, diginorm approach turns out to be widely useful.
  • 54. Biogeography: Iowa sample overlap? Corn and prairie De Bruijn graphs have 51% overlap. Suggests that at greater depth, samples may have similar geno
  • 55. Concluding thoughts
      - Empirically effective tools, in reasonably wide use.
      - Diginorm provides streaming, online algorithmic basis for:
          - Coverage downsampling/lossy compression
          - Error identification (sublinear)
          - Error correction
          - Variant calling?
      - Enables analyses that would otherwise be hard or impossible.
      - Most assembly doable in cloud or on commodity hardware;
  • 56. The real challenge: understanding
      - We have gotten distracted by shiny toys: sequencing!! Data!!
      - Data is now plentiful! But:
          - We typically have no knowledge of what > 50% of an environmental metagenome “means”, functionally.
          - Most data is not openly available, so we cannot mine correlations across data sets.
          - Most computational science is not reproducible, so I can't reuse other people's tools or approaches.
  • 57. Data intensive biology & hypothesis generation
      - My interest in biological data is to enable better hypothesis generation.
  • 58. My interests
      - Open source ecosystem of analysis tools.
      - Loosely coupled APIs for querying databases.
      - Publishing reproducible and reusable analyses, openly.
      - Education and training.
    “Platform perspective”
  • 59. Practical implications of diginorm
      - Data is (essentially) free;
      - For some problems, analysis is now cheaper than data gathering (i.e. essentially free);
      - …plus, we can run most of our approaches in the cloud.
  • 60. khmer-protocols
      - Effort to provide standard “cheap” assembly protocols for the cloud.
      - Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
      - Open, versioned, forkable, citable.
    [Diagram labels: Read cleaning; Diginorm; Assembly; Annotation; RSEM differential expression]
  • 61.
  • 62. IPython Notebook: data + code => IPython Notebook
  • 63. My interests
      - Open source ecosystem of analysis tools.
      - Loosely coupled APIs for querying databases.
      - Publishing reproducible and reusable analyses, openly.
      - Education and training.
    “Platform perspective”
  • 64. We practice open science! Everything discussed here:
      - Code: github.com/ged-lab/ ; BSD license
      - Blog: http://ivory.idyll.org/blog ('titus brown blog')
      - Twitter: @ctitusbrown
      - Grants on Lab Web site: http://ged.msu.edu/research.html
      - Preprints: on arXiv, q-bio: 'diginorm arxiv'
  • 65. Acknowledgements
      - Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman
      - Collaborators: Jim Tiedje, MSU; Susannah Tringe and Janet Jansson (JGI, LBNL); Erich Schwarz, Caltech / Cornell; Paul Sternberg, Caltech; Robin Gasser, U. Melbourne; Weiming Li, MSU; Shana Goffredi, Occidental
      - Funding: USDA NIFA; NSF IOS; NIH; BEACON.

Editor's notes

  1. @@ change slide up => more complex diversity
  2. Taking advantage of structure within read
  3. Note that any such measure will do.
  4. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  5. Copy slide to end.