C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
ctb@msu.edu
HMP – Metagenome assembly
Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Jordan Fish
• Chris Welcher
Collaborators
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI
Funding
USDA NIFA; NSF IOS;
BEACON.
Open, online science
All of the software and approaches I’m talking about
today are available:
Assembling large, complex metagenomes
arxiv.org/abs/1212.2832
khmer software:
github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
Illumina! De Bruijn graphs!
• Today I’ll be talking about Illumina data
sets, and de Bruijn graph assembly (k-mer
assembly).
• This is because my research has largely
focused on scaling to large data sets (soil
metagenomics!) and Illumina is the real
scaling challenge.
Assembler heuristics
• In order to build assemblies, each assembler
makes choices – uses heuristics – to reach a
conclusion.
• These heuristics may not be appropriate for your
sample!
– High polymorphism?
– Mixed population vs clonal?
– Genomic vs metagenomic vs mRNA
– Low coverage drives differences in assembly.
Evaluating assembly
Reads - noisy observations of some genome.
Assembler (a Big Black Box)
Predicted genome.
Evaluating correctness of metagenomes is still undiscovered country.
Shotgun sequencing
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
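As a rough illustration of the definition above, average coverage can be estimated from the total number of sequenced bases divided by the genome size. This is a minimal sketch; the function name and the read counts are hypothetical and not taken from the talk.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads overlapping each base of the genome."""
    return num_reads * read_length / genome_size


# Hypothetical numbers: 500,000 reads of 100 bp from a 5 Mbp genome.
print(mean_coverage(500_000, 100, 5_000_000))  # 10.0
```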
Reducing to k-mers overlaps
Note that k-mer abundance is not properly represented here! Each
blue k-mer will be present around 10 times.
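To make "reducing to k-mers" concrete, here is a small sketch that breaks one read into its overlapping k-mers. The helper name `kmers` and the example read are made up for illustration; real tools such as khmer also handle reverse complements, which this sketch ignores.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


read = "ATGGCGTGCA"
# Each k-mer overlaps the next one by k - 1 bases.
print(kmers(read, 4))
# ['ATGG', 'TGGC', 'GGCG', 'GCGT', 'CGTG', 'GTGC', 'TGCA']
```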
Errors create new k-mers
Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
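The "~k new k-mers per error" point can be checked on a toy example. In this sketch (sequences and names are invented for illustration), a single substitution in the middle of a read changes every k-mer that spans the changed base, so up to k previously unseen k-mers appear.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


k = 5
true_read = "ACGTTGCATGGCAATCGGAT"
error_read = "ACGTTGCATGTCAATCGGAT"  # G -> T substitution at index 10

# k-mers present only in the erroneous read.
novel = kmer_set(error_read, k) - kmer_set(true_read, k)
print(len(novel))  # 5, i.e. k new k-mers from one error
```

An error near the end of a read is spanned by fewer k-mers, which is why the count is approximately k rather than exactly k.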
So, k-mer abundance plots are
mixtures of true and false k-mers.
Counting k-mers - histograms
Low-abundance peak (errors)
Counting k-mers - histograms
High-abundance peak
(true k-mers)
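A k-mer abundance histogram like the ones on these slides can be sketched in a few lines. This toy version uses an exact in-memory counter and invented reads (ten copies of a "true" read plus one read with a single error); it is only meant to show where the two peaks come from, not how khmer counts k-mers at scale.

```python
from collections import Counter


def abundance_histogram(reads: list[str], k: int) -> Counter:
    """Map abundance -> number of distinct k-mers seen that many times."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


true_read = "ACGTTGCATGGCAATCGGAT"
error_read = "ACGTTGCATGTCAATCGGAT"  # one substitution
reads = [true_read] * 10 + [error_read]

hist = abundance_histogram(reads, k=5)
for abundance in sorted(hist):
    print(abundance, hist[abundance])
# 1 5    <- low-abundance peak: k-mers created by the error
# 10 5   <- true k-mers that the error read failed to cover
# 11 11  <- true k-mers seen in every read
```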
Approach: Digital normalization
(a computational version of library normalization)
Suppose two species, A and B, are present
at a ratio of 10 (A) to 1 (B). To get 10x
coverage of B you need to get 100x of A!
Overkill!!
This 100x will consume disk
space and, because of
errors, memory.
We can discard it for you…
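The arithmetic behind the 10:1 example can be written out as a short sketch. The function name and the abundance dictionary are hypothetical; the only assumption is that coverage scales linearly with relative abundance.

```python
def coverage_needed(target: float, abundances: dict[str, float]) -> dict[str, float]:
    """Coverage each member reaches when the rarest one is sequenced to `target`."""
    rarest = min(abundances.values())
    return {name: target * a / rarest for name, a in abundances.items()}


# A is 10 times as abundant as B; we want 10x coverage of B.
print(coverage_needed(10, {"A": 10, "B": 1}))
# {'A': 100.0, 'B': 10.0}
```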
Digital normalization
Digital normalization approach
Diginorm, a digital analog to cDNA library normalization:
• Is reference free;
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
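The properties listed above can be illustrated with a minimal sketch of the idea: estimate each read's coverage as the median abundance of its k-mers among reads kept so far, and keep the read only if that estimate is below a cutoff. This is a simplification under stated assumptions: it uses an exact `Counter` rather than the memory-efficient probabilistic counting that khmer uses, ignores reverse complements, and the parameter names `k` and `cutoff` are chosen here for illustration.

```python
from collections import Counter
from statistics import median
from typing import Iterable, Iterator


def digital_normalization(reads: Iterable[str], k: int = 20, cutoff: int = 20) -> Iterator[str]:
    """Single pass: yield reads whose estimated coverage is still below cutoff."""
    counts: Counter = Counter()
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k
        # Median k-mer abundance is the reference-free coverage estimate.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads are counted
            yield read


# Toy data: one region sequenced 100 times, another only once.
abundant = "ACGTTGCATGGCAATCGGAT"
rare = "TTTTCCCCGGGGAAAATTCC"
kept = list(digital_normalization([abundant] * 100 + [rare], k=5, cutoff=5))
print(len(kept))  # 6: five copies of the abundant read plus the rare read
```

Because only kept reads update the counts, redundant reads (and the erroneous k-mers they would have contributed) never enter the table.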
Coverage before digital normalization:
(MDA amplified)
Coverage after digital normalization:
Normalizes coverage
Discards redundancy
Eliminates the majority of
errors
Scales assembly dramatically.
Assembly is 98% identical.
In our experience…
• Digital normalization produces “good”
metagenome assemblies.
• Smooths out abundance variation, strain
variation.
• Reduces computational requirements for
assembly.
• It also kinda makes sense :)
Additional Approach for
Metagenomes: Data partitioning
(a computational version of cell sorting)
Split reads into “bins”
belonging to different
source species.
Can do this based almost
entirely on connectivity
of sequences.
“Divide and conquer”
Memory-efficient
implementation helps
to scale assembly.
Pell et al., 2012, PNAS
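Partitioning by connectivity can be sketched with a union-find structure: any two reads that share a k-mer are placed in the same bin. This is a toy stand-in, not the method from Pell et al., which relies on a memory-efficient graph representation; the function name and the example reads are invented for illustration.

```python
def partition_reads(reads: list[str], k: int) -> list[list[str]]:
    """Group reads into bins of reads connected through shared k-mers."""
    parent = list(range(len(reads)))

    def find(i: int) -> int:
        # Find the bin representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen: dict[str, int] = {}  # k-mer -> index of first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])  # merge the two bins
            else:
                first_seen[kmer] = idx

    bins: dict[int, list[str]] = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Two overlapping reads from each of two unrelated toy "genomes".
reads = [
    "ACGTTGCATGGC", "CATGGCAATCGGAT",   # genome A
    "TTTTCCCCGGGG", "CCCGGGGAAAATTCC",  # genome B
]
partitions = partition_reads(reads, k=5)
print(len(partitions))  # 2
```

Each resulting bin can then be assembled separately, which is the "divide and conquer" step.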
Partitioning separates reads by genome.
Strain variants co-partition.
When computationally spiking HMP mock data with one E. coli
genome (left) or multiple E. coli strains (right), majority of partitions
contain reads from only a single genome (blue) vs multi-genome
partitions (green).
Partitions containing spiked data are indicated with a *. (Adina Howe)
Conclusions re strain
variation/chimerism (previous slide)
• When spiking in intentionally complex
mixtures, only a small fraction of partitions
are chimeric.
• This means that only a small fraction of
contigs could be chimeric.
• Strain variants will almost certainly assemble
together.
• Can separate on abundance.
See Sharon et al., 2013, PMID 22936250, for Banfield work on this.
Looking at k-mer histograms…
Diginorm shifts left
Partitioning picks out diff genomes
Error correction “fixes” k-mers
Jason Pell
Our experience
• Our metagenome assemblies compare well with
others, but we have little in the way of ground
truth with which to evaluate.
• Scaffold assembly is tricky; we believe in contig
assembly for metagenomes, but not scaffolding.
• See arXiv paper, “Assembling large, complex
metagenomes”, for our suggested pipeline and
statistics & references.
Metagenomic assemblies are highly variable
Adina Howe et al., arXiv 1212.0159
High coverage is needed.
Low coverage is the dominant problem blocking assembly of
your soil metagenome.
Strain variation (soil)
[Figure: top two allele frequencies vs. position within contig]
Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.
Can measure by read mapping.
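One way to picture the "top two allele frequencies" measurement: once reads are mapped back to a contig, each position has a column of observed bases, and the two most common bases give the two frequencies. The sketch below assumes the column has already been extracted as a string of base calls; the function name and example column are hypothetical.

```python
from collections import Counter


def top_two_allele_frequencies(column_bases: str) -> tuple[float, float]:
    """Frequencies of the two most common bases at one contig position."""
    counts = Counter(column_bases)
    total = sum(counts.values())
    top = [count / total for _, count in counts.most_common(2)]
    top += [0.0] * (2 - len(top))  # pad when fewer than two alleles are seen
    return top[0], top[1]


# Ten mapped reads cover this position; one carries a variant base.
print(top_two_allele_frequencies("AAAAAAAAAG"))  # (0.9, 0.1)
```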
Overconfident predictions
• We can assemble virtually anything but soil ;).
– Genomes, transcriptomes, MDA, mixtures, etc.
– Repeat resolution will be fundamentally limited by
sequencing technology (insert size; sampling depth)
• Strain variation confuses assembly, but does not
prevent useful results.
– Diginorm is a systematic strategy to enable assembly.
– Banfield has shown how to deconvolve strains at
differential abundance.
– Kostas K. results suggest that there will be a species gap
sufficient to prevent contig misassembly.
– Even genes “chimeric” between strains are useful.
Reasons why you shouldn’t believe me
1) Strain variation – when we get deeper in soil, we
should see more (?). Not sure what will
happen, and we do not (yet) have proven
approaches.
2) We, by definition, are not yet seeing anything
that doesn’t assemble.
3) We have not tackled scaffolding much. Serious
investigation of scaffolding will be necessary for
any good genome assembly, and scaffolding is
a weak point.
Metagenome assemblers
In addition to khmer prefiltering,
• SPAdes
• IDBA-UD
• MetaVelvet
• Ray Meta
Assembling in the cloud
• Most metagenomes require 50-150 GB of RAM.
• Many people don’t have access to computers of
that size.
• Amazon Web Services (aws.amazon.com) will
happily rent you such computers for $1-2/hr.
• I will post instructions and sample data sets for
using Amazon today at ged.msu.edu/angus/.
Current research
• Optimizing our programs => faster.
• Building an evaluation framework for
metagenome assemblers.
• Error correction!
De novo metagenome error correction
makes reads more mappable.
Jason Pell, unpub.
Concluding thoughts
• Achieving one or more assemblies is fairly
straightforward.
• Evaluating them is challenging, however, and
is where you should be thinking hardest about
assembly.
• There are relatively few pipelines available for
analyzing assembled metagenomic data. MG-RAST
does support this; others?

2013 hmp-assembly-webinar

  • 1. C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University ctb@msu.edu HMP – Metagenome assembly
  • 2. Acknowledgements Lab members involved Collaborators • Adina Howe (w/Tiedje) • Jason Pell • Arend Hintze • Rosangela Canino-Koning • Qingpeng Zhang • Elijah Lowe • Likit Preeyanon • Jiarong Guo • Tim Brom • Kanchan Pavangadkar • Eric McDonald • Jordan Fish • Chris Welcher • Jim Tiedje, MSU • Billie Swalla, UW • Janet Jansson, LBNL • Susannah Tringe, JGI Funding USDA NIFA; NSF IOS; BEACON.
  • 3. Open, online science All of the software and approaches I’m talking about today are available: Assembling large, complex metagenomes arxiv.org/abs/1212.2832 khmer software: github.com/ged-lab/khmer/ Blog: http://ivory.idyll.org/blog/ Twitter: @ctitusbrown
  • 4. Illumina! De Bruijn graphs! • Today I’ll be talking about Illumina data sets, and de Bruijn graph assembly (k-mer assembly). • This is because my research has largely focused on scaling to large data sets (soil metagenomics!) and Illumina is the real scaling challenge.
  • 5. Assembler heuristics • In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion. • These heuristics may not be appropriate for your sample! – High polymorphism? – Mixed population vs clonal? – Genomic vs metagenomic vs mRNA – Low coverage drives differences in assembly.
  • 6. Evaluating assembly Predicted genome. X X X X X X X X XX Reads - noisy observations of some genome. Assembler (a Big Black Box) Evaluating correctness of metagenomes is still undiscovered country.
  • 7. Shotgun sequencing “Coverage” is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
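The coverage definition on slide 7 can be turned into a one-line calculation. The sketch below is illustrative only: the function name and the example numbers are assumptions, not values from the talk.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads overlapping each base of the genome.

    Total sequenced bases (reads * read length) divided by genome size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical example: 500,000 reads of 100 bp sampled from a 5 Mbp genome.
print(mean_coverage(500_000, 100, 5_000_000))  # 10.0
```

This is an average; real coverage varies along the genome because reads are sampled randomly.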
  • 8. Reducing to k-mer overlaps Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
  • 9. Errors create new k-mers Each single base error generates ~k new k-mers. Generally, erroneous k-mers show up only once – errors are random.
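The claim on slide 9 that one substitution error yields about k new k-mers can be checked on a toy sequence. This is a minimal sketch; the sequence, the helper names and k = 5 are made up for illustration, and reverse complements are ignored.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmers_from_error(seq: str, pos: int, new_base: str, k: int) -> set:
    """k-mers that appear only after substituting new_base at position pos."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    return kmers(mutated, k) - kmers(seq, k)


# Toy sequence (hypothetical); substitute C -> G at position 10.
true_seq = "ACGTTGCAAGCTTAGGCTAACG"
novel = novel_kmers_from_error(true_seq, pos=10, new_base="G", k=5)
print(len(novel))  # 5, i.e. k: every k-mer overlapping the error is new
```

An error near the end of a read overlaps fewer than k k-mers, which is why the slide says "~k" rather than exactly k.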
  • 10. So, k-mer abundance plots are mixtures of true and false k-mers.
  • 11. Counting k-mers - histograms Low-abundance peak (errors)
  • 12. Counting k-mers - histograms High-abundance peak (true k-mers)
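The two peaks described on slides 10 to 12 can be reproduced with a small exact counter. This is a toy sketch, not khmer's implementation: the reads here are ten error-free copies of a made-up sequence plus one copy carrying a single substitution, and all names are illustrative.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Toy data (hypothetical): 10 clean reads and 1 read with one error.
genome = "ACGTTGCAAGCTTAGGCTAACG"
reads = [genome] * 10 + [genome[:10] + "G" + genome[11:]]
hist = abundance_histogram(kmer_counts(reads, 5))
for abundance in sorted(hist):
    print(abundance, hist[abundance])
```

The k-mers with abundance 1 are the erroneous ones; the true k-mers sit at abundance 10 or 11, near the coverage.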
  • 13. Approach: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
  • 20. Digital normalization approach A digital analog to cDNA library normalization, diginorm: • Reference free. • Is single pass: looks at each read only once; • Does not “collect” the majority of errors; • Keeps all low-coverage reads; • Smooths out coverage of regions.
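The single-pass behaviour listed on slide 20 can be sketched as follows. This assumes the published diginorm idea of estimating a read's coverage from the median abundance of its k-mers and keeping the read only if that estimate is below a cutoff. The function name, the parameters `k` and `cutoff`, and the use of an exact `Counter` are assumptions for illustration; khmer itself uses a memory-efficient probabilistic counting structure.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass over reads; keep a read only if its estimated coverage is low."""
    counts = Counter()  # exact counts; a stand-in for a probabilistic counter
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k, nothing to estimate from
        # Median k-mer abundance seen so far estimates this read's coverage.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Toy data (hypothetical): one sequence seen 50 times, one seen once.
abundant = "ACGTTGCAAGCTTAGGCTAACG"
rare = "TTTTTCCCCCGGGGGAAAAA"
kept = digital_normalization([abundant] * 50 + [rare], k=5, cutoff=3)
print(len(kept))  # 4: three copies of the abundant read plus the rare read
```

Because discarded reads never update the counts, most erroneous k-mers in high-coverage regions are never stored, while low-coverage reads are always retained.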
  • 21. Coverage before digital normalization: (MD amplified)
  • 22. Coverage after digital normalization: Normalizes coverage Discards redundancy Eliminates majority of errors Scales assembly dramatically. Assembly is 98% identical.
  • 23. In our experience… • Digital normalization produces “good” metagenome assemblies. • Smooths out abundance variation, strain variation. • Reduces computational requirements for assembly. • It also kinda makes sense :)
  • 24. Additional Approach for Metagenomes: Data partitioning (a computational version of cell sorting) Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences. “Divide and conquer” Memory-efficient implementation helps to scale assembly. Pell et al., 2012, PNAS
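The connectivity idea on slide 24 can be illustrated with a union-find over reads. This is a simplified sketch under stated assumptions: two reads are treated as connected when they share at least one exact k-mer, reverse complements are ignored, and everything is held in ordinary Python dictionaries. The memory-efficient graph representation cited on the slide (Pell et al., 2012) works differently; the names and toy reads below are hypothetical.

```python
def partition_reads(reads, k):
    """Group reads into partitions of reads transitively linked by shared k-mers."""
    parent = list(range(len(reads)))

    def find(x):
        # Find the partition representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in first_seen:
                root_a, root_b = find(idx), find(first_seen[km])
                if root_a != root_b:
                    parent[root_a] = root_b  # merge the two partitions
            else:
                first_seen[km] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


# Toy data (hypothetical): overlapping reads from two unrelated sequences.
a = "ACGTTGCAAGCTTAGGCTAACG"
b = "TTTTTCCCCCGGGGGAAAAA"
reads = [a[0:12], b[0:12], a[6:18], b[6:20], a[10:22]]
partitions = partition_reads(reads, k=5)
print(sorted(len(p) for p in partitions))  # [2, 3]
```

Each partition can then be assembled independently, which is the "divide and conquer" step.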
  • 25. Partitioning separates reads by genome. Strain variants co-partition. When computationally spiking HMP mock data with one E. coli genome (left) or multiple E. coli strains (right), majority of partitions contain reads from only a single genome (blue) vs multi-genome partitions (green). Partitions containing spiked data indicated with a * Adina Howe **
  • 26. Conclusions re strain variation/chimerism (previous slide) • When spiking in intentionally complex mixtures, only a small fraction of partitions are chimeric. • This means that only a small fraction of contigs could be chimeric. • Strain variants will almost certainly assemble together. • Can separate on abundance. See Sharon et al., 2013, PMID 22936250, for Banfield work on this.
  • 27. Looking at k-mer histograms…
  • 29. Partitioning picks out diff genomes
  • 30. Error correction “fixes” k-mers Jason Pell
  • 31. Our experience • Our metagenome assemblies compare well with others, but we have little in the way of ground truth with which to evaluate. • Scaffold assembly is tricky; we believe in contig assembly for metagenomes, but not scaffolding. • See arXiv paper, “Assembling large, complex metagenomes”, for our suggested pipeline and statistics & references.
  • 32. Metagenomic assemblies are highly variable Adina Howe et al., arXiv 1212.0159
  • 33. High coverage is needed. Low coverage is the dominant problem blocking assembly of your soil metagenome.
  • 34. Strain variation (soil) [Figure: top two allele frequencies vs. position within contig] Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%. Can measure by read mapping.
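One way to compute the quantities named on slide 34 from mapped reads is sketched below. The slide does not define "polymorphism rate", so the definition here (fraction of positions whose second most common base reaches a minimum frequency) and the `minor_threshold` value are assumptions. Each pileup column is represented as a plain string of the bases that mapped reads show at that position, which is also a simplification.

```python
from collections import Counter


def top_two_allele_frequencies(column):
    """Frequencies of the two most common bases in one pileup column."""
    total = len(column)
    freqs = [count / total for _, count in Counter(column).most_common(2)]
    while len(freqs) < 2:
        freqs.append(0.0)  # fewer than two alleles observed
    return freqs[0], freqs[1]


def polymorphism_rate(columns, minor_threshold=0.2):
    """Fraction of covered positions whose second allele is at least minor_threshold.

    This definition is an assumption for illustration, not taken from the slides.
    """
    covered = [c for c in columns if c]
    if not covered:
        return 0.0
    polymorphic = sum(
        1 for c in covered if top_two_allele_frequencies(c)[1] >= minor_threshold
    )
    return polymorphic / len(covered)


# Toy pileup (hypothetical): four positions, ten mapped reads each.
columns = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGTTT", "TTTTTTTTTT"]
print(top_two_allele_frequencies(columns[2]))  # (0.7, 0.3)
print(polymorphism_rate(columns))  # 0.25
```

In practice a threshold on the minor allele is needed so that sequencing errors are not counted as strain variation.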
  • 35. Overconfident predictions • We can assemble virtually anything but soil ;). – Genomes, transcriptomes, MDA, mixtures, etc. – Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth) • Strain variation confuses assembly, but does not prevent useful results. – Diginorm is a systematic strategy to enable assembly. – Banfield has shown how to deconvolve strains at differential abundance. – Kostas K. results suggest that there will be a species gap sufficient to prevent contig misassembly. – Even genes “chimeric” between strains are useful.
  • 36. Reasons why you shouldn’t believe me 1) Strain variation – when we get deeper in soil, we should see more (?). Not sure what will happen, and we do not (yet) have proven approaches. 2) We, by definition, are not yet seeing anything that doesn’t assemble. 3) We have not tackled scaffolding much. Serious investigation of scaffolding will be necessary for any good genome assembly, and scaffolding is a weak point.
  • 37. Metagenome assemblers In addition to khmer prefiltering, • SPAdes • IDBA-UD • MetaVelvet • Ray Meta
  • 38. Assembling in the cloud • Most metagenomes require 50-150 GB of RAM. • Many people don’t have access to computers of that size. • Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr. • I will post instructions and sample data sets for using Amazon today at ged.msu.edu/angus/.
  • 39. Current research • Optimizing our programs => faster. • Building an evaluation framework for metagenome assemblers. • Error correction!
  • 40. De novo metagenome error correction makes reads more mappable. Jason Pell, unpub.
  • 41. Concluding thoughts • Achieving one or more assemblies is fairly straightforward. • Evaluating them is challenging, however, and is where you should be thinking hardest about assembly. • There are relatively few pipelines available for analyzing assembled metagenomic data. MG-RAST does support this; others?

Editor's notes

  1. Bad habit…
  2. Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to, e.g., marine free-spawning animals.