Adina Howe
Michigan State University, Adjunct
Argonne National Laboratory, Postdoc
ASM Workshop, May 2013
Visual Complexity
http://www.flickr.com/photos/maisonbisson
MSU Lab and Collaborators:
• Titus Brown
• Jim Tiedje
• Jason Pell
• Qingpeng Zhang
• Jordan Fish
• Eric McDonald
• Chris Welcher
• Aaron Garoutte
• Jiarong Guo
• Janet Jansson
• Susannah Tringe
• I will upload this on SlideShare (adinachuanghowe)
• khmer documentation
github.com/ged-lab/khmer/
https://khmer.readthedocs.org/en/latest/guide.html
• Manuscripts
Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
http://www.pnas.org/content/early/2012/07/25/1121464109
A reference-free algorithm for computational normalization of shotgun sequencing data
http://arxiv.org/abs/1203.4802
Assembling large, complex metagenomes
http://arxiv.org/abs/1212.2832
[Diagram: high-abundance and low-abundance community members, in the environment (our goal) versus in our hands]
A few gotchas of sequencing:
• Errors / Artifacts (confusion)
• Diversity / Complexity (scale)
1. Digital normalization (lossy compression)
2. Partitioning
3. Enabling usage of current, previously unusable assembly tools
• Reduces data for analysis
• Longer sequences (increased accuracy of annotation)
• Gene order
• Does not rely on known references, access to unknowns
• Creates new references
• Lots of assembly tools available
But…
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
• High memory requirements
• Depends on good (~10x) sequencing coverage
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.
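To make the definition concrete, here is a minimal sketch of how coverage could be computed. The function names and the toy numbers are illustrative assumptions, not part of khmer or of the slides.

```python
def expected_coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base:
    (number of reads x read length) / genome size."""
    return num_reads * read_length / genome_size


def per_base_coverage(read_starts, read_length, genome_size):
    """Count how many reads overlap each position (the 'line straight down')."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


# Hypothetical example: 500,000 reads of 100 bp over a 5 Mbp genome.
print(expected_coverage(500_000, 100, 5_000_000))  # 10.0

# Toy example: three 4 bp reads on an 8 bp "genome".
print(per_base_coverage([0, 2, 4], 4, 8))  # [1, 1, 2, 2, 2, 2, 1, 1]
```

The per-base view is what the figure shows; the average of that list equals the expected coverage.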
Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
Each single-base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once, because errors are random.
[Figure labels: low-abundance peak (errors); high-abundance peak (true k-mers)]
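The two peaks can be reproduced on a toy scale. The sketch below uses exact counting with a Python `Counter`; the sequence, k = 4, and the helper names are made up for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Exact k-mer abundance across a set of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def novel_kmers_from_error(seq, pos, new_base, k):
    """k-mers created by substituting a single base at position pos."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    return set(kmers(mutated, k)) - set(kmers(seq, k))


k = 4
true_read = "ACGGTCATTGCA"        # hypothetical error-free sequence
error_read = "ACGGTAATTGCA"       # same sequence with one substitution (C -> A)

# One error in the middle of a read creates k new k-mers.
print(len(novel_kmers_from_error(true_read, 5, "A", k)))  # 4

# Ten correct copies plus one erroneous copy: abundance -> number of k-mers.
counts = count_kmers([true_read] * 10 + [error_read], k)
print(sorted(Counter(counts.values()).items()))  # [(1, 4), (10, 4), (11, 5)]
```

The k-mers seen once are the error peak; the k-mers seen 10 or 11 times are the true peak.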
Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B you need to get 100x coverage of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
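The arithmetic behind the A and B example, as a small sketch. The function name and the dictionary layout are assumptions made for illustration.

```python
def required_depths(relative_abundance, target_coverage):
    """Coverage every member ends up with when the rarest member
    is sequenced to target_coverage."""
    rarest = min(relative_abundance.values())
    return {
        name: target_coverage * abundance / rarest
        for name, abundance in relative_abundance.items()
    }


# A is 10 times as abundant as B; we want 10x coverage of B.
print(required_depths({"A": 10, "B": 1}, target_coverage=10))
# {'A': 100.0, 'B': 10.0}
```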
Diginorm, a digital analog to cDNA library normalization:
• Is reference-free;
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
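A minimal sketch of the idea, assuming the median k-mer count of a read is used as its coverage estimate and reads are kept only while that estimate is below a cutoff. It uses an exact `Counter` for clarity; khmer itself uses a memory-efficient probabilistic counting structure, and the function name, k, and cutoff here are illustrative only.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=4, cutoff=3):
    """Single pass: keep a read only if its estimated coverage so far
    (median abundance of its k-mers) is below the cutoff."""
    counts = Counter()  # exact counts; an assumption made for readability
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads are counted
    return kept


# Hypothetical community: A sampled 100 times, B sampled 10 times.
read_a = "ACGGTCATTGCA"
read_b = "TTGACCGATAGG"
kept = digital_normalization([read_a] * 100 + [read_b] * 10, k=4, cutoff=3)
print(kept.count(read_a), kept.count(read_b))  # 3 3
```

Both members end up at the same coverage, and no read of the low-abundance member is discarded before it reaches the cutoff.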
 Digital normalization produces “good”
metagenome assemblies.
 Smooths out abundance variation, strain
variation.
 Reduces computational requirements for
assembly.
 It also kinda makes sense :)
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
“Divide and conquer”
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
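A rough sketch of connectivity-based binning, under a simplifying assumption: two reads are treated as connected when they share at least one exact k-mer, and bins are the connected components found with union-find. This is not the memory-efficient graph representation described in Pell et al.; it ignores reverse complements, and the names are hypothetical.

```python
def partition_reads(reads, k=4):
    """Group reads into bins of reads that are transitively connected
    by shared k-mers (union-find over read indices)."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Toy reads from two unrelated sequences.
reads = ["ACGGTCAT", "GTCATTGC", "TTGACCGA", "CCGATAGG"]
for group in partition_reads(reads, k=4):
    print(group)
# ['ACGGTCAT', 'GTCATTGC']
# ['TTGACCGA', 'CCGATAGG']
```

Each bin can then be assembled on its own, which is the “divide and conquer” step.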
Low coverage is the dominant problem blocking assembly of your soil metagenome.
• In order to build assemblies, each assembler makes choices (uses heuristics) to reach a conclusion.
• These heuristics may not be appropriate for your sample!
• High polymorphism?
• Mixed population vs clonal?
• Genomic vs metagenomic vs mRNA?
• Low coverage drives differences in assembly.
• We can assemble virtually anything but soil ;).
• Genomes, transcriptomes, MDA, mixtures, etc.
• Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).
• Strain variation confuses assembly, but does not prevent useful results.
• Diginorm is a systematic strategy to enable assembly.
• Banfield has shown how to deconvolve strains at differential abundance.
• Kostas K.’s results suggest that there will be a species gap sufficient to prevent contig misassembly.
• Most metagenomes require 50-150 GB of RAM.
• Many people don’t have access to computers of that size.
• Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.
• http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html
• Optimizing our programs => faster.
• Building an evaluation framework for metagenome assemblers.
• Error correction!
• Achieving one or more assemblies is fairly straightforward.
• An assembly is a hypothesis; evaluating assemblies is challenging, however, and is where you should be thinking hardest about assembly.
• There are relatively few pipelines available for analyzing assembled metagenomic data.
• Questions?
• How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment?
Visual Complexity
http://www.flickr.com/photos/maisonbisson
• Major efforts of data collection
• Open mind for discoveries
• Willingness to adjust to change
• Multiple efforts
• Well-designed experiments
Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes
• We receive Gb of sequences
• Generally, my data is…
• Split by barcodes
• Untrimmed
• Adapters are present
• Two paired-end FASTQ files
• Underestimation of computational requirements:
• Quality control steps usually require 2-3 times the amount of hard drive space
• Similarity comparison against known databases is impractical (soil metagenome ~50 years to BLAST)
[Image: Home Alone scream]
My first slide graphic that I’m scared may date me.
Two ways to reduce the onslaught:
• Cluster into known observations (annotate, bin)
• Assembly
• Some mix of the above
Ten of you upload 1 HiSeq flowcell into MG-RAST
Illumina short reads from soil metagenome (~100 bp)
454 short reads from soil metagenome (~368 bp)
Assembled contigs (Illumina) from soil metagenome (~491 bp)
Read length will increase… and computational requirements? Assembly is a great way to reduce data.

ASM 2013 Metagenomic Assembly Workshop Slides
