Multivariate exploration of microbial communities
Josh D. Neufeld
Braunschweig, Germany
December, 2013

Andre Masella (MSc): Computer science
Michael Lynch (PhD): Taxonomy, phylogenetics, ecology
Michael Hall (co-op): mathematics, programming, user friendly!
Posted on Slideshare without images and unpublished data
Quick history
Alpha and Beta diversity
Species that matter
Pipelines
Future prospects and problems
Who lives with whom, and why, and where?
Data reduction is essential for:
a) summarizing large numbers of observations
into manageable numbers
b) visualizing many interconnected variables in a
compact manner
Alpha diversity: species richness (and evenness)
within a single sample
Beta diversity: change in species composition
across a collection of samples
Gamma diversity: total species richness across an
environmental gradient
An (abbreviated) history
Numerical ecology
phenetics and statistical analysis of organismal
counts
macroecology

16S rRNA gene era
sequence analysis as a surrogate for counting
mapping of marker to taxonomy

NGS enabled synthesis of phenetics,
phylogenetics, and numerical ecology
Now generate V3-V4 bacterial amplicons (~450 bases)
Usually PE 300
Assembling paired-end reads dramatically reduces error
Corrects mismatches in region of overlap (quality threshold >0.9), set a minimum overlap.
Can compare to perfect overlap assembly: “completelymissesthepoint” (name changing soon)
PANDAseq
>30x faster than next fastest alternative assembler
1. p-value threshold
2. parallelizes correctly
(both are now added or fixed in PANDAseq)
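A minimal sketch of the overlap idea, in Python. This is not PANDAseq's algorithm (which scores overlaps probabilistically); it only illustrates how a mismatch in the overlap region can be resolved by trusting the higher-quality base. The function name, parameters, and thresholds are assumptions made for the example.

```python
def merge_pair(fwd, fwd_qual, rev_rc, rev_qual, min_overlap=8, max_mismatch_frac=0.2):
    """Merge a forward read with an already reverse-complemented reverse read.

    Qualities are lists of Phred scores, one per base. Returns None when no
    overlap of at least min_overlap bases passes the mismatch threshold.
    """
    best_k, best_frac = None, None
    # Try every candidate overlap length, longest first; keep the cleanest one.
    for k in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(fwd[-k:], rev_rc[:k]))
        frac = mismatches / k
        if frac <= max_mismatch_frac and (best_frac is None or frac < best_frac):
            best_k, best_frac = k, frac
    if best_k is None:
        return None
    start = len(fwd) - best_k
    consensus = []
    for i in range(best_k):
        a, qa = fwd[start + i], fwd_qual[start + i]
        b, qb = rev_rc[i], rev_qual[i]
        # At a mismatch, keep the base with the higher quality score.
        consensus.append(a if qa >= qb else b)
    return fwd[:start] + "".join(consensus) + rev_rc[best_k:]


# Toy fragment AAAACCCCGGGGTTTT; the forward read carries one low-quality error.
fwd = "AAAACCCCGGTG"
fwd_qual = [40] * 10 + [5, 40]
rev_rc = "CCCCGGGGTTTT"
rev_qual = [40] * 12
merged = merge_pair(fwd, fwd_qual, rev_rc, rev_qual)
no_overlap = merge_pair("A" * 10, [30] * 10, "C" * 10, [30] * 10)
```

Here the low-quality T in the forward read is replaced by the high-quality G from the reverse read, which is the sense in which assembly reduces error.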
Biological Observation Matrix
BIOM file format (McDonald et al. 2012)
Standard recognized by EMP, MG-RAST,
VAMPS
Based on JSON data interchange format
Computational structure in multiple languages

“facilitates the efficient handling and
storage of large, sparse biological
contingency tables”
Encapsulates metadata and contingency
table (e.g., OTU table) in one file
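A rough illustration of a sparse, JSON-based contingency table with metadata in the same file. The layout below is loosely modeled on the BIOM 1.0 sparse representation, but the field set is abbreviated and the OTU and sample names are made up; real files should be read and written with the official biom-format tools.

```python
import json

# Hypothetical two-sample, three-OTU table; sparse entries are [row, column, count].
table = {
    "format": "Biological Observation Matrix 1.0.0",
    "type": "OTU table",
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [3, 2],
    "rows": [
        {"id": "OTU_1", "metadata": {"taxonomy": ["k__Bacteria", "p__Proteobacteria"]}},
        {"id": "OTU_2", "metadata": {"taxonomy": ["k__Bacteria", "p__Actinobacteria"]}},
        {"id": "OTU_3", "metadata": {"taxonomy": ["k__Bacteria", "p__Planctomycetes"]}},
    ],
    "columns": [
        {"id": "soil_A", "metadata": {"soil_depth": 5}},
        {"id": "soil_B", "metadata": {"soil_depth": 20}},
    ],
    "data": [[0, 0, 12], [0, 1, 3], [2, 1, 7]],
}


def to_dense(biom_like):
    """Expand the sparse [row, column, value] triples into a dense list of rows."""
    n_rows, n_cols = biom_like["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in biom_like["data"]:
        dense[row][col] = value
    return dense


# Round-trip through JSON text, as a file on disk would.
loaded = json.loads(json.dumps(table))
dense = to_dense(loaded)
sample_ids = [column["id"] for column in loaded["columns"]]
```

Only non-zero counts are stored, which is why the format suits large, sparse OTU tables.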
Diversity
(richness and evenness)
α-diversity: Richness and
Evenness

Shannon index (H’), Estimators (Chao1, ACE), Phylogenetic Diversity

Shannon index (H’): richness and evenness
Estimators: richness
Faith’s PD: phylogenetic richness
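A small sketch of two of these measures on a vector of OTU counts for one sample. The Chao1 form used here is the bias-corrected variant (chosen because it avoids division by zero when there are no doubletons); the function names and example counts are assumptions for illustration.

```python
import math


def shannon(counts):
    """Shannon index H' using natural logarithms."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """H' divided by its maximum, ln(observed richness)."""
    richness = sum(1 for c in counts if c > 0)
    if richness <= 1:
        return 0.0
    return shannon(counts) / math.log(richness)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    s_obs = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return s_obs + singletons * (singletons - 1) / (2 * (doubletons + 1))


even_sample = [10, 10, 10, 10]
skewed_sample = [5, 3, 1, 1, 1, 2]   # 6 observed OTUs, 3 singletons, 1 doubleton

h_even = shannon(even_sample)          # equals ln(4) for four equally abundant OTUs
richness_estimate = chao1(skewed_sample)  # 6 + 3*2 / (2*2) = 7.5
```

Shannon responds to both richness and evenness, while Chao1 uses only the rare classes (singletons and doubletons) to estimate unseen richness.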
Stearns et al., 2011

Hughes et al., 2001
“All biologists who sample natural
communities are plagued with the
problem of how well a sample reflects a
community’s ‘true’ diversity.”
Hughes et al. 2001
“Nonparametric estimators show particular promise for microbial data and in
some habitats may require sample sizes of only 200 to 1,000 clones to detect
richness differences of only tens of species.”
[Figure: Google Scholar proportion of “[Sequencing tech] AND 16S” results for Sanger, 454, and Illumina, plotted with “Rare biosphere” citations against Time (year), 2000 to 2012]
Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
GOALS
Understanding of community structure
Better alpha-diversity measures
Robust beta-diversity measures

Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
Stearns et al. 2011
Bartram et al. 2011
Clustering algorithms
(influence alpha diversity primarily)

CD-HIT (Li and Godzik, Sanford-Burnham Medical
Research Institute)
‘longest-sequence-first’ removal algorithm
Fast, many implementations (nucleotide, protein, OTU-specific)
Tends to be more stringent than UCLUST

UCLUST (R. Edgar, drive5.com)
Faster than CD-HIT
Tends to generate larger number of low-abundance OTUs
Broader range of clustering thresholds

“I do not recommend using the UCLUST algorithm or
CD-HIT for generating OTUs” – Robert Edgar
CROP: Clustering 16S rRNA for OTU Prediction
“CROP can find clusters based on the natural organization of data without setting a
hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”
Chimeras
DNA from two or more parent molecules
PCR artifact
Can easily be classified as a “novel” sequence
Increases α-diversity

Software
ChimeraSlayer, Bellerophon, UCHIME, Pintail

Reference database or de novo
Classification and taxonomy
Ribosomal Database Project (RDP) classifier
Naïve Bayesian classifier (James Cole and Tiedje)
http://rdp.cme.msu.edu/

pplacer
Phylogenetic placement and visualization

BLAST
The tool we know and love

RTAX (UC Berkeley, Rob Knight involved)
http://dev.davidsoergel.com/trac/rtax/

mothur (Patrick Schloss)
http://www.mothur.org/

SINA (SILVA)
RDP classifier
Large training sets require active memory management
Can be easily run in parallel by breaking up very large data sets
Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained)
Algorithm:
determine the probability that an unknown query sequence is a member of a
known genus (training set), based on the profile of word subsets of known
genera.

Confidence estimation:
the number of times in 100 trials that a genus was selected based on a
random subset of words in the query

Take home:
The higher the diversity (bigger sequence space) of the training set, the
better the assignment
Longer query = better and more reliable assignment
Short reads (i.e., <250 bases) will have lower confidence estimates (cutoff of
0.5 suggested)
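A toy sketch of the word-based naive Bayesian idea and the bootstrap confidence described above. It is not the RDP classifier itself: the word size is shortened to 4 to keep the example small, the training sequences and genus names are invented, and the smoothing and subset size (one eighth of the query words) follow my reading of the published description and should be treated as assumptions.

```python
import math
import random
from collections import Counter

WORD_SIZE = 4  # assumption for this toy example; much shorter than real classifiers use


def words(seq, k=WORD_SIZE):
    """Set of overlapping k-base words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(training):
    """training maps genus name -> list of sequences."""
    n_total = sum(len(seqs) for seqs in training.values())
    overall = Counter()
    per_genus = {}
    for genus, seqs in training.items():
        counts = Counter()
        for seq in seqs:
            seq_words = words(seq)
            counts.update(seq_words)
            overall.update(seq_words)
        per_genus[genus] = (counts, len(seqs))
    return per_genus, overall, n_total


def log_score(query_words, genus_model, overall, n_total):
    """Sum of log word probabilities for one genus, smoothed by a word prior."""
    counts, n_genus = genus_model
    score = 0.0
    for w in query_words:
        prior = (overall[w] + 0.5) / (n_total + 1)
        score += math.log((counts[w] + prior) / (n_genus + 1))
    return score


def classify(seq, model, trials=100, seed=0):
    per_genus, overall, n_total = model
    rng = random.Random(seed)
    all_words = sorted(words(seq))

    def best_genus(word_list):
        return max(per_genus, key=lambda g: log_score(word_list, per_genus[g], overall, n_total))

    assignment = best_genus(all_words)
    subset_size = max(1, len(all_words) // 8)
    hits = 0
    for _ in range(trials):
        # Each trial scores a random subset of the query words.
        subset = [rng.choice(all_words) for _ in range(subset_size)]
        hits += best_genus(subset) == assignment
    return assignment, hits / trials


# Invented training set with two clearly different "genera".
model = train({
    "GenusA": ["ACGTACGTACGTACGTACGTACGT", "ACGTACGTACGAACGTACGTACGT"],
    "GenusB": ["TTGGCCAATTGGCCAATTGGCCAA", "TTGGCCAATTGGACAATTGGCCAA"],
})
genus, confidence = classify("ACGTACGTACGTACGT", model)
```

With real data, shorter queries contain fewer words, so random subsets disagree more often and the confidence estimate drops, which matches the take-home points above.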
Database sources
GreenGenes
Latest May 2013

SILVA
Latest 115 (August 2013)
Includes 18S, 23S, 28S, LSU

RDP Database
Latest 11 (October 2013)

GenBank
Research-specific
e.g., CORE Oral
Multivariate data reduction
β-diversity
Visualization (ordination) versus hypothesis
testing (MRPP, indicator species analysis)
Many more algorithms out there for
exploration and statistical testing
mostly through widely used R packages
vegan (Community Ecology Package)
labdsv (Ordination and Multivariate Analysis for
Ecology)
ape (Analyses of Phylogenetics and Evolution)
picante (community analyses etc.)
Visualization (ordination)
Complementary to data clustering
looks for discontinuities

Ordination extracts main trends as continuous
axes
analysis of the square matrix derived from the
OTU table

Non-parametric, unconstrained ordination
methods most widely used (and best suited)
methods that can work directly on a square matrix

An appropriate metric is required to derive
this square matrix
many options...
Metrics
Ordination is essentially reducing dimensionality
first requirement: accurately model differences
among samples
Models are *really* important. Examples include:
OTU presence/absence
Dice, Jaccard
OTU abundance
Bray-Curtis
Phylogenetic
UniFrac

“all models are wrong, some are useful”
- G.E. Box

“You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.”
- Susan Huse
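To make the choice of metric concrete, here is a short sketch of the presence/absence and abundance metrics named above, written as dissimilarities between two samples' OTU count vectors. The sample values are invented.

```python
def presence(counts):
    """Indices of OTUs with a non-zero count."""
    return {i for i, c in enumerate(counts) if c > 0}


def jaccard_distance(a, b):
    """1 - shared OTUs / OTUs present in either sample."""
    pa, pb = presence(a), presence(b)
    union = pa | pb
    return 1 - len(pa & pb) / len(union) if union else 0.0


def dice_distance(a, b):
    """1 - 2 * shared OTUs / (OTUs in a + OTUs in b)."""
    pa, pb = presence(a), presence(b)
    denom = len(pa) + len(pb)
    return 1 - 2 * len(pa & pb) / denom if denom else 0.0


def bray_curtis(a, b):
    """Sum of absolute count differences over the total count of both samples."""
    denom = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / denom if denom else 0.0


sample_1 = [10, 0, 5, 0]
sample_2 = [1, 0, 50, 4]

d_jaccard = jaccard_distance(sample_1, sample_2)  # 1 - 2/3
d_dice = dice_distance(sample_1, sample_2)        # 1 - 4/5
d_bray = bray_curtis(sample_1, sample_2)          # 58 / 70
```

The two samples share most OTUs (low Jaccard and Dice distance) but differ strongly in abundance (high Bray-Curtis), so the metric chosen changes the picture an ordination will draw.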
Metrics: UniFrac
A distance measure comparing multiple
communities using phylogenetic information
Requires sequence alignment and tree-building
PyNAST, MUSCLE, Infernal
Time-consuming and susceptible to poor phylogenetic
inference (does it matter?)

Weighted (abundance)
ecological features related to
abundance

Unweighted
ecological features related to
taxonomic presence/absence
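A minimal sketch of the unweighted calculation: the fraction of observed branch length that leads to taxa from only one of the two communities. The tree is given directly as a list of (branch length, set of tips below that branch), which sidesteps alignment and tree building; this representation and the tip names are assumptions for the example.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """branches: iterable of (length, set_of_descendant_tips)."""
    unique = 0.0
    total = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a or in_b:
            total += length          # branch is observed in at least one community
            if in_a != in_b:
                unique += length     # branch leads to only one community
    return unique / total if total else 0.0


# Toy tree ((t1:1,t2:1):1,(t3:1,t4:1):1) written out branch by branch.
branches = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t1", "t2"}),
    (1.0, {"t3"}), (1.0, {"t4"}), (1.0, {"t3", "t4"}),
]

disjoint = unweighted_unifrac(branches, {"t1", "t2"}, {"t3", "t4"})  # no shared branches
partial = unweighted_unifrac(branches, {"t1", "t3"}, {"t2", "t3"})   # 2 unique of 5 observed
identical = unweighted_unifrac(branches, {"t1", "t4"}, {"t1", "t4"})
```

The weighted variant would additionally scale each branch by the difference in relative abundance below it, which is why it emphasizes abundance-related features.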
Ordination example 1 (of many):

Principal Coordinates Analysis
Classical Multidimensional Scaling (MDS; Gower 1966)
Procedure:
based on eigenvectors
position objects in low-dimensional space while preserving
distance relationships as well as possible

highly flexible
can choose among many association measures

In microbial ecology, used for visualizing
phylogenetic or count-based distances
Consistent visual output for given distance matrix
Include variance explained (%) on Axis 1 and 2
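The procedure can be sketched in plain Python: square and double-center the distance matrix, then extract the leading eigenvectors. Power iteration with deflation stands in here for a proper eigensolver, and the sketch assumes distances that embed in Euclidean space (no large negative eigenvalues); in practice an R or NumPy routine would be used. The four-point example is invented.

```python
import math


def double_center(dist):
    """Gower centering of the matrix of -0.5 * squared distances."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    # The matrix is symmetric, so column means equal row means.
    return [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def dominant_eigenpair(matrix, iterations=500):
    """Largest-magnitude eigenvalue and unit eigenvector by power iteration."""
    n = len(matrix)
    vec = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(n) for j in range(n))
    return value, vec


def pcoa(dist, axes=2):
    b = double_center(dist)
    n = len(b)
    trace = sum(b[i][i] for i in range(n))  # sum of all eigenvalues
    coords = [[] for _ in range(n)]
    explained = []
    for _ in range(axes):
        value, vec = dominant_eigenpair(b)
        value = max(value, 0.0)
        for i in range(n):
            coords[i].append(vec[i] * math.sqrt(value))
        explained.append(value / trace if trace > 0 else 0.0)
        # Deflate: remove the axis just found before searching for the next one.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return coords, explained


# Distances among the corners of a 3 x 4 rectangle.
dist = [
    [0, 3, 4, 5],
    [3, 0, 5, 4],
    [4, 5, 0, 3],
    [5, 4, 3, 0],
]
coords, explained = pcoa(dist)
```

For this input the two axes recover the rectangle exactly, and `explained` holds the fraction of variance on each axis, the values that belong on the axis labels of a PCoA plot.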
Ordination example 2 (of many):

Non-metric Multidimensional Scaling
Ordination not based on eigenvectors
Does not preserve exact distances among objects
attempts to preserve ordering of samples (“ranks”)

Procedure:
iterative, tries to position the objects in a few (2-3) dimensions in such a way
that minimizes the “stress”
how well does the new ranked distribution of points represent the original
distances in the association matrix? Can express as R^2 on axes 1 and 2.
the adjustment goes on until the stress value reaches a local minimum
(heuristic solution)

NMDS often represents distance relationships better than PCoA in the
same number of dimensions
Susceptible to the “local minimum issue”, and therefore should have
strong starting point (e.g., PCoA) or many permutations
You won't get the same result each time you run the analysis. Try several
runs until you are comfortable with the result.
Do my treatments separate?
Beta-diversity: Hypothesis testing
Multiple methods, implemented in QIIME,
mothur, AXIOME
e.g., MRPP, adonis, NP-MANOVA (perMANOVA),
ANOSIM
Are treatment effects significant?

Because these are predominantly
nonparametric methods, tests for
significance rely on testing by permutation
Let's focus on MRPP
Multiresponse Permutation Procedures
Compare intragroup average distances with the
average distances that would have resulted from all
the other possible combinations
T statistic: more negative with
increasing group separation
(T>-10 common for ecology)
A statistic: Degree of scatter
within groups (A=1 when all
points fall on top of one another)
p value: likelihood of similar
separation with randomized
data.
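A permutation-based sketch of MRPP on a precomputed distance matrix. Group weights of n_g / N are used, and the T statistic is approximated from the permutation distribution rather than the analytic moments that standard implementations use; treat both choices, and the toy data, as assumptions.

```python
import math
import random
from itertools import combinations


def mrpp_delta(dist, groups):
    """Weighted mean within-group distance (weights n_g / N)."""
    n_total = len(groups)
    delta = 0.0
    for g in sorted(set(groups)):
        members = [i for i, label in enumerate(groups) if label == g]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue
        mean_d = sum(dist[i][j] for i, j in pairs) / len(pairs)
        delta += (len(members) / n_total) * mean_d
    return delta


def mrpp(dist, groups, permutations=999, seed=0):
    rng = random.Random(seed)
    observed = mrpp_delta(dist, groups)
    labels = list(groups)
    perm_deltas = []
    for _ in range(permutations):
        rng.shuffle(labels)  # reassign samples to groups at random
        perm_deltas.append(mrpp_delta(dist, labels))
    expected = sum(perm_deltas) / permutations
    variance = sum((d - expected) ** 2 for d in perm_deltas) / (permutations - 1)
    sd = math.sqrt(variance)
    t_stat = (observed - expected) / sd if sd > 0 else 0.0
    a_stat = 1 - observed / expected
    # Small tolerance guards against floating-point noise in tied deltas.
    extreme = sum(d <= observed + 1e-12 for d in perm_deltas)
    p_value = (extreme + 1) / (permutations + 1)
    return observed, t_stat, a_stat, p_value


# Eight samples along one gradient: two tight, well-separated treatments.
positions = [0, 1, 2, 3, 10, 11, 12, 13]
groups = ["control"] * 4 + ["treated"] * 4
dist = [[abs(p - q) for q in positions] for p in positions]

observed_delta, t_stat, a_stat, p_value = mrpp(dist, groups)
```

A strongly negative T and an A well above zero indicate that within-group distances are much smaller than expected under random group assignment.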
“PCoA plots are the first
step of a community
analysis, not the last.”
Josh Neufeld
Searching for species that matter
High dimensional data often have too many
features to investigate
solution: identify and study species significantly
associated with categorical metadata

Indicator species (Dufrene-Legendre)
calculates indicator value (fidelity and relative
abundance) of species
Permutation test for significance
Need solution for sparse data
be wary of groups with small numbers of sites (influence on permutation tests)
low abundance can artificially inflate indicator values
Specificity
Fidelity
IndVal (Dufrene & Legendre, 1997)
Specificity
Large mean abundance within group relative to summed
mean abundances of other groups

Fidelity
Presence in most or all sites of that group

Groups defined a priori by metadata or
statistical clustering
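A short sketch of the indicator value for one species across groups of sites. Specificity is computed here against the summed mean abundances of all groups, and the result is reported on a 0 to 1 scale rather than as a percentage; the abundances and group labels are invented, and the permutation test for significance is left out.

```python
def indval(abundances, groups):
    """Return {group: (specificity, fidelity, indicator value)} for one species."""
    labels = sorted(set(groups))
    mean_abundance = {}
    fidelity = {}
    for g in labels:
        values = [a for a, label in zip(abundances, groups) if label == g]
        mean_abundance[g] = sum(values) / len(values)
        # Fidelity: fraction of the group's sites where the species is present.
        fidelity[g] = sum(1 for v in values if v > 0) / len(values)
    total = sum(mean_abundance.values())
    result = {}
    for g in labels:
        # Specificity: share of the species' mean abundance found in this group.
        specificity = mean_abundance[g] / total if total else 0.0
        result[g] = (specificity, fidelity[g], specificity * fidelity[g])
    return result


# One OTU counted at four "wet" and four "dry" sites.
abundances = [10, 8, 0, 12, 1, 0, 0, 0]
groups = ["wet"] * 4 + ["dry"] * 4
scores = indval(abundances, groups)
wet_specificity, wet_fidelity, wet_indval = scores["wet"]
```

The OTU is a reasonable indicator of wet sites: almost all of its abundance is there, and it occurs at three of the four wet sites.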
Simple linear correlations
Metadata | Taxon | R^2 value
mbc | k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__ | 0.611368489781491
mbc | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.677209935419981
mbn | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.64092523702996
soil_depth | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__ | 0.669761188668774
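For reference, the R^2 of a simple linear correlation between one metadata variable and one taxon's abundance can be computed as the squared Pearson correlation. The numbers below are invented and are not the values from the table.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable has no linear relationship to report
    return (sxy * sxy) / (sxx * syy)


soil_depth = [5, 10, 15, 20, 25]
perfectly_linear = [0.02, 0.04, 0.06, 0.08, 0.10]  # relative abundance of a taxon
noisy = [0.03, 0.02, 0.07, 0.06, 0.11]

r2_linear = r_squared(soil_depth, perfectly_linear)
r2_noisy = r_squared(soil_depth, noisy)
```

Screening every taxon against every metadata column this way produces many tests, so high R^2 values are best treated as leads for follow-up rather than conclusions.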
mothur: cooccurrence function, measuring whether populations are co-occurring
more frequently than you would expect by chance.
Non-negative Matrix Factorization
NMF as a representation method for portraying
high-dimensional data as a small number of
taxonomic components.
Patterns of co-occurring OTUs can be
described by a smaller number of taxonomic
components.
Each sample represented by the collection of
component taxa, helping identify relationships
between taxa and the environment.
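A compact sketch of NMF using the standard multiplicative update rules for squared error, written in plain Python for a tiny samples-by-OTUs matrix. The matrix, rank, and iteration count are invented, and this is not necessarily the variant used in the work cited on the slide.

```python
import random


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative v (samples x OTUs) into w (samples x rank) and h (rank x OTUs)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h


# Four samples, four OTUs: two groups of co-occurring OTUs.
otu_table = [
    [10, 9, 0, 0],
    [8, 10, 0, 1],
    [0, 0, 7, 9],
    [1, 0, 8, 8],
]
w_start, h_start = nmf(otu_table, rank=2, iterations=0)  # random starting point
w, h = nmf(otu_table, rank=2)
error_start = squared_error(otu_table, w_start, h_start)
error_final = squared_error(otu_table, w, h)
```

Each row of `h` is a component (a set of OTU loadings) and each row of `w` says how strongly a sample expresses each component.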
Jonathan Dushoff, McMaster University, Ontario, Canada
SSUnique
SILVA
Nakai et al. 2012

Lynch et al. 2012
Why pipelines?
Merge and manage (many) disparate techniques
Democratize analysis
improve accessibility

Accelerate pace of innovation, collaboration, and
research
Early synthesis
Early synthesis for numerical microbial ecology
Synthesis of 16S phylogenetics (Woese et al.)
and Hughes (Counting the uncountable)
Numerical ecology for microorganisms

Algorithm development
libshuff, dotur (mothur)

Analysis pipelines
QIIME, mothur
QIIME
Knight Lab, U. Colorado at Boulder
Predominantly a collection of integrated Python/R
scripts
Many dependencies
easy managed installation:
qiime-deploy
MacQIIME
VirtualBox and Ubuntu fork
avoid for anything but small runs

Becoming the standard for marker gene studies
integrated analysis and visualization
easy access to broad computational biology toolbox
(Python/R)
Automation and extension
AXIOME and phyloseq
Extend existing technologies (QIIME, mothur, R,
custom)

Layers of abstraction
Automation and rapid re-analysis
Promote reproducible research (IPython, XML,
make)

Implement existing techniques (e.g., MRPP,
Dufrene-Legendre IndVal)
numerical microbial ecology needs to better
incorporate modern statistical theory

Develop and test new techniques
Axiometic
GUI companion for AXIOME
Cross-platform
New implementation in
development

Generates AXIOME file (XML)

xls template coming soon for all commands, sample metadata, and extra info…
much easier for everyone.
“QIIME wraps many other software
packages, and these should be cited if
they are used. Any time you're using
tools that QIIME wraps, it is essential
to cite those tools.”
http://qiime.org/index.html
The future
As data get bigger, interpretation should be
“hands off”
Move towards hypothesis testing of high-dimensional taxonomic data

Convergence on Galaxy
e.g., QIIME in Galaxy is developing

Further extension to cloud services
e.g., Amazon EC2

Machine learning and data mining
applications
Galaxy
Open-source, web-based platform
Deployed locally or in the cloud
Ongoing development of 16S rRNA gene analysis
Galaxy Workshed (available tools)
“The advantages of having large numbers of
samples at shallow coverage (~1,000 sequences
per sample) clearly outweigh having a small
number of samples at greater coverage for many
datasets, suggesting that the focus for future
studies should be on broader sampling that can
reveal association with key biological
parameters rather than on deeper sequencing.”
“….even [phylogenetic beta-diversity]
measures suited to the underlying
mechanism of differentiation may
require deep sequencing to reveal
subtle patterns”
Dr. Donovan Parks
Method standardization
Impossible.
Data storage
Sequence reads outpacing data storage costs
Federated data?
File formats
e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient,
do not ensure data is in correct format, no space for metadata, no absolute
standard)… relational databases?
Software
Free and Open Source enables an experiment to be faithfully replicated
Algorithms
Memory!
Many clustering and phylogenetic inference algorithms scale as n^2
Distributed, parallel, or cloud computing may not be helpful
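A back-of-the-envelope sketch of why memory becomes the bottleneck: the number of pairwise distances grows with the square of the number of sequences. The 8 bytes per value (one double-precision number) is an assumption; real tools may store distances more compactly or sparsely.

```python
def distance_matrix_gib(n_sequences, bytes_per_value=8):
    """Memory in GiB for one triangle of an all-against-all distance matrix."""
    pairs = n_sequences * (n_sequences - 1) // 2
    return pairs * bytes_per_value / 2**30


for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} sequences: {distance_matrix_gib(n):10.1f} GiB")

# Ten times more sequences costs roughly a hundred times more memory.
ratio = distance_matrix_gib(1000) / distance_matrix_gib(100)
```

Because the cost is in holding all pairs at once, adding more machines does not help unless the algorithm itself can be partitioned.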
Metadata
What to do with it? How to marry sequence and metadata sets?
We need better metadata integration, not necessarily more/better metadata
What should we be doing?
(take-home messages)

*Surveys are really important for
spatial and temporal mapping
*Hypothesis testing follows (or implicit)
*What species account for treatment effects?
*Who tracks with who? (why=function)
*Who avoids who?
*Are all microorganisms accounted for? (no)
*How can we use this information to
manipulate, manage and predict ecosystems?
What should we be doing?
(take-home messages)

There is no “one way” to analyze 16S rRNA
You need to build a pipeline for you.
If this seems daunting, it is.
If this is not daunting, your hands are dirty.
It’s getting better all the tii-ime.
Helpful resources
Thank you
jneufeld@uwaterloo.ca

 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Introduction to 16S rRNA gene multivariate analysis

  • 1. Multivariate exploration of microbial communities Josh D. Neufeld Braunschweig, Germany December, 2013 Andre Masella (MSc): Computer science Michael Lynch (PhD): Taxonomy, phylogenetics, ecology Michael Hall (co-op): mathematics, programming, user friendly! Posted on Slideshare without images and unpublished data
  • 2. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 3. Who lives with whom, and why, and where? Data reduction is essential for: a) summarizing large numbers of observations into manageable numbers b) visualizing many interconnected variables in a compact manner Alpha diversity: species richness (and evenness) within a single sample Beta diversity: change in species composition across a collection of samples Gamma diversity: total species richness across an environmental gradient
  • 4. An (abbreviated) history Numerical ecology phenetics and statistical analysis of organismal counts macroecology 16S rRNA gene era sequence analysis as a surrogate for counting mapping of marker to taxonomy NGS enabled synthesis of phenetics, phylogenetics, and numerical ecology
  • 5. Now generate V3-V4 bacterial amplicons (~450 bases) Usually PE 300
  • 6. Assembling paired-end reads dramatically reduces error Corrects mismatches in region of overlap (quality threshold >0.9), set a minimum overlap. Can compare to perfect overlap assembly: “completelymissesthepoint” (name changing soon)
  • 8. 1. p-value threshold 2. parallelizes correctly (both are now added or fixed in PANDAseq)
  • 10. Biological Observation Matrix BIOM file format (MacDonald et al. 2012) Standard recognized by EMP, MG-RAST, VAMPS Based on JSON data interchange format Computational structure in multiple languages “facilitates the efficient handling and storage of large, sparse biological contingency tables” Encapsulates metadata and contingency table (e.g., OTU table) in one file
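To make the sparse, JSON-based layout on slide 10 concrete, here is a minimal sketch that writes a tiny OTU table by hand and expands it to a dense matrix. The field names follow the BIOM 1.0 JSON layout as commonly documented. The OTU and sample identifiers, the counts, the metadata values, and the `sparse_to_dense` helper are all hypothetical, and nothing here is validated against the official BIOM tooling.

```python
import json

# Hypothetical two-OTU, three-sample table in a BIOM 1.0 style JSON layout.
# Identifiers, counts, and metadata are invented for illustration only.
biom_table = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "hand-written example",
    "date": "2013-12-01T00:00:00",
    "rows": [
        {"id": "OTU_1", "metadata": {"taxonomy": ["k__Bacteria", "p__Proteobacteria"]}},
        {"id": "OTU_2", "metadata": {"taxonomy": ["k__Bacteria", "p__Actinobacteria"]}},
    ],
    "columns": [
        {"id": "sample_A", "metadata": {"soil_depth": 5}},
        {"id": "sample_B", "metadata": {"soil_depth": 10}},
        {"id": "sample_C", "metadata": {"soil_depth": 20}},
    ],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [2, 3],
    # Sparse triplets: [row index, column index, count]. Zeros are omitted.
    "data": [[0, 0, 12], [0, 2, 3], [1, 1, 7]],
}


def sparse_to_dense(table):
    """Expand the sparse [row, col, value] triplets into a dense list of lists."""
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    return dense


# Round-trip through JSON to show that table and metadata travel in one file.
serialized = json.dumps(biom_table)
dense = sparse_to_dense(json.loads(serialized))
```

Only the non-zero counts are stored, which is why the format suits large, sparse contingency tables.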
  • 11. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 12. Who lives with whom, and why, and where? Data reduction is essential for: a) summarizing large numbers of observations into manageable numbers b) visualizing many interconnected variables in a compact manner Alpha diversity: species richness (and evenness) within a single sample Beta diversity: change in species composition across a collection of samples Gamma diversity: total species richness across an environmental gradient
  • 14. α-diversity: Richness and Evenness Shannon index (H’), Estimators (Chao1, ACE), Phylogenetic Diversity Shannon index (H’): richness and evenness Estimators: richness Faith’s PD: phylogenetic richness Stearns et al., 2011 Hughes et al., 2001
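The indices on slide 14 can be sketched in a few lines. This is a minimal illustration, assuming a single sample given as a list of per-OTU read counts; the function names are hypothetical, the Chao1 form shown is the bias-corrected variant (defined even when there are no doubletons), and Pielou's evenness is added only to show how richness and evenness separate. Faith's PD needs a tree and is not covered here.

```python
import math


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """Evenness J' = H' / ln(S), where S is the observed richness."""
    richness = sum(1 for c in counts if c > 0)
    if richness <= 1:
        return 0.0  # evenness is not meaningful for a single OTU
    return shannon_index(counts) / math.log(richness)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


even_sample = [10, 10, 10, 10]    # four equally abundant OTUs
skewed_sample = [5, 3, 1, 1, 2]   # two singletons, one doubleton
print(shannon_index(even_sample), pielou_evenness(even_sample))
print(chao1(skewed_sample))
```

Because Chao1 depends only on singletons and doubletons, it is sensitive to sequencing error and chimeras, which inflate exactly those rare classes.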
  • 15. “All biologists who sample natural communities are plagued with the problem of how well a sample reflects a community’s ‘true’ diversity.”
  • 16. Hughes et al. 2001 “Nonparametric estimators show particular promise for microbial data and in some habitats may require sample sizes of only 200 to 1,000 clones to detect richness differences of only tens of species.”
  • 17. [Figure: proportion of Google Scholar results for “[Sequencing tech] AND 16S” (Sanger, 454, Illumina) and “Rare biosphere” citations, plotted against time (year).] Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
  • 18. GOALS Understanding of community structure Better alpha-diversity measures Robust beta-diversity measures Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
  • 21. Clustering algorithms (influence alpha diversity primarily) CD-HIT (Li and Godzik, Sanford-Burnham Medical Research Institute) ‘longest-sequence-first’ removal algorithm Fast, many implementations (nucleotide, protein, OTU-specific) Tends to be more stringent than UCLUST UCLUST (R. Edgar, drive5.com) Faster than CD-HIT Tends to generate a larger number of low-abundance OTUs Broader range of clustering thresholds “I do not recommend using the UCLUST algorithm or CD-HIT for generating OTUs” – Robert Edgar
  • 23. CROP: Clustering 16S rRNA for OTU Prediction (CROP) “CROP can find clusters based on the natural organization of data without setting a hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”
  • 24. Chimeras DNA from two or more parent molecules PCR artifact Can easily be classified as a “novel” sequence Increases α-diversity Software ChimeraSlayer, Bellerophon, UCHIME, Pintail Reference database or de novo
  • 25. Classification and taxonomy Ribosomal Database Project (RDP) classifier Naïve Bayesian classifier (James Cole and Tiedje) http://rdp.cme.msu.edu/ pplacer Phylogenetic placement and visualization BLAST The tool we know and love RTAX (UC Berkeley, Rob Knight involved) http://dev.davidsoergel.com/trac/rtax/ mothur (Patrick Schloss) http://www.mothur.org/ SINA (SILVA)
  • 26. RDP classifier Large training sets require active memory management Can be easily run in parallel by breaking up very large data sets Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained) Algorithm: determine the probability that an unknown query sequence is a member of a known genus (training set), based on the profile of word subsets of known genera. Confidence estimation: the number of times in 100 trials that a genus was selected based on a random subset of words in the query Take home: The higher the diversity (bigger sequence space) of the training set, the better the assignment Longer query = better and more reliable assignment Short reads (i.e., <250 bases) will have lower confidence estimates (cutoff of 0.5 suggested)
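The word-based scoring and the bootstrap confidence estimate on slide 26 can be sketched as a toy naive Bayesian classifier. This is not the RDP implementation: the function names, the toy sequences, and the genus labels are hypothetical, the word length is shortened to 4 so the example stays readable (the RDP classifier uses 8-base words), and sampling `len(words) // k` words per trial is an assumption modeled on sampling one eighth of the 8-mers.

```python
import math
import random


def kmers(seq, k):
    """All overlapping words of length k in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(training, k):
    """training maps a genus name to a list of reference sequences."""
    word_counts = {}   # word -> number of training sequences containing it
    genus_counts = {}  # genus -> {word -> number of its sequences containing it}
    for genus, seqs in training.items():
        counts = genus_counts.setdefault(genus, {})
        for seq in seqs:
            for word in set(kmers(seq, k)):
                word_counts[word] = word_counts.get(word, 0) + 1
                counts[word] = counts.get(word, 0) + 1
    return {
        "k": k,
        "n_total": sum(len(seqs) for seqs in training.values()),
        "sizes": {genus: len(seqs) for genus, seqs in training.items()},
        "word_counts": word_counts,
        "genus_counts": genus_counts,
    }


def log_score(model, genus, words):
    """Sum of log P(word | genus), smoothed with a word-specific prior."""
    score = 0.0
    for word in words:
        prior = (model["word_counts"].get(word, 0) + 0.5) / (model["n_total"] + 1)
        cond = (model["genus_counts"][genus].get(word, 0) + prior) / (model["sizes"][genus] + 1)
        score += math.log(cond)
    return score


def classify(model, query, trials=100, seed=0):
    """Return (best genus, fraction of bootstrap trials agreeing with it)."""
    rng = random.Random(seed)
    words = kmers(query, model["k"])
    genera = list(model["genus_counts"])
    best = max(genera, key=lambda g: log_score(model, g, words))
    subset_size = max(1, len(words) // model["k"])
    hits = 0
    for _ in range(trials):
        sample = [rng.choice(words) for _ in range(subset_size)]
        pick = max(genera, key=lambda g: log_score(model, g, sample))
        hits += pick == best
    return best, hits / trials


toy_training = {
    "GenusA": ["ACGTACGTACGTACGTACGT"],
    "GenusB": ["TTGGCCAATTGGCCAATTGG"],
}
model = train(toy_training, k=4)
genus, confidence = classify(model, "ACGTACGTACGTACGT")
```

Short queries yield few words per bootstrap trial, so individual trials disagree more often, which is one way to see why short reads receive lower confidence estimates.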
  • 27. Database sources GreenGenes Latest May 2013 SILVA Latest 115 (August 2013) Includes 18S, 23S, 28S, LSU RDP Database Latest 11 (October 2013) GenBank Research-specific e.g., CORE Oral
  • 29. β-diversity Visualization (ordination) versus hypothesis testing (MRPP, indicator species analysis) Many more algorithms out there for exploration and statistical testing mostly through widely used R packages vegan (Community Ecology Package) labdsv (Ordination and Multivariate Analysis for Ecology) ape (Analyses of Phylogenetics and Evolution) picante (community analyses etc.)
  • 30. Visualization (ordination) Complementary to data clustering looks for discontinuities Ordination extracts main trends as continuous axes analysis of the square matrix derived from the OTU table Non-parametric, unconstrained ordination methods most widely used (and best suited) methods that can work directly on a square matrix An appropriate metric is required to derive this square matrix many options...
  • 31. Metrics Ordination is essentially reducing dimensionality first requirement: accurately model differences among samples Models are *really* important. Examples include: OTU presence/absence: Dice, Jaccard; OTU abundance: Bray-Curtis; Phylogenetic: UniFrac “all models are wrong, some are useful” - G.E. Box “You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.” - Susan Huse
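The count-based metrics named on slide 31 differ in what they consider. A minimal sketch, assuming each sample is a list of counts over the same ordered set of OTUs; the function names and the two toy samples are hypothetical, and all three functions return dissimilarities (0 = identical, 1 = nothing shared).

```python
def jaccard_dissimilarity(a, b):
    """Presence/absence: 1 - |shared OTUs| / |OTUs present in either sample|."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(present_a & present_b) / len(union)


def dice_dissimilarity(a, b):
    """Presence/absence: 1 - 2 * |shared OTUs| / (|OTUs in a| + |OTUs in b|)."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    if not present_a and not present_b:
        return 0.0
    return 1 - 2 * len(present_a & present_b) / (len(present_a) + len(present_b))


def bray_curtis(a, b):
    """Abundance-based: sum |a_i - b_i| / sum (a_i + b_i)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


sample_1 = [10, 0, 5, 5]
sample_2 = [5, 5, 0, 10]
print(jaccard_dissimilarity(sample_1, sample_2))
print(dice_dissimilarity(sample_1, sample_2))
print(bray_curtis(sample_1, sample_2))
```

Computing one of these for every pair of samples gives the square matrix that ordination methods take as input.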
  • 32. Metrics: UniFrac A distance measure comparing multiple communities using phylogenetic information Requires sequence alignment and tree-building PyNAST, MUSCLE, Infernal Time-consuming and susceptible to poor phylogenetic inference (does it matter?) Weighted (abundance) ecological features related to abundance Unweighted ecological features related to taxonomic presence/absence
  • 33. Ordination example 1 (of many): Principal Coordinates Analysis Classical Multidimensional Scaling (MDS; Gower 1966) Procedure: based on eigenvectors position objects in low-dimensional space while preserving distance relationships as well as possible highly flexible can choose among many association measures In microbial ecology, used for visualizing phylogenetic or count-based distances Consistent visual output for given distance matrix Include variance explained (%) on Axis 1 and 2
  • 34. Ordination example 2 (of many): Non-metric Multidimensional Scaling Ordination not based on eigenvectors Does not preserve exact distances among objects attempts to preserve ordering of samples (“ranks”) Procedure: iterative, tries to position the objects in a few (2-3) dimensions in such a way that minimizes the “stress” how well does the new ranked distribution of points represent the original distances in the association matrix? Can express as R^2 on axes 1 and 2. the adjustment goes on until the stress value reaches a local minimum (heuristic solution) NMDS often represents distance relationships better than PCoA in the same number of dimensions Susceptible to the “local minimum issue”, and therefore should have strong starting point (e.g., PCoA) or many permutations You won't get the same result each time you run the analysis. Try several runs until you are comfortable with the result.
  • 35. Do my treatments separate?
  • 36. Beta-diversity: Hypothesis testing Multiple methods, implemented in QIIME, mothur, AXIOME e.g., MRPP, adonis, NP-MANOVA (perMANOVA), ANOSIM Are treatment effects significant? Because these are predominantly nonparametric methods, tests for significance rely on testing by permutation Let's focus on MRPP
  • 37. Multiresponse Permutation Procedures Compare intragroup average distances with the average distances that would have resulted from all the other possible combinations T statistic: more negative with increasing group separation (T>-10 common for ecology) A statistic: Degree of scatter within groups (A=1 when all points fall on top of one another) p value: likelihood of similar separation with randomized data.
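The permutation logic on slide 37 can be sketched as follows. This is a simplified illustration, not the implementation in vegan, mothur, or QIIME: the function names and the toy data are hypothetical, groups are weighted by n_i/N, the expected delta is estimated from the random permutations rather than computed analytically, and the T statistic is omitted. Every group must contain at least two samples.

```python
import random
from itertools import combinations


def mrpp_delta(dist, labels):
    """Weighted mean of within-group average pairwise distances (delta)."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    delta = 0.0
    for members in groups.values():
        pairs = list(combinations(members, 2))
        mean_within = sum(dist[i][j] for i, j in pairs) / len(pairs)
        delta += (len(members) / len(labels)) * mean_within
    return delta


def mrpp(dist, labels, n_perm=999, seed=0):
    """Return (observed delta, A statistic, permutation p value)."""
    rng = random.Random(seed)
    observed = mrpp_delta(dist, labels)
    shuffled = list(labels)
    perm_deltas = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign samples to groups at random
        perm_deltas.append(mrpp_delta(dist, shuffled))
    expected = sum(perm_deltas) / n_perm
    a_stat = 1 - observed / expected  # chance-corrected within-group agreement
    # Small tolerance guards against floating-point ties.
    as_extreme = sum(1 for d in perm_deltas if d <= observed + 1e-12)
    p_value = (as_extreme + 1) / (n_perm + 1)
    return observed, a_stat, p_value


# Toy data: two well-separated treatments along a single axis.
points = [0, 1, 2, 3, 10, 11, 12, 13]
dist = [[abs(a - b) for b in points] for a in points]
labels = ["control"] * 4 + ["treated"] * 4
delta, a_stat, p_value = mrpp(dist, labels)
```

Any square distance matrix can be passed in, so the same test works with Bray-Curtis or UniFrac distances.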
  • 38. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 39. “PCoA plots are the first step of a community analysis, not the last.” Josh Neufeld
  • 40. Searching for species that matter High dimensional data often have too many features to investigate solution: identify and study species significantly associated with categorical metadata Indicator species (Dufrene-Legendre) calculates indicator value (fidelity and relative abundance) of species Permutation test for significance Need solution for sparse data - be wary of groups with small numbers of sites (influence on permutation tests) low abundance can artificially inflate indicator values
  • 42. IndVal (Dufrene & Legendre, 1997) Specificity Large mean abundance within group relative to the summed mean abundances across all groups Fidelity Presence in most or all sites of that group Groups defined a priori by metadata or statistical clustering
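The two components on slide 42 multiply to give the indicator value. A minimal sketch for a single species, assuming the Dufrene and Legendre (1997) formulation in which specificity is the group's mean abundance divided by the sum of mean abundances over all groups; the function name, group labels, and counts are hypothetical, values are left on a 0 to 1 scale rather than multiplied by 100, and the permutation test for significance is omitted.

```python
def indval(abundances, groups):
    """Indicator value of one species for each group of sites.

    abundances: abundance of the species at each site.
    groups: group label of each site, in the same order.
    """
    by_group = {}
    for value, group in zip(abundances, groups):
        by_group.setdefault(group, []).append(value)
    mean_abundance = {g: sum(v) / len(v) for g, v in by_group.items()}
    total_mean = sum(mean_abundance.values())
    result = {}
    for group, values in by_group.items():
        # Specificity: share of the species' mean abundance found in this group.
        specificity = mean_abundance[group] / total_mean if total_mean > 0 else 0.0
        # Fidelity: fraction of this group's sites where the species occurs.
        fidelity = sum(1 for v in values if v > 0) / len(values)
        result[group] = specificity * fidelity
    return result


counts = [4, 6, 0, 0, 0, 2]
site_groups = ["wet", "wet", "wet", "dry", "dry", "dry"]
values = indval(counts, site_groups)
```

With only three sites per group, a single occurrence shifts fidelity by a third, which illustrates the warning about groups with small numbers of sites.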
  • 43. Simple linear correlations
      Metadata | Taxon | R^2 value
      mbc | k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__ | 0.611368489781491
      mbc | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.677209935419981
      mbn | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.64092523702996
      soil_depth | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__ | 0.669761188668774
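An R^2 value of the kind listed on slide 43 can be computed as the squared Pearson correlation between a metadata variable and a taxon's abundance across samples. This is a minimal sketch; the function name and the numbers are invented for illustration and are not the data behind the slide.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        return 0.0  # a constant variable carries no linear signal
    return (s_xy * s_xy) / (s_xx * s_yy)


# Hypothetical values: a metadata variable and one taxon's abundance per sample.
metadata_values = [1, 2, 3, 4]
taxon_abundance = [2, 4, 5, 9]
print(r_squared(metadata_values, taxon_abundance))
```

Screening every taxon against every metadata variable in this way involves many tests, so high values should be treated as candidates for follow-up rather than as confirmed associations.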
  • 44. mothur: cooccurrence function, measuring whether populations are co-occurring more frequently than you would expect by chance.
  • 45. Non-negative Matrix Factorization NMF as a representation method for portraying high-dimensional data as a small number of taxonomic components. Patterns of co-occurring OTUs can be described by a smaller number of taxonomic components. Each sample represented by the collection of component taxa, helping identify relationships between taxa and the environment. Jonathan Dushoff, McMaster University, Ontario, Canada
  • 54. SILVA
  • 55. SILVA
  • 56. SILVA
  • 57. SILVA
  • 58. SILVA
  • 59. Nakai et al. 2012 Lynch et al. 2012
  • 60. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 61. Why pipelines? Merge and manage (many) disparate techniques Democratize analysis improve accessibility Accelerate pace of innovation, collaboration, and research
  • 62. Early synthesis Early synthesis for numerical microbial ecology Synthesis of 16S phylogenetics (Woese et al.) and Hughes (Counting the uncountable) Numerical ecology for microorganisms Algorithm development libshuff, dotur (mothur) Analysis pipelines QIIME, mothur
  • 63. Knight Lab, U. Colorado at Boulder Predominantly a collection of integrated Python/R scripts Many dependencies easy managed installation: qiime-deploy MacQIIME virtual box and Ubuntu fork avoid for anything but small runs Becoming the standard for marker gene studies integrated analysis and visualization easy access to broad computational biology toolbox (Python/R)
  • 64. Automation and extension AXIOME and phyloseq Extend existing technologies (QIIME, mothur, R, custom) Layers of abstraction Automation and rapid re-analysis Promote reproducible research (iPython, XML, make) Implement existing techniques (e.g., MRPP, Dufrene-Legendre IndVal) numerical microbial ecology needs to better incorporate modern statistical theory Develop and test new techniques
  • 67. Axiometic GUI companion for AXIOME Cross-platform New implementation in development Generates AXIOME file (XML) xls template coming soon for all commands, sample metadata, and extra info… much easier for everyone.
  • 68. “QIIME wraps many other software packages, and these should be cited if they are used. Any time you're using tools that QIIME wraps, it is essential to cite those tools.” http://qiime.org/index.html
  • 69. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 70. The future As data get bigger, interpretation should be “hands off” Move towards hypothesis testing of high-dimension taxonomic data Convergence on Galaxy e.g., QIIME in Galaxy is developing Further extension to cloud services e.g., Amazon EC2 Machine learning and data mining applications
  • 71. Open-source, web-based platform Deployed locally or in the cloud Ongoing development of 16S rRNA gene analysis
  • 73. “The advantages of having large numbers of samples at shallow coverage (~1,000 sequences per sample) clearly outweigh having a small number of samples at greater coverage for many datasets, suggesting that the focus for future studies should be on broader sampling that can reveal association with key biological parameters rather than on deeper sequencing.”
  • 74. “….even [phylogenetic beta-diversity] measures suited to the underlying mechanism of differentiation may require deep sequencing to reveal subtle patterns” Dr. Donovan Parks
  • 75. Method standardization Impossible. Data storage Sequence reads outpacing data storage costs Federated data? File formats e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient, does not ensure data is in correct format, no space for metadata, no absolute standard)… relational databases? Software Free and Open Source enables an experiment to be faithfully replicated Algorithms Memory! Many clustering and phylogenetic inference algorithms scale as n^2 Distributed, parallel, or cloud computing may not be helpful Metadata What to do with it? How to marry sequence and metadata sets? We need better metadata integration, not necessarily more/better metadata
  • 76. What should we be doing? (take-home messages) *Surveys are really important for spatial and temporal mapping *Hypothesis testing follows (or implicit) *What species account for treatment effects? *Who tracks with who? (why=function) *Who avoids who? *Are all microorganisms accounted for? (no) *How can we use this information to manipulate, manage and predict ecosystems?
  • 77. What should we be doing? (take-home messages) There is no “one way” to analyze 16S rRNA You need to build a pipeline for you. If this seems daunting, it is. If this is not daunting, your hands are dirty. It’s getting better all the tii-ime.