2. Welcome!
Your Tutorial Team:
Me (16S theory)
Mike Hall (16S practical)
Morgan Langille (metagenomics theory and practical)
Special thanks to:
Will Hsiao (CBW presentation)
4. Overview
Morning session
1. A brief history of molecules and microbes
2. Why 16S?
3. How 16S analysis is usually done
4. Assumptions
5. Hands-on practical
Afternoon session
1. 16S vs Metagenomics
2. Metagenome Taxonomic Composition
3. Metagenome Functional Composition
4. PICRUSt: Functional Inference
5. Hands-on practical
5. Learning objectives
At the end of the 16S tutorial, you should be able to do the following:
1. Run a simple QIIME analysis of a data set
(https://www.dropbox.com/s/kpte51nm17wav9o/stool_data.zip)
2. Interpret analysis results
3. Understand the limitations of the standard 16S analysis pipeline
6. Defining metagenomics
Microbiome: Attributed to Joshua Lederberg by Hooper and Gordon (2001):
“the collective genome of our indigenous microbes (microflora), the idea
being that a comprehensive genetic view of Homo sapiens as a life-form
should include the genes in our microbiome”
Also used to mean microbiota, the group of microorganisms found in a particular setting
(usage varies: be careful and precise!)
Metagenome: Handelsman et al. (1998) “…advances in molecular biology
and eukaryotic genomics, which have laid the groundwork for cloning and
functional analysis of the collective genomes of soil microflora, which we
term the metagenome of the soil.”
Does not encompass marker-gene surveys (e.g., 16S)
This report says it does.
7. Micro-what?
Metagenomics is often defined to encompass only Bacteria and Archaea
(and often Archaea are excluded too!)
Other small things to consider:
◦ Viruses / phages
◦ Microbial eukaryotes
◦ Worms (helminths, nematodes, …)
Lukeš et al. (2015) PLoS Pathogens
8. The dawn of metagenomics
3.5 BYA – the Archaean Eon
16S position 349 (-ish): ancestral state unknown (?); G in Archaea, A in Bacteria
11. 11
Yarza et al. (2014)
Escherichia coli ribosome (PDB 4YBB)
So much RNA!
12. Why 16S?
The “universal phylogenetic marker”
(1) Present in all living organisms
(2) Single copy* (no recombination)
(3) Highly conserved + highly variable regions
(4) Huge reference databases
19. Sample collection and DNA extraction
Defined protocols exist, and many kits are available (e.g. PowerSoil®)
Need to consider barriers to DNA recovery and PCR (e.g. humic acids from soil, bile salts from feces)
Additional mechanical approaches may be needed (e.g., lysis of tissues by bead beating)
Contaminant DNA from kits and the lab can end up in your sample: run negative controls!
◦ Example from [year redacted]: shocking finding of bacterial DNA in the
[location redacted]! However, [taxonomic group redacted] was a known
frequent contaminant of DNA extraction kits.
21. Choosing a PCR strategy
Need to consider:
◦ Correct melting temperature (60-65 °C for the Illumina protocol)
◦ DNA sequencing read length (influences choice of primers)
◦ Primer specificity!
◦ Comparability with previous studies?
[Good luck with that]
[but that’s what the Earth Microbiome Project protocol
http://www.earthmicrobiome.org/emp-standard-protocols/16s/
is meant to achieve]
22. Which variable regions to target?
V1-V3 favours Prevotella, Fusobacterium, Streptococcus, Granulicatella, Bacteroides,
Porphyromonas and Treponema
V4-V6 favours Streptococcus, Treponema, Prevotella, Eubacterium, Porphyromonas,
Campylobacter and Enterococcus.
◦ failed to detect Fusobacterium
V7-V9 favours Veillonella, Streptococcus, Eubacterium, Enterococcus, Treponema,
Catonella and Selenomonas.
◦ failed to detect Selenomonas, TM7 and Mycoplasma
23. At least there’s no shortage of options…
Detailed in silico evaluation of primers, experimental evaluation of two sets
Heavily biased recovery of Bacteria, Archaea, and missing groups depending on primer
choice.
“Out of the 175 primers and 512 primer pairs checked, only 10 can be recommended as
broad-range primers.”
25. Analysis
(examples mostly from QIIME)
1. Quality Control
◦ Error checking
2. Sample diversity
◦ Taxonomy agnostic
◦ Taxonomy aware
3. Similarity among samples
4. Associations with metadata/groups (ANOSIM, MRPP)
5. Machine-learning classification
6. Functional prediction
26. QIIME vs Mothur
QIIME | Mothur
A Python interface to glue together many programs | Single program with minimal external dependencies
Wrappers for existing programs | Reimplementation of popular algorithms
Large number of dependencies / VM available | Easy to install and set up; works best on a single multi-core server with lots of memory
More scalable | Less scalable
Steeper learning curve, but a more flexible workflow if you can write your own scripts | Easy to learn, but the workflow works best with built-in tools
http://www.ncbi.nlm.nih.gov/pubmed/24060131 | http://www.mothur.org/wiki/MiSeq_SOP
Will Hsiao
27. “Analysis” #1
Quality Control
Quality score filtering:
◦ Minimum length of consecutive high-quality bases (as % of total read length)
◦ Maximum number of consecutive low-quality bases
◦ Maximum number of ambiguous bases (N’s)
◦ Minimum Phred quality score
Other quality filtering tools available
◦ Cutadapt (https://github.com/marcelm/cutadapt)
◦ Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
◦ Sickle (https://github.com/najoshi/sickle)
Chimera checking:
◦ UCHIME
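A minimal Python sketch of the quality-score filters listed above, applied to a single read. The function names, the Phred+33 encoding, and every default threshold are illustrative assumptions, not QIIME's actual settings.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def longest_run(flags):
    """Length of the longest run of True values in a list of booleans."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def passes_quality_filter(seq, qual, min_phred=20, min_good_fraction=0.75,
                          max_bad_run=3, max_n=0):
    """Apply the four listed criteria to one read (thresholds are hypothetical)."""
    if not seq or len(seq) != len(qual):
        return False
    good = [score >= min_phred for score in phred_scores(qual)]
    bad = [not flag for flag in good]
    if seq.upper().count("N") > max_n:          # too many ambiguous bases
        return False
    if longest_run(bad) > max_bad_run:          # too long a low-quality stretch
        return False
    # longest high-quality stretch must cover enough of the read
    return longest_run(good) >= min_good_fraction * len(seq)
```

Dedicated tools such as those listed above also trim reads rather than only accepting or rejecting them; this sketch covers the accept/reject decision only.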
29. Analysis #2
Within-sample (“alpha”) diversity
To describe the diversity of a sample, you need to know what you are
counting!
Individual sequences?
◦ Most precise, but vulnerable to sequencing error effects – inflation of
diversity
Clusters of sequences?
◦ Operational taxonomic units (OTUs) – 97% sequence identity as the
“species” level of similarity
Taxonomic groups?
◦ It’s always reassuring to put names on things, but taxonomic labels can be
extremely misleading
30. OTU clustering
1. Choose a % identity threshold (e.g., 97%)
2. Calculate distances between sequences
3. Choose cluster centroids in some order (e.g., length, abundance); these are the reference sequences
4. Continue the procedure until all sequences are clustered into OTUs (singletons may be excluded)
[Figure labels: OTU; 6% distance between sequences]
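The centroid-based procedure on this slide can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of any particular tool: it assumes the sequences are already aligned to equal length, measures identity as the fraction of matching columns, and processes sequences in order of decreasing abundance. All function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def greedy_otu_clustering(seq_abundance, threshold=0.97):
    """Cluster sequences around centroids, most abundant sequence first.

    seq_abundance maps each unique aligned sequence to its read count.
    Returns a dict mapping each centroid to the list of its member sequences.
    """
    ordered = sorted(seq_abundance, key=lambda s: (-seq_abundance[s], s))
    otus = {}
    for seq in ordered:
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid].append(seq)   # join the first centroid close enough
                break
        else:
            otus[seq] = [seq]                # no match: seq founds a new OTU
    return otus
```

Because each sequence joins the first centroid it matches, the result depends on the processing order, which is one reason different clustering tools can give different OTUs from the same data.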
31. What’s in a name?
Bacteroides
Parabacteroides
Ruminococcus
???
???
???
???
Akkermansia
32. Taxonomic assignment
Many choices:
BLAST – assign taxonomic label of closest match (simple, possibly too simple)
Phylogenetic placement – e.g. Pplacer (Matsen et al., BMC Bioinformatics
2010)
Machine-learning classification, in particular Naïve Bayes e.g. RDP Classifier,
Wang et al. (2007) BMC Bioinformatics
33. Example RDP Classifier output
GD6JEAT01AYGPE Root rootrank 1.0 Bacteria domain 1.0
"Planctomycetes" phylum 1.0 "Planctomycetacia" class 1.0
Planctomycetales order 1.0 Planctomycetaceae family 1.0
Schlesneria genus 0.96
GD6JEAT01BEUG6 Root rootrank 1.0 Bacteria domain 1.0
Firmicutes phylum 0.32 Clostridia class 0.26
Clostridiales order 0.23 Ruminococcaceae family 0.22
Anaerotruncus genus 0.19
Includes bootstrap support
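The bootstrap values matter: the second read above is only confidently placed in Bacteria. A small Python sketch of how such output could be truncated at a confidence cut-off follows. It assumes tab-delimited fields (read ID, then name / rank / confidence triplets), and the 0.8 cut-off and the function name are illustrative choices rather than part of the RDP Classifier itself.

```python
def parse_rdp_line(line, min_confidence=0.8):
    """Return (read_id, lineage), keeping ranks until confidence drops below the cut-off.

    Assumes tab-delimited output: read ID followed by repeating
    (taxon name, rank, confidence) triplets. Empty or '-' columns are skipped.
    """
    fields = [f.strip().strip('"') for f in line.rstrip("\n").split("\t")]
    fields = [f for f in fields if f and f != "-"]
    read_id, rest = fields[0], fields[1:]
    lineage = []
    for i in range(0, len(rest) - 2, 3):
        name, rank, confidence = rest[i], rest[i + 1], float(rest[i + 2])
        if confidence < min_confidence:
            break                      # everything below this rank is unreliable
        lineage.append((rank, name))
    return read_id, lineage
```

Applied to the two example reads, the first keeps its full lineage down to genus Schlesneria, while the second stops at domain Bacteria.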
34. Calculating alpha diversity
OTU counts – richness only
Simpson index – probability of sampling two individuals of the same type
Phylogenetic diversity – sum of branch lengths
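The three measures above can be sketched in plain Python as follows. The data structures are assumptions made for illustration: `counts` maps OTU names to read counts, and the tree is given as a hypothetical `parent` dict mapping each node to its parent and the length of the branch leading to it (the root has no entry).

```python
def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def simpson_index(counts):
    """Probability that two reads drawn with replacement come from the same OTU."""
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


def phylogenetic_diversity(counts, parent):
    """Sum of branch lengths connecting all observed OTUs to the root.

    parent maps node -> (parent_node, branch_length); each branch is counted once.
    """
    visited = set()
    total = 0.0
    for otu, count in counts.items():
        node = otu
        while count > 0 and node in parent and node not in visited:
            visited.add(node)
            node_parent, length = parent[node]
            total += length
            node = node_parent
    return total
```

Note that a higher Simpson index as defined here means lower diversity, which is why it is often reported as 1 - D or 1/D.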
36. Analysis #3
Among-sample (“beta”) diversity
1. Perform pairwise comparisons between all samples to build a
dissimilarity matrix
2. Summarize the matrix based on major patterns of covariance
or hierarchical similarity
37. Analysis #3
Among-sample (“beta”) diversity
Given a pair of samples (described as e.g. OTU abundance), calculate
their dissimilarity
Beta-diversity measures can be:
◦ non-phylogenetic or phylogenetic
◦ weighted or unweighted
There are a lot of measures!
-Bray-Curtis (weighted, non-phylogenetic)
-Jaccard (unweighted, non-phylogenetic)
-Weighted UniFrac (weighted, phylogenetic)
-…
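To make the weighted/unweighted distinction concrete, here is a short Python sketch of the two non-phylogenetic measures named above. It assumes each sample is a dict of OTU counts; the function names are hypothetical, and in practice counts are usually normalized before Bray-Curtis is applied.

```python
def bray_curtis(a, b):
    """Weighted dissimilarity: 1 - 2 * (shared abundance) / (total abundance)."""
    otus = set(a) | set(b)
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total


def jaccard(a, b):
    """Unweighted dissimilarity: 1 - (OTUs in both) / (OTUs in either)."""
    present_a = {o for o, c in a.items() if c > 0}
    present_b = {o for o, c in b.items() if c > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)
```

Two samples containing the same two OTUs in proportions 90:10 and 10:90 have a Jaccard dissimilarity of 0 but a Bray-Curtis dissimilarity of 0.8, so the choice of measure can change the biological story.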
38. Analysis #3
Among-sample (“beta”) diversity
How similar are the results of different
measures?
CORRELATIONS between calculated
values
Parks and Beiko (2013): ISME J
39. Analysis #3
Among-sample (“beta”) diversity
What to do with a dissimilarity matrix?
Ordination: Yatsunenko et al. (2012) Nature
Clustering: Parks and Beiko (2012) Mol Biol Evol
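As one example of the clustering route, the following Python sketch performs average-linkage (UPGMA-style) hierarchical clustering directly on a dissimilarity matrix. It is a naive illustration written for readability, not the method used in the cited studies; the function name and the list-of-lists matrix format are assumptions.

```python
def average_linkage(labels, dist):
    """Hierarchical clustering of samples from a square, symmetric dissimilarity matrix.

    Returns (tree, heights): a nested tuple of sample labels and the
    average dissimilarity at which each merge happened.
    """
    # cluster id -> (subtree, indices of member samples)
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    next_id = len(labels)
    heights = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                members_x = clusters[ids[x]][1]
                members_y = clusters[ids[y]][1]
                # average of all pairwise dissimilarities between the two clusters
                d = sum(dist[i][j] for i in members_x for j in members_y)
                d /= len(members_x) * len(members_y)
                if best is None or d < best[0]:
                    best = (d, ids[x], ids[y])
        d, p, q = best
        tree_p, members_p = clusters.pop(p)
        tree_q, members_q = clusters.pop(q)
        clusters[next_id] = ((tree_p, tree_q), members_p + members_q)
        heights.append(d)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, heights
```

Running the same function on matrices computed with different beta-diversity measures is a quick way to see how much the resulting sample groupings depend on that choice.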
40. Analysis #3
Among-sample (“beta”) diversity
Different beta-diversity measures can
yield dramatically different clusters!
Parks and Beiko (2013): ISME J
41. Analysis #4
Associations with metadata
PERMANOVA: Permutational multivariate analysis of variance
ANOSIM: Rank-based analysis of similarity
Mantel test: Correlation between two distance matrices (e.g., community dissimilarity vs. environmental distance)
Good review: Anderson and Walsh (2013) Ecological Monographs
Example:
Weighted UniFrac distance: root compartment explains 46.62% of variance (PERMANOVA p<0.001)
Unweighted UniFrac: root compartment explains only 18.07% of variance (PERMANOVA p<0.001); soil type is more important
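The "% of variance explained" and p-value in this example come from a one-way PERMANOVA. A minimal Python sketch of that calculation is given below: sums of squares are computed from the dissimilarity matrix, and group labels are shuffled to obtain a p-value. The function names, the default of 999 permutations, and the fixed seed are illustrative assumptions; use an established statistics package for real analyses.

```python
import random


def pseudo_f(dist, groups):
    """Return (pseudo-F, R^2) for one grouping factor from a dissimilarity matrix."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i in range(n) if groups[i] == label]
        pairs = sum(dist[i][j] ** 2
                    for x, i in enumerate(members) for j in members[x + 1:])
        ss_within += pairs / len(members)
    ss_among = ss_total - ss_within
    f = (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))
    return f, ss_among / ss_total      # R^2 = fraction of variance explained


def permanova(dist, groups, permutations=999, seed=0):
    """Permutation test: how often does a shuffled grouping give F >= observed?"""
    rng = random.Random(seed)
    observed_f, r_squared = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled)[0] >= observed_f:
            hits += 1
    p_value = (hits + 1) / (permutations + 1)
    return observed_f, r_squared, p_value
```

With only a handful of samples the smallest achievable p-value is limited by the number of distinct label arrangements, so a tiny data set can never reach p<0.001.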
42. Analysis #5
Machine-learning classification
Identify aspects of community structure that are predictive of sample
attributes
Advantages of machine-learning approaches:
◦ Non-linear combinations of variables
◦ Data transformations
◦ Can accommodate many different representations of the data
Disadvantages:
◦ Complex, may “overfit”
◦ Can be time consuming
◦ Obfuscation of predictive rules
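One standard guard against the overfitting risk noted above is to evaluate a classifier only on samples it was not trained on. The sketch below shows leave-one-out cross-validation in Python. To stay dependency-free it uses a simple nearest-centroid classifier as a stand-in, which is not one of the methods named in this tutorial, and the function names and "lean"/"obese" toy labels are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose mean abundance profile is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # sorted() makes tie-breaking deterministic
    return min(sorted(centroids), key=lambda lab: squared_distance(centroids[lab], x))


def leave_one_out_accuracy(x, y):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)
```

For example, `leave_one_out_accuracy([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]], ["lean", "lean", "obese", "obese"])` scores each sample using a model that never saw it.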
43. Random forests
(supervised_learning.py)
“…there are only weak and, for the most part, non-significant associations of
particular taxa or overall diversity with the obese human gut that hold true across
different studies. However, using supervised learning with receiver operator
curves to maximize sensitivity and specificity, one can categorize subjects
according to lean and obese states with in some cases considerable accuracy…”
44. Tree-based classifications
Nested clade analysis
and feature selection
Classification of plaque samples
using support vector machines
Ning and Beiko (2015): Microbiome
47. Do not assume that
#1: 16S is an effective proxy for microbial diversity.
#2: All 16S studies are created equal, with results that are comparable.
#3: Rarefaction is a good idea.
#4: 16S OTUs describe ecologically cohesive units (“species”?).
#5: The 16S tree is the “Tree of Life”.
48. Assumption #1
16S is an effective proxy for microbial diversity.
rrnDB: Stoddard et al. NAR (2014)
Estimating copy number: Kembel et al. (2012) and PICRUSt (coming up later)
Variation: Coenye and Vandamme (2003)
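Why copy number matters can be shown with a small Python sketch: read counts are divided by each taxon's estimated 16S copy number and then renormalized. The function name and the copy numbers in the usage note are hypothetical; this is not the specific procedure of Kembel et al. or PICRUSt.

```python
def correct_for_copy_number(otu_counts, copy_number):
    """Convert 16S read counts to relative abundances adjusted for gene copy number.

    otu_counts: OTU -> read count; copy_number: OTU -> estimated 16S copies per genome.
    """
    adjusted = {otu: count / copy_number[otu] for otu, count in otu_counts.items()}
    total = sum(adjusted.values())
    return {otu: value / total for otu, value in adjusted.items()}
```

With counts of 70 and 10 reads for two taxa carrying 7 and 1 copies respectively, the uncorrected profile is 87.5% vs 12.5%, but the corrected profile is 50% vs 50%.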
49. Assumption #1
16S is an effective proxy for microbial
diversity.
Alternative marker genes: cpn60, rpoB, …
Smaller reference databases!
Protein-coding genes!
50. Assumption #2
All 16S studies are created equal.
Effects of sequencing platform, V region, amplicon vs metagenomics
Tremblay et al. (2015)
Front Microbiol
51. Assumption #3
Rarefaction is a good idea.
Example of statistics before and after rarefaction:
Loss of statistical power
Random subsampling can increase false-positive differences
Arbitrary minimum library size chosen for downsampling
Alternatives, e.g. negative binomial model fitting (DESeq2)
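For reference, rarefaction itself is just random subsampling without replacement to a fixed depth, as in this Python sketch. The function name, the `seed` parameter, and the choice to return `None` for samples below the chosen depth are illustrative assumptions.

```python
import random


def rarefy(counts, depth, seed=None):
    """Randomly subsample reads without replacement down to `depth`.

    Returns a new OTU -> count dict, or None when the sample has fewer than
    `depth` reads (such samples are discarded entirely).
    """
    total = sum(counts.values())
    if depth > total:
        return None
    # one pool entry per read, labelled by its OTU
    pool = [otu for otu, count in counts.items() for _ in range(count)]
    drawn = random.Random(seed).sample(pool, depth)
    rarefied = {otu: 0 for otu in counts}
    for otu in drawn:
        rarefied[otu] += 1
    return rarefied
```

Both sources of lost information are visible here: reads beyond `depth` are thrown away, and whole samples are dropped when they fall below the arbitrary minimum library size.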
McMurdie and Holmes (2014) PLoS Comp Biol
52. Assumption #4
16S OTUs describe ecologically cohesive units.
[Figure: distributions of sequence similarity and branch lengths; dashed line = OTU threshold]
Nguyen et al. (2016) npj Biofilms and Microbiomes
53. Assumption #4
16S OTUs describe ecologically cohesive units.
Hall et al., in preparation
Same OTU, different temporal patterns
54. Assumption #4
16S OTUs describe ecologically cohesive units.
Many alternatives exist, including Swarm: Mahé et al. (2015) PeerJ
55. Assumption #5
The 16S tree is the “Tree of Life”.
16S is limited for several reasons:
◦ Limited resolving power
◦ Subject to compositional bias
◦ Subject to recombination and lateral transfer
◦ Models typically applied to protein-coding genes do not make sense for noncoding RNA
57. Multi-omics??
16S can profile the biodiversity of a microbial sample…
But we need the metagenome to shine a light on function…
The metatranscriptome tells us what is expressed under specific
conditions…
And the metaproteome can quantify the relative abundance of different
enzymes…
While the metametabolome focuses on the products of metabolism.
What do we really need?