Jonathan Eisen @phylogenomics talk for #LAMG12
1. Phylogenomic Approaches to the
Study of Microbial Diversity
September 16, 2012
Lake Arrowhead Microbial Genomes
#LAMG12
Jonathan A. Eisen
University of California, Davis
@phylogenomics
Sunday, September 16, 12
2. A Bit of History
• For the real story about the Lake
Arrowhead Microbial Genomes meetings
see http://tinyurl.com/LAMG12
• But the key to LAMG meetings are ...
9. Quotes
• Space-time continuum of genes and
genomes
• Microbes not only have a lot of sex, they
have a lot of weird sex
• Gene sequences are the wormhole that
allows one to tunnel into the past
• This is how you do metagenomics on 50
dollars, and that’s Canadian dollars
• The human guts are a real milieu of stuff
• Antibiotics do not kill things, they corrupt
them
15. Quotes
• There comes a point in life when you have
to bring chemists into the picture
• The rectal swabs are here in tan color
• If I have time I will tell you about a dream
• "Another thing you need to know" (pause)
"Actually you don't NEED to know any of
this"
• I have been influenced by Fisher Price
throughout my life
• This is going to be ironic coming from
someone who studies circumcision
19. Quotes
• And we will bring out the unused cheese
from yesterday
• A paper came out next year
• It takes 1000 nanobiologists to make one
microbiologist
• In an engineering sense, the vagina is a
simple plug flow reactor
20. Phylogenomic Approaches to
Studying Microbial Diversity
Example 1:
Phylotyping
and
Phylogenetic Diversity
21. rRNA Phylotyping
[Figure: DNA extraction → PCR (makes lots of copies of the rRNA genes in sample) → sequence rRNA genes → sequence alignment = data matrix]
Sequences:
rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2 5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’
Sequence alignment = Data matrix:
rRNA1   A C A C A C
rRNA2   T A C A G T
rRNA3   C A C T G T
rRNA4   C A C A G T
E. coli A G A C A G
Humans  T A T A G T
Yeast   T A C A G T
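As a rough illustration of why the alignment is called a data matrix, the rows can be compared column by column. The sketch below is not from the talk: it assumes a simple p-distance (fraction of differing columns), and the helper names `p_distance` and `closest_reference` are hypothetical. Real phylotyping would build a tree from a proper substitution model rather than use raw mismatch counts.

```python
from itertools import combinations

# Rows copied from the slide's data matrix (six aligned columns each).
alignment = {
    "rRNA1": "ACACAC",
    "rRNA2": "TACAGT",
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "E. coli": "AGACAG",
    "Humans": "TATAGT",
    "Yeast": "TACAGT",
}


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Pairwise distances for every pair of rows, keyed in the order listed above.
dist = {
    (s, t): p_distance(alignment[s], alignment[t])
    for s, t in combinations(alignment, 2)
}


def closest_reference(query: str, references: list[str]) -> str:
    """Return the reference row with the smallest p-distance to the query."""
    return min(references, key=lambda r: p_distance(alignment[query], alignment[r]))


references = ["E. coli", "Humans", "Yeast"]
for query in ["rRNA1", "rRNA2", "rRNA3", "rRNA4"]:
    print(query, "->", closest_reference(query, references))
```

With only six columns the distances are coarse, which is one reason the slides move on to trees built from full alignments.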
30. rRNA Phylotyping
[Figure: "Just Phylogeny" approach; tree with reference taxa E. coli, Humans, Yeast]
31. rRNA Phylotyping
[Figure: two approaches. "Cluster": sequences A, B, C grouped into OTUs (OTU1 to OTU4). "Just Phylogeny": sequences A, B, C and OTU1 to OTU4 shown on a tree with E. coli, Humans, Yeast]
32. rRNA Phylotyping
• OTUs
• Taxonomic lists
• Relative abundance of taxa
• Ecological metrics (alpha and beta diversity)
• Phylogenetic metrics
• Binning
• Identification of novel groups
• Clades
• Rates of change
• LGT
• Convergence
• PD
• Phylogenetic ecology (e.g., Unifrac)
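To make the "ecological metrics" bullet concrete, here is a minimal sketch of one alpha-diversity measure (Shannon index) and one beta-diversity measure (Bray-Curtis dissimilarity). The choice of these two metrics and the OTU counts are assumptions for illustration; the slide does not name specific indices.

```python
import math


def shannon(counts) -> float:
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)


def bray_curtis(a: dict, b: dict) -> float:
    """Bray-Curtis dissimilarity between two samples given as OTU -> count."""
    otus = set(a) | set(b)
    differences = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    return differences / (sum(a.values()) + sum(b.values()))


# Hypothetical OTU tables for two samples (counts are made up).
sample_a = {"OTU1": 10, "OTU2": 10, "OTU3": 0, "OTU4": 0}
sample_b = {"OTU1": 5, "OTU2": 5, "OTU3": 5, "OTU4": 5}

alpha_a = shannon(sample_a.values())   # two equally abundant OTUs
alpha_b = shannon(sample_b.values())   # four equally abundant OTUs
beta_ab = bray_curtis(sample_a, sample_b)
print(alpha_a, alpha_b, beta_ab)
```

Neither metric uses the tree; the phylogenetic metrics listed on the slide (PD, UniFrac) add branch lengths to the same kind of table.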
34. What’s New in Phylotyping I
• More PCR products
• Deeper sequencing
• The rare biosphere
• Relative abundance estimates
• More samples (with barcoding)
• Time series
• Spatially diverse sampling
• Fine scale sampling
35. Beta-Diversity
[Slide shows overlapping page images from the paper cited here; the recoverable text follows.]
Martiny JBH, Eisen JA, Penn K, Allison SD, Horner-Devine MC (2011) Drivers of bacterial β-diversity depend on spatial scale. PNAS 108(19): 7850–7854. doi:10.1073/pnas.1016308108
β-Diversity, and therefore distance-decay patterns, could be driven solely by differences in environmental conditions across space, a hypothesis summed up by microbiologists as “everything is everywhere, the environment selects” (10). Under this model, a distance-decay curve is observed because environmental variables tend to be spatially autocorrelated, and organisms with differing niche preferences are selected from the available pool of taxa as the environment changes with distance.
Dispersal limitation can also give rise to β-diversity, as it permits historical contingencies to influence present-day biogeographic patterns.
Across all samples, we identified 4,931 quality Nitrosomonadales sequences, which grouped into 176 OTUs (operational taxonomic units) using an arbitrary 99% sequence similarity cutoff. This cutoff retained a high amount of sequence diversity, but minimized the chance of including diversity because of sequencing or PCR errors.
Pairwise community similarity between the samples was calculated based on the presence or absence of each OTU using a rarefied Sørensen’s index (4). Community similarity using this incidence index was highly correlated with the abundance-based Sørensen index (Mantel test: ρ = 0.9239; P = 0.0001) (21).
A plot of community similarity versus geographic distance for each pairwise set of samples revealed that the Nitrosomonadales display a significant, negative distance-decay curve (slope = −0.08, ...
Geographic distance contributed the largest partial regression coefficient (b = 0.40, P < 0.0001), with sediment moisture, nitrate concentration, plant cover, salinity, and air and water temperature contributing smaller, but significant, partial regression coefficients (b = 0.09–0.17, P < 0.05) (Table 1).
Fig. 1. The 13 marshes sampled (see Table S1 for details). Marshes compared with one another within regions are circled. (Inset) The arrangement of sampling points within marshes. Six points were sampled along a 100-m transect, and a seventh point was sampled ∼1 km away. Two marshes in the Northeast United States (outlined stars) were sampled more intensively, along four 100-m transects in a grid pattern.
Fig. 2. Distance-decay curves for the Nitrosomonadales communities. The dashed, blue line denotes the least-squares linear regression across all spatial scales. The solid lines denote separate regressions within each of the three spatial scales: within marshes, regional (across marshes within regions circled in Fig. 1), and continental (across regions). The slopes of all lines (except the solid light blue line) are significantly less than zero. The slopes of the solid red lines are significantly different from the slope of the all scale (blue dashed) line.
Data deposition: GenBank accession nos. HQ271472–HQ276885.
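The two quantities named in the excerpt, Sørensen similarity and a distance-decay slope, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's analysis: the index here is the plain (not rarefied) incidence-based Sørensen index, the regression is ordinary least squares of similarity on log10 distance, and the four samples and their OTU sets are hypothetical.

```python
import math
from itertools import combinations


def sorensen(a: set, b: set) -> float:
    """Incidence-based Sørensen similarity: 2 * shared / (|a| + |b|)."""
    if not a and not b:
        raise ValueError("at least one sample must contain an OTU")
    return 2 * len(a & b) / (len(a) + len(b))


def ols_slope(xs, ys) -> float:
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical samples: (position along a transect in metres, OTUs observed).
samples = [
    (0.0, {"otu1", "otu2", "otu3", "otu4"}),
    (10.0, {"otu1", "otu2", "otu3", "otu5"}),
    (100.0, {"otu1", "otu2", "otu6", "otu7"}),
    (1000.0, {"otu1", "otu8", "otu9", "otu10"}),
]

log_distance, similarity = [], []
for (pos_a, otus_a), (pos_b, otus_b) in combinations(samples, 2):
    log_distance.append(math.log10(abs(pos_a - pos_b)))
    similarity.append(sorensen(otus_a, otus_b))

# A negative slope means similarity decays as samples get farther apart.
slope = ols_slope(log_distance, similarity)
print(round(slope, 3))
```

Because pairwise comparisons are not independent, significance is usually assessed with a permutation approach such as the Mantel test mentioned in the excerpt, not with the ordinary regression p-value.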
39. Things You Could Do
• Mississippi River: 2320 miles long
• 1 site / mile
• 3 samples / site
• 6960 samples
• rRNA PCR w/ barcodes
• metagenomics w/ barcodes
• MiSeq run:
• 30 million sequence reads
• 4310 sequences / sample
• HiSeq 2000
• 6 billion sequence reads
• 862,068 sequences / sample
40. Things You Could Do
• Mississippi River: 12,249,600 feet long
• 1 site / 500 feet
• 3 samples / site
• 73497 samples
• rRNA PCR w/ barcodes
• metagenomics w/ barcodes
• MiSeq run:
• 30 million sequence reads
• 408 sequences / sample
• HiSeq 2000
• 6 billion sequence reads
• 81,635 sequences / sample
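The arithmetic on the two "Things You Could Do" slides can be checked with a short sketch. It assumes, as the slides implicitly do, that every sample is barcoded into a single run and that reads are split evenly across samples; the function name `sampling_plan` is hypothetical.

```python
def sampling_plan(river_length: int, site_spacing: int,
                  samples_per_site: int, reads_per_run: int) -> tuple[int, int]:
    """Return (number of samples, reads per sample), both rounded down.

    river_length and site_spacing must be in the same unit.
    """
    samples = river_length * samples_per_site // site_spacing
    return samples, reads_per_run // samples


MISEQ_READS = 30_000_000       # reads per run, as given on the slide
HISEQ_READS = 6_000_000_000    # reads per run, as given on the slide

# One site per mile along 2320 miles.
print(sampling_plan(2320, 1, 3, MISEQ_READS))        # (6960, 4310)
print(sampling_plan(2320, 1, 3, HISEQ_READS))        # (6960, 862068)

# One site per 500 feet along 2320 * 5280 = 12,249,600 feet.
print(sampling_plan(12_249_600, 500, 3, MISEQ_READS))  # (73497, 408)
print(sampling_plan(12_249_600, 500, 3, HISEQ_READS))  # (73497, 81635)
```

In practice read counts per barcode are uneven, so the per-sample figures are upper-end planning numbers, not guarantees.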
41. What’s New in Phylotyping II
• Metagenomics avoids biases of rRNA PCR
[Figure: shotgun sequence]
42. Metagenomic Phylotyping
[Figure: two approaches. "Cluster": sequences A, B, C grouped into OTUs (OTU1 to OTU4). "Just Phylogeny": sequences A, B, C and OTU1 to OTU4 shown on a tree with E. coli, Humans, Yeast]
47. Method 1: Each is an island
• Build alignment, models, trees for full length seqs
• Analyze fragmented reads one at a time
50. STAP
Each sequence analyzed separately
[Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001]
[Excerpt: ... STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment ...]
[Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002]
Wu et al. 2008 PLoS One
51. AMPHORA
[Figure label: Guide tree]
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
52. Phylotyping w/ Proteins
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
55. Method 2: Most in family
[Figure: rows of x's representing sequences that cover different, partly overlapping regions]
One tree for those w/ overlap
56. rRNA in Sargasso Metagenome
Venter et al., Science 304: 66. 2004
57. RecA Phylotyping in Sargasso Data
Venter et al., Science 304: 66. 2004
58. Sargasso Phylotyping
[Figure: bar chart "Sargasso Phylotypes". Y-axis: Weighted % of Clones (0 to 0.500). X-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA]
Venter et al., Science 304: 66. 2004
59. STAP, QIIME, Mothur
Combine all into one alignment
[Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001]
60. WATERS
[Slide shows page images from the paper cited here; the recoverable text follows.]
To this end, we have built an automated, user-friendly, workflow-based system called WATERS: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences (Fig. 1). In addition to being automated and simple to use, because WATERS is executed in the Kepler scientific workflow system (Fig. 2) it also has the advantage that it keeps track of the data lineage and provenance of data products [23,24].
Automation: The primary motivation in building WATERS was to minimize the technical, bioinformatics challenges that arise when performing DNA sequence clustering, ...
Broadening the user base: A second motivation and perspective is that by minimizing the technical difficulty of 16S rDNA analysis through the use of WATERS, we aim to make the analysis of these datasets more widely available and allow individuals with ...
[Figure 1 box labels: Align; Check chimeras; Cluster; Build Tree; Assign Taxonomy; Tree w/ Taxonomy; Diversity statistics & graphs; OTU table; Unifrac files; Cytoscape network]
Figure 1 Overview of WATERS. Schema of WATERS where white boxes indicate "behind the scenes" analyses that are performed in WATERS. Quality control files are generated for white boxes, but not otherwise routinely analyzed. Black arrows indicate that metadata (e.g., sample type) has been overlaid on the data for downstream interpretation. Colored boxes indicate different types of results files that are generated for the user for further use and biological interpretation. Colors indicate different types of WATERS actors from Fig. 2 which were used: green, Diversity metrics, WriteGraphCoordinates, Diversity graphs; blue, Taxonomy, BuildTree, Rename Trees, Save Trees; CreateUnifrac; yellow, CreateOtuTable, CreateCytoscape, CreateOTUFile; white, remaining unnamed actors.
Figure 2 Screenshot of WATERS in Kepler software. Key features: the library of actors un-collapsed and displayed on the left-hand side, the input and output paths where the user declares the location of their input files and desired location for the results files. Each green box is an individual Kepler actor that performs a single action on the data stream. The connectors (black arrows) direct and hook up the actors in a defined sequence. Double-clicking on any actor or connector allows it to be manipulated and re-arranged.
Hartman et al. 2010. W.A.T.E.R.S.: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences. BMC Bioinformatics 2010, 11:317 doi:10.1186/1471-2105-11-317
61. One Major Issue with rRNA
• Copy number varies greatly between taxa
• Can lead to significant errors in estimates
of relative abundance from numbers of
reads
62. Kembel Correction
Kembel, Wu, Eisen, Green. In press.
PLoS Computational Biology.
Incorporating 16S gene copy number
information improves estimates of
microbial diversity and abundance
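To show why copy number matters for abundance estimates, here is a minimal sketch that divides read counts by per-taxon 16S copy number before normalizing. It is not the method from the paper cited on this slide: it simply assumes the copy numbers are already known, and the taxa, counts, and function name are hypothetical.

```python
def copy_number_corrected_abundance(read_counts: dict, copy_numbers: dict) -> dict:
    """Relative abundance of cells after dividing reads by 16S copies per genome."""
    per_cell = {taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts}
    total = sum(per_cell.values())
    return {taxon: value / total for taxon, value in per_cell.items()}


# Hypothetical community: taxon A carries 7 rRNA operons, taxon B carries 1.
reads = {"A": 700, "B": 100}
copies = {"A": 7, "B": 1}

naive = {taxon: n / sum(reads.values()) for taxon, n in reads.items()}
corrected = copy_number_corrected_abundance(reads, copies)

print(naive)      # A appears to dominate: {'A': 0.875, 'B': 0.125}
print(corrected)  # equal numbers of cells: {'A': 0.5, 'B': 0.5}
```

For most environmental taxa the copy number is not known directly, which is the harder part of the problem and the part this sketch leaves out.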
63. Method 3: All in the family
66. rRNA analysis
[Figure: two approaches. "Cluster": sequences A, B, C grouped into OTUs (OTU1 to OTU4). "Just Phylogeny": sequences A, B, C and OTU1 to OTU4 shown on a tree with E. coli, Humans, Yeast]
67. PhylOTU
[Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in ... workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001]
[Excerpt: ... alignment used to build the profile, resulting in a multiple sequence alignment of full-length reference sequences and metagenomic reads. The final step of the alignment process is a ...]
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
68. RecA, RpoB in GOS
[Figure labels: GOS 1, GOS 2, GOS 3, GOS 4, GOS 5]
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
69. Phylosift/ pplacer
Aaron Darling, Guillaume Jospin, Holly Bik, Erik Matsen, Eric
Lowe, and others
71. Method 4: All in the genome
72. Multiple Genes?
A single tree with everything?
73. Kembel Combiner
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS
ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
74. Kembel Combiner
[Slide shows a page image, cut off at the right margin, describing weighted UniFrac; the recoverable text follows.]
... typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).
Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both a...
FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS
ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
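The two UniFrac measures described in the FIG. 1 caption can be sketched on a toy tree. The branch list below is hypothetical (it is not the tree in the figure), each branch is stored as its length plus the number of descendant sequences from each environment, and the weighted value is the raw, un-normalized form; these representation choices are assumptions made for brevity.

```python
from typing import NamedTuple


class Branch(NamedTuple):
    length: float
    squares: int   # descendant sequences from the "square" environment
    circles: int   # descendant sequences from the "circle" environment


# Hypothetical tree with three square and three circle sequences.
branches = [
    Branch(1.0, 1, 0),   # leaf, square
    Branch(1.0, 1, 0),   # leaf, square
    Branch(1.0, 2, 0),   # internal branch above the two square leaves
    Branch(1.0, 0, 1),   # leaf, circle
    Branch(0.5, 1, 0),   # leaf, square
    Branch(0.5, 0, 1),   # leaf, circle
    Branch(0.5, 1, 1),   # internal branch shared by both environments
    Branch(1.0, 1, 2),   # internal branch shared by both environments
]


def unweighted_unifrac(tree: list[Branch]) -> float:
    """Fraction of branch length leading to only one of the two environments."""
    unique = sum(b.length for b in tree if (b.squares > 0) != (b.circles > 0))
    observed = sum(b.length for b in tree if b.squares > 0 or b.circles > 0)
    return unique / observed


def weighted_unifrac(tree: list[Branch], total_squares: int, total_circles: int) -> float:
    """Raw weighted UniFrac: sum of length * |relative abundance difference|."""
    return sum(
        b.length * abs(b.squares / total_squares - b.circles / total_circles)
        for b in tree
    )


unweighted = unweighted_unifrac(branches)
weighted = weighted_unifrac(branches, total_squares=3, total_circles=3)
print(unweighted, weighted)
```

Dividing counts by each environment's total is the normalization the caption refers to: a branch with the same share of square and circle sequences contributes nothing to the weighted value.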
75. Uses of Phylogeny
in Genomics and Metagenomics
Example 2:
Functional Diversity and
Functional Predictions
76. Phylogenomics
PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: the method applied to EXAMPLE A and EXAMPLE B. Steps: CHOOSE GENE(S) OF INTEREST → IDENTIFY HOMOLOGS → ALIGN SEQUENCES → CALCULATE GENE TREE → OVERLAY KNOWN FUNCTIONS ONTO TREE → INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. A final panel shows the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN), including a duplication, for Species 1, Species 2, Species 3.]
Based on Eisen, 1998 Genome Res 8: 163-167.
78. Improving Functional Predictions
• Same methods discussed for phylotyping
improve phylogenomic functional
prediction for protein families
• Increase in sequence diversity helps too
79. Phylosift/ pplacer
Aaron Darling, Guillaume Jospin, Holly Bik, Erik Matsen, Eric
Lowe, and others
81. Wu et al. 2005 PLoS Genetics 1: e65.
82. NMF in Metagenomes
Characterizing the niche-space distributions of components
[Figure: three aligned matrices, (a), (b), (c), with one row per ocean sampling site (sites labeled GS001c to GS149 from the Sargasso Sea, North American East Coast, Caribbean Sea, Eastern Tropical Pacific, Galapagos Islands, Polynesia Archipelagos, and Indian Ocean). Columns in (a): Component 1 to Component 5. Columns in (c): Chlorophyll, Salinity, Temperature, Water Depth, Sample Depth, Insolation. Legends: General (High, Medium, Low, NA); Water depth (>4000m, 2000–4000m, 900–2000m, 100–200m, 20–100m, 0–20m).]
Figure 3: a) Niche-space distributions for our five components (H^T); b) the site-similarity matrix (Ĥ^T Ĥ); c) environmental variables for the sites. The matrices are aligned so that the same row corresponds to the same site in each matrix. Sites are ordered by applying spectral reordering to the similarity matrix (see Materials and Methods). Rows are aligned across the three matrices.
Functional biogeography of ocean microbes revealed through non-negative matrix factorization. Jiang et al. In press PLoS One. Comes out 9/18.
w/ Weitz, Dushoff, Langille, Neches, Levin, etc
83. Uses of Phylogeny
in Genomics and Metagenomics
Example 3:
Selecting Organisms for Study
84. GEBA
http://www.jgi.doe.gov/programs/GEBA/pilot.html
85. GEBA
THAT IS SO LAMG10
http://www.jgi.doe.gov/programs/GEBA/pilot.html
86. How To Keep Up?
• IMG
• Genomes Online
• MicrobeDB
• http://github.com/mlangill/microbedb/
• Langille MG, Laird MR, Hsiao WW, Chiu TA, Eisen
JA, Brinkman FS. MicrobeDB: a locally
maintainable database of microbial genomic
sequences. Bioinformatics. 2012 28(14):1947-8.
91. Sifting Families
[Figure: SFams workflow, panels A to C. Box labels: Representative Genomes; New Genomes; Extract Protein Annotation; All v. All BLAST; Homology Clustering (MCL); Screen for Homologs; Align & Build HMMs; SFams; HMMs]
Sharpton et al. submitted Figure 1
92. Zorro - Automated Masking
[Figure: Distance to True Tree (y-axis, 0.0 to 9.0) versus Sequence Length (x-axis: 200, 400, 800, 1600, 3200) for three treatments: no masking, zorro, gblocks]
Wu M, Chatterji S, Eisen JA (2012) Accounting For Alignment Uncertainty in Phylogenomics. PLoS ONE 7(1): e30288. doi:10.1371/journal.pone.0030288