A phylogeny-driven genomic encyclopedia of bacteria and archaea
1. A phylogeny-driven genomic encyclopedia of bacteria and archaea
Jonathan A. Eisen
Talk at Stanford University
April 17, 2010
Saturday, April 24, 2010
6. rRNA Tree of Life
Based on tree by Norm Pace
7. The Tree is not Happy
Based on tree by Norm Pace
8. As of 2002
• At least 40 phyla of bacteria
[Figure: tree of bacterial phyla, based on Hugenholtz, 2002. Tip labels: Proteobacteria, TM6, OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi, Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria, Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres, Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes, Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7, Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria, Thermotogae, OP1, OP11]
9. As of 2002
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
[Figure: tree of bacterial phyla, Proteobacteria through OP11; based on Hugenholtz, 2002]
10. As of 2002
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
[Figure: tree of bacterial phyla, Proteobacteria through OP11; based on Hugenholtz, 2002]
11. As of 2002
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Archaea
[Figure: tree of bacterial phyla, Proteobacteria through OP11; based on Hugenholtz, 2002]
12. As of 2002
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
[Figure: tree of bacterial phyla, Proteobacteria through OP11; based on Hugenholtz, 2002]
13. Filling in the Genomic Phylogenetic Gaps
• Common approach within some eukaryotic
groups
• Many small projects funded to fill in some
bacterial or archaeal gaps
• Phylogenetic gaps in bacterial and archaeal
projects commonly lamented in literature
14. [Tree of bacterial phyla]
• NSF-funded Tree of Life Project
• A genome from each of eight phyla
• Eisen & Ward, PIs
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Solution I: sequence more phyla
[Figure: tree of bacterial phyla, Proteobacteria through OP11]
16. Bacterial aTOL Project AIMS
• Improve resolution of deep branches in the
bacterial tree
• Launch biological studies of these phyla
• Leverage data for interpreting
environmental surveys
21. [Tree of bacterial phyla]
• At least 100 phyla of bacteria
• Genome sequences are mostly from three phyla
• Most phyla with cultured species are sparsely sampled
• Lineages with no cultured taxa even more poorly sampled
• Solution: use the tree to really fill gaps
[Figure: tree of bacterial phyla, Proteobacteria through OP11. Legend: Well sampled phyla]
23. GEBA Pilot Project Overview
• Identify major branches in rRNA tree for
which no genomes are available
• Identify a cultured representative for each
group
• Grow > 200 of these and prep. DNA
• Sequence and finish 100
• Annotate, analyze, release data
• Assess benefits of tree guided sequencing
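Note: the first step above (finding branches of the tree with no sequenced genome) can be sketched in a few lines. This is a minimal illustration, not the procedure the GEBA project used: the nested-tuple tree format, the function names, and the toy taxa are all assumptions made up for the example.

```python
from __future__ import annotations

# Assumed format: a tree is either a leaf name (str) or a tuple of subtrees.


def leaves(tree) -> list[str]:
    """Return all leaf names under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def unsequenced_clades(tree, sequenced: set[str]) -> list[list[str]]:
    """Return the largest clades that contain no sequenced leaf."""
    tips = leaves(tree)
    if not any(tip in sequenced for tip in tips):
        return [tips]  # the whole clade is a gap; no need to descend further
    if isinstance(tree, str):
        return []      # a single sequenced leaf is not a gap
    gaps = []
    for child in tree:
        gaps.extend(unsequenced_clades(child, sequenced))
    return gaps


# Hypothetical toy tree; the topology and names are invented for illustration.
toy_tree = ((("taxon_a", "taxon_b"), "taxon_c"), (("taxon_d", "taxon_e"), "taxon_f"))
already_sequenced = {"taxon_a", "taxon_f"}

gaps = unsequenced_clades(toy_tree, already_sequenced)
for clade in gaps:
    print("no genome yet for clade:", clade)
```

Each reported clade is a candidate branch from which one cultured representative could be chosen; larger clades correspond to larger gaps in the tree.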
24. GEBA Pilot Target List
[Figure: bar chart. x-axis: Phyla (labels prefixed "B:" or "A:"); y-axis: # of Genomes (0 to 35)]
25. Why Increase Taxonomic Coverage?
• Gene discovery
• Annotation, functional prediction
• Metagenomic analysis
• Mechanisms of diversification
• Species phylogeny and classification
26. GEBA Pilot Project: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan
Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus,
Mat Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al.)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor
Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik
D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati, Natalia N.
Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)
27. Assess Benefits of GEBA
• All genomes have some value
• But what, if any, is the benefit of tree-guided sequencing
over other selection methods?
• Lessons for other large scale microbial
genome projects?
28. GEBA Lesson 1
rRNA Tree is Useful for Identifying
Phylogenetically Novel Genomes
rRNA Tree topology is not perfect;
Genome-based trees better
29. rRNA Tree of Life
Based on tree by Norm Pace
32. Whole genome tree
Built using AMPHORA by Martin Wu and Dongying Wu
33. PD of rRNA, Genome Trees Similar
From Wu et al. 2009. http://www.nature.com/nature/journal/v462/n7276/full/nature08656.html
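Note: "PD" here is phylogenetic diversity. A common definition (Faith's PD) is the total branch length of the part of the tree that connects a set of taxa to the root. The sketch below computes it on a toy tree; the parent-pointer format, the branch lengths, and the names are assumptions for illustration and are not taken from Wu et al. 2009.

```python
# Assumed format: each node maps to (parent, length of the branch leading to it).
# The root has no entry of its own.
toy_tree = {
    "n1": ("root", 1.0),
    "n2": ("root", 2.0),
    "A": ("n1", 1.0),
    "B": ("n1", 3.0),
    "C": ("n2", 0.5),
    "D": ("n2", 0.5),
}


def phylogenetic_diversity(tree, taxa):
    """Sum the lengths of all branches on the paths from the taxa to the root."""
    used = set()
    for node in taxa:
        # Walk toward the root, stopping once we reach a branch already counted.
        while node in tree and node not in used:
            used.add(node)
            node = tree[node][0]
    return sum(tree[node][1] for node in used)


pd_ab = phylogenetic_diversity(toy_tree, {"A", "B"})
pd_abc = phylogenetic_diversity(toy_tree, {"A", "B", "C"})
gain_c = pd_abc - pd_ab  # diversity added by sequencing C as well

print(pd_ab, pd_abc, gain_c)
```

Adding a taxon from an unsampled part of the tree (C here) contributes its own branch plus the previously uncounted internal branch above it, which is why tree-guided selection tends to add more PD per genome.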
36. Predicting Function
• Key step in genome projects
• More accurate predictions help guide
experimental and computational analyses
• Many diverse approaches
• Comparative and evolutionary analysis
greatly improves most predictions
37. Most/All Functional Prediction Improves
w/ Better Phylogenetic Sampling
• Better definition of protein family sequence
“patterns”
• Conversion of hypotheticals into conserved
hypotheticals
• Greatly improves “comparative” and “evolutionary”
based predictions
• Linking distantly related members of protein
families
• Improved non-homology prediction
38. From Wu et al. 2009. http://www.nature.com/nature/journal/v462/n7276/full/nature08656.html
39. GEBA Lesson 3
Improves analysis of genome data
from uncultured organisms
43. Shotgun Sequencing Allows Use of
Alternative Anchors (e.g., RecA)
Venter et al., 2004
44. Shotgun Sequencing Allows Use of Other Markers
Venter et al., 2004
[Figure: bar chart, "Sargasso Phylotypes". x-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota); y-axis: Weighted % of Clones (0 to 0.5); legend: EFG, EFTu, rRNA, RecA, RpoB, HSP70]
45. Shotgun Sequencing Allows Use of Other Markers
Venter et al., 2004
Cannot be done without good sampling of genomes
[Figure: bar chart, "Sargasso Phylotypes". x-axis: Major Phylogenetic Group (Alphaproteobacteria through Crenarchaeota); y-axis: Weighted % of Clones (0 to 0.5); legend: EFG, EFTu, rRNA, RecA, RpoB, HSP70]
46. Binning challenge
[Figure: items labeled A to G and T to Z]
47. Binning challenge
Best binning method: reference genomes
[Figure: items labeled A to G and T to Z]
48. Binning challenge
No reference genome? What do you do?
[Figure: items labeled A to G and T to Z]
49. Binning challenge
No reference genome? What do you do?
Phylogeny ....
[Figure: items labeled A to G and T to Z]
50. Phylogenetic Binning Using AMPHORA
AMPHORA - each read on its own tree
[Figure: bar chart. x-axis groups: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified Proteobacteria, Cyanobacteria, Chlamydiae, Acidobacteria, Bacteroidetes, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified Bacteria; y-axis: 0 to 0.7; legend (marker genes): frr, tsf, pgk, rplL, rplF, rplP, rplT, rplE, infC, rpsI, rplS, rplA, rplB, rplK, rplC, rpsJ, rplN, rplD, rplM, rpsE, rpsS, rpsB, rpsK, rpsC, rpoB, rpsM, pyrG, nusA, dnaG, rpmA, smpB]
51. Phylogenetic Binning Using AMPHORA
AMPHORA - each read on its own tree
Cannot be done without good sampling of genomes
[Figure: bar chart. x-axis groups: Alphaproteobacteria through Unclassified Bacteria; y-axis: 0 to 0.7; legend (marker genes): dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf]
52. GEBA Phylogenomic Lesson 5
We have still only scratched the
surface of microbial diversity
53. Protein Family Rarefaction Curves
• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
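Note: the three steps above can be sketched as follows. This is a minimal illustration that assumes the protein families have already been clustered (for example with MCL) and that each genome is summarized as a set of family identifiers. The genome names, family identifiers, and the choice to average over random genome orderings are assumptions for the example, not details taken from the talk.

```python
import random


def rarefaction_curve(genome_families, n_permutations=100, seed=0):
    """Mean number of distinct families seen after adding 1..N genomes in random order."""
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        rng.shuffle(genomes)
        seen = set()
        for i, genome in enumerate(genomes):
            seen |= genome_families[genome]  # add this genome's families
            totals[i] += len(seen)
    return [total / n_permutations for total in totals]


# Hypothetical input: genome -> set of protein family IDs from a prior clustering step.
families = {
    "genome_1": {"fam1", "fam2", "fam3"},
    "genome_2": {"fam2", "fam3", "fam4"},
    "genome_3": {"fam1", "fam5"},
}

curve = rarefaction_curve(families)
for n_genomes, n_families in enumerate(curve, start=1):
    print(n_genomes, round(n_families, 2))
```

A curve that is still climbing at the last genome suggests that further genomes would keep revealing new protein families; a curve that flattens suggests the sampled diversity is close to saturated.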
60. rRNA Tree of Life
Based on tree by Norm Pace
61. Phylogenetic Diversity:
Sequenced Bacteria & Archaea
From Wu et al. 2009. http://www.nature.com/nature/journal/v462/n7276/full/nature08656.html
62. Phylogenetic Diversity with GEBA
From Wu et al. 2009. http://www.nature.com/nature/journal/v462/n7276/full/nature08656.html
63. Phylogenetic Diversity: Isolates
From Wu et al. 2009. http://www.nature.com/nature/journal/v462/n7276/full/nature08656.html
64. Phylogenetic Diversity: All
From Wu et al. 2009. http://www.nature.com/nature/journal/v462/n7276/full/nature08656.html
65. [Tree of bacterial phyla]
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Most phyla with cultured species are sparsely sampled
• Lineages with no cultured taxa even more poorly sampled
[Figure: tree of bacterial phyla, Proteobacteria through OP11. Legend: Well sampled phyla; Poorly sampled; No cultured taxa]
66. Uncultured Lineages:
Technical Approaches
• Get into culture
• Enrichment cultures
• If abundant in low diversity ecosystems
• Flow sorting
• Microbeads
• Microfluidic sorting
• Single cell amplification
67. GEBA Phylogenomic Lesson 6
Need Experiments from Across the
Tree of Life too
68. As of 2002
• At least 40 phyla of bacteria
[Figure: tree of bacterial phyla, Proteobacteria through OP11; based on Hugenholtz, 2002]
69. As of 2002
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
[Figure: tree of bacterial phyla, Proteobacteria through OP11; based on Hugenholtz, 2002]
70. As of 2002
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
• Some studies in other phyla
[Figure: tree of bacterial phyla, Proteobacteria through OP11; based on Hugenholtz, 2002]