1. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
EVE 161: Microbial Phylogenomics
Lecture #16: Era IV: Shotgun Metagenomics
UC Davis, Winter 2014
Instructor: Jonathan Eisen
2. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Where we are going and where we have been
• Previous lecture:
– 15: Era IV: Shotgun Metagenomics
• Current lecture:
– 16: Era IV: Metagenomic Diversity and Binning
• Next lecture:
– 17: Era IV: Metagenomic Case Study
3. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Era IV: Genomes in the environment
Era IV:
Diversity from Metagenomics
4. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Era IV: Genomes in the environment
A Very Simple System
5. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Pierce’s Disease
6. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Glassy-winged sharpshooter
• Obligate xylem feeder
• Transmits Xylella between plants
• Much like mosquitoes transmit malarial pathogen
• Only animal listed as possible “bioterror” agent by US DHS
[Brochure cover: “Glassy-Winged Sharpshooter: A Serious Threat to California Agriculture,” from the University of California’s Pierce’s Disease Research and Emergency Response Task Force]
This informational brochure was produced by ANR Communication Services for the University of California Pierce’s Disease Research and Emergency Response Task Force. You may download a copy of the brochure from the Division of Agriculture and Natural Resources web site at http://danr.ucop.edu or from the Communication Services web site at http://danrcs.ucdavis.edu.
For local information, contact your UC Cooperative Extension farm advisor.
[Figure: Glassy-winged Sharpshooter generalized lifecycle. Curves for adults and egg masses; vertical axis 0 to 100; horizontal axis months, Jan. through Nov.]
Glassy-winged sharpshooters overwinter as adults and begin laying egg masses in late February through May. This first generation matures as adults in late May through late August. Second-generation egg masses are laid starting in mid-June through late September, which develop into over-wintering adults.
7. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Xylem feeding insects also very successful
8. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Xylem and Phloem
From Lodish et al. 2000
9. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Animal nutrition
• Xylem is frequently missing essential amino acids, vitamins, and cofactors, and has only small amounts of carbon skeletons
10. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Plant response to sap feeders
• Possible solutions to the lack of amino acids, vitamins, etc. in xylem
– Eat other things
– Evolve metabolic pathways to synthesize missing nutrients
– Find some poor sap to make the stuff for you
12. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
[Figure: workflow from DNA extraction, to PCR (makes lots of copies of the rRNA genes in sample), to sequencing of rRNA genes, to a sequence alignment (= data matrix), to a phylogenetic tree. The example sequence rRNA1 (5’ ...TACAGTATAGGTGGAGCTAGCGATCGATCGA... 3’) is placed in a tree with Yeast, E. coli, and Humans.]
PCR and phylogenetic analysis of rRNA genes
[Figure labels, cropped in the source: “bacteriomes”, “...r two obligate bacteriomes”, “...cola” (Gammaproteobacteria), “...roidetes)”, “D Takiya”, “mm”, “...domen of H. vitripennis”]
Moran et al. 2003 Environ. Microbiol.
Moran et al. 2005 Appl. Environ. Microbiol.
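The step in this workflow that turns a sequence alignment into a data matrix can be sketched in a few lines. This is only an illustration under stated assumptions: the taxon names and the three short aligned sequences are hypothetical toy data, and the uncorrected p-distance used here is one simple choice of distance, not necessarily what was used in the studies cited on the slide.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of ungapped aligned columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    diffs = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else 0.0


def distance_matrix(alignment):
    """Build a symmetric dict-of-dicts distance matrix from an alignment."""
    names = sorted(alignment)
    dist = {n: {n: 0.0} for n in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        dist[a][b] = d
        dist[b][a] = d
    return dist


# Hypothetical toy alignment (placeholder names, not real rRNA data).
toy_alignment = {
    "taxon_A": "ATTAGAAC",
    "taxon_B": "ATCACAAC",
    "taxon_C": "AGGAGTTC",
}
matrix = distance_matrix(toy_alignment)
```

A distance matrix like this is what distance-based tree-building methods take as input; character-based methods work from the alignment columns directly.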
13. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Baumannia is a close relative of Buchnera symbionts of aphids
[Figure: phylogenetic tree; host labels: Sharpshooter, Aphids, Aphids, Aphids, Ants, Flies]
Wu et al. 2006 PLoS Biology 4: e188.
14. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Baumannia is a close relative of Buchnera symbionts of aphids
[Figure: phylogenetic tree; host labels: Sharpshooter, Aphids, Aphids, Aphids, Ants, Flies]
Wu et al. 2006 PLoS Biology 4: e188.
15. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
16. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
17. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Sharpshooter Shotgun Sequencing
Wu et al. 2006 PLoS Biology 4: e188.
Collaboration with Nancy Moran’s lab
18. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
19. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
20. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
21. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Higher Evolutionary Rates in Endosymbionts
Wu et al. 2006 PLoS Biology 4: e188. Collaboration with Nancy Moran’s lab
22. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Variation in Evolution Rates Correlated with Repair Gene Presence
MutS  MutL
+     +
+     +
+     +
+     +
–     –
–     –
Wu et al. 2006 PLoS Biology 4: e188. Collaboration with Nancy Moran’s lab
23. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Polymorphisms in Metapopulation
• Data from ~200 hosts
– 104 SNPs
– 2 indels
• PCR surveys show that this is between-host variation
• Much lower ratio of transitions:transversions than in Blochmannia
• Consistent with absence of mismatch repair (MMR) from Blochmannia
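The transition:transversion ratio mentioned above can be computed directly from a list of SNPs. The sketch below is a minimal illustration; the SNP list is hypothetical toy data, not the 104 SNPs from the study. A transition swaps a purine for a purine (A/G) or a pyrimidine for a pyrimidine (C/T); every other substitution is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a valid SNP: {ref}>{alt}")
    # Same chemical class (both purines or both pyrimidines) means transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in snps:
        counts[classify_snp(ref, alt)] += 1
    if counts["transversion"] == 0:
        return float("inf")
    return counts["transition"] / counts["transversion"]


# Hypothetical toy SNPs: three transitions and three transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"),
            ("A", "C"), ("G", "T"), ("T", "A")]
ratio = ts_tv_ratio(toy_snps)
```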
24. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Predict metabolic networks from Genome
Wu et al. 2006 PLoS Biology 4: e188.
25. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Baumannia is a Vitamin and Cofactor Producing Machine
Wu et al. 2006 PLoS Biology 4: e188.
VITAMIN AND COFACTOR PRODUCING MACHINE
26. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Baumannia is a Vitamin and Cofactor Producing Machine
Wu et al. 2006 PLoS Biology 4: e188.
NO PATHWAYS FOR ESSENTIAL AMINO ACID SYNTHESIS
27. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
28. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
29. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
???????
30. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Commonly Used Binning Methods Did Not Work Well
• Assembly
– Only Baumannia generated good contigs
• Depth of coverage
– Everything else 0-1X coverage
• Nucleotide composition
– No detectable peaks in any vector we looked at
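To make the nucleotide composition approach concrete, here is a minimal sketch of two composition vectors that are often computed per contig or read: GC content and k-mer frequencies. The choice of tetranucleotides (k=4) as the default is an assumption for illustration; the slide does not say which vectors were examined. Binning by composition then looks for clusters (peaks) in these vectors, which is the step that failed for this data set.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k=4):
    """Relative frequency of every possible k-mer, in a fixed sorted order.

    k-mers containing ambiguous characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Hypothetical toy sequence, far shorter than a real contig.
toy_gc = gc_content("GGCCAT")
toy_vector = kmer_frequencies("ACGTACGT", k=2)
```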
31. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Binning challenge
[Figure: items labeled A, B, C, D, E, F, G and T, U, V, W, X, Y, Z]
32. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Binning challenge
[Figure: items labeled A, B, C, D, E, F, G and T, U, V, W, X, Y, Z]
Best binning method: reference genomes
33. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Binning challenge
[Figure: items labeled A, B, C, D, E, F, G and T, U, V, W, X, Y, Z]
No reference genome? What do you do?
Phylogeny ....
34. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
[Figure: workflow from DNA extraction, to PCR (makes lots of copies of the rRNA genes in sample), to sequencing of rRNA genes, to a sequence alignment (= data matrix), to a phylogenetic tree. Two example sequences, rRNA1 (5’ ...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’) and rRNA2 (5’ ...TACAGTATAGGTGGAGCTAGCGATCGATCGA... 3’), are placed in a tree with E. coli and Humans.]
PCR and phylogenetic analysis of rRNA genes
[Figure labels, cropped in the source: “bacteriomes”, “...r two obligate bacteriomes”, “...cola” (Gammaproteobacteria), “...roidetes)”, “D Takiya”, “mm”, “...domen of H. vitripennis”]
Moran et al. 2003 Environ. Microbiol.
Moran et al. 2005 Appl. Environ. Microbiol.
35. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
36. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Wu et al. 2006 PLoS Biology 4: e188.
37. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
38. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
???????
39. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
CFB Phyla
Wu et al. 2006 PLoS Biology 4: e188.
40. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Sulcia makes essential amino acids
41. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Sulcia makes essential amino acids
ESSENTIAL AMINO ACID PRODUCING MACHINE
42. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
43. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Era IV: Genomes in the environment
Phylogeny & Metagenomes
44. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
• Why analyze protein-coding “marker” genes in metagenomic data?
45. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
AMPHORA: AutoMated PHylogenOmic infeRence
Wu and Eisen 2008. Genome Biology 9:R151. doi:10.1186/gb-2008-9-10-r151 http://genomebiology.com/2008/9/10/R151
46. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Choosing Marker Genes
• How to choose the marker genes to use?
47. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Choosing Marker Genes
• How to choose the marker genes to use?
• “We selected these proteins because they are universally distributed in bacteria; the vast majority of them exist as single copy genes within each genome; and they are housekeeping genes that are involved in information processing (replication, transcription, and translation) or central metabolism, and thus are thought to be relatively recalcitrant to lateral gene transfer [19].”
Wu and Eisen 2008. Genome Biology 9:R151. doi:10.1186/gb-2008-9-10-r151 http://genomebiology.com/2008/9/10/R151
48. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Identification and alignment of metagenomic reads?
• How to identify and rapidly align reads that correspond to each of the markers selected?
49. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Identification and alignment of metagenomic reads?
• How to identify and rapidly align reads that correspond to each of the markers selected?
• Prealign representatives from genomes
• Manually curate to build a “mask”
• Create a hidden Markov model (HMM) of the alignment
• Use the HMM to search for, and align, reads from metagenomes
• For more on HMMs see “What is a Hidden Markov Model” by Sean Eddy. http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html
Wu and Eisen 2008. Genome Biology 9:R151. doi:10.1186/gb-2008-9-10-r151 http://genomebiology.com/2008/9/10/R151
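The idea behind the last two steps, building a model from a curated alignment and then using it to find and position matching reads, can be sketched with a much simpler stand-in for a profile HMM. The code below builds a position-specific log-odds profile (match states only) and slides it along a read. This is a deliberate simplification and an assumption of this sketch: a real profile HMM also models insertions and deletions, and the function names, the toy alignment, and the uniform background are all hypothetical.

```python
import math


def build_profile(alignment, alphabet, pseudocount=1.0):
    """Per-column log-odds scores from equal-length aligned sequences.

    Simplification: match states only, uniform background frequencies.
    """
    length = len(alignment[0])
    background = 1.0 / len(alphabet)
    profile = []
    for col in range(length):
        column = [seq[col] for seq in alignment if seq[col] in alphabet]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for symbol in alphabet:
            freq = (column.count(symbol) + pseudocount) / total
            scores[symbol] = math.log2(freq / background)
        profile.append(scores)
    return profile


def best_hit(profile, read):
    """Return (score, start) of the best ungapped placement of the profile."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(read) - width + 1):
        window = read[start:start + width]
        # Symbols outside the alphabet contribute a neutral score of 0.
        score = sum(col.get(sym, 0.0) for col, sym in zip(profile, window))
        if score > best[0]:
            best = (score, start)
    return best


# Hypothetical toy "curated alignment" and read.
toy_profile = build_profile(["ACGT", "ACGA", "ACGT"], "ACGT")
toy_score, toy_start = best_hit(toy_profile, "TTACGTTT")
```

The returned start position plays the role of the alignment of the read to the reference columns; the score plays the role of the search score used to decide whether the read belongs to the marker family at all.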
50. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
HMMs and Profiles
• Useful guide here: http://compbio.soe.ucsc.edu/ismb99.handouts/KK185FP.html
Profile HMM
51. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
55. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
[Figure: phylogenetic tree; tip labels as extracted from the slide]
Sequence IDs: ORF00955, gi73748443, gi57234592, gi86609871, gi86605128, gi17229086, gi75910405, gi113476515, gi16329957, gi56752516, gi81300331, gi33862041, gi78779962, gi33865147, gi78213583, gi78184182, gi113953748, gi33863774, gi72382854, gi33241089, gi22298184, gi37521852, ZAVAM73TF
Taxa: Thermomicrobium roseum DSM 5159; Dehalococcoides sp. CBDB1; Dehalococcoides ethenogenes 195; Synechococcus sp. JA 2 3Bʼa 2 13; Synechococcus sp. JA 3 3Ab; Nostoc sp. PCC 7120; Anabaena variabilis ATCC 29413; Trichodesmium erythraeum IMS101; Synechocystis sp. PCC 6803; Synechococcus elongatus PCC 6301; Synechococcus elongatus PCC 7942; Prochlorococcus marinus subsp. pastoris str. CCMP1986; Prochlorococcus marinus str. MIT 9312; Synechococcus sp. WH 8102; Synechococcus sp. CC9605; Synechococcus sp. CC9902; Synechococcus sp. CC9311; Prochlorococcus marinus str. MIT 9313; Prochlorococcus marinus str. NATL2A; Prochlorococcus marinus subsp. marinus str. CCMP1375; Thermosynechococcus elongatus BP 1; Gloeobacter violaceus PCC 7421
56. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Search with HMMs
57. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Search with HMMs
58. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
59. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
60. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
61. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
What Else Needed for This to Be Better?
62. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
What Else Needed for This to Be Better?
• More Phylogenetically Diverse Genomes
• More Markers esp. Taxon Specific Markers
• Better / Faster Searching
• Longer reads
• Many sequences at once
• Many genes at once
• Alpha and Beta Diversity
• Automated Masking
63. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Zorro - Automated Masking
[Figure: distance to true tree (vertical axis, 0.0 to 9.0) versus sequence length (horizontal axis: 200, 400, 800, 1600, 3200) for three treatments: no masking, zorro, gblocks]
Wu M, Chatterji S, Eisen JA (2012) Accounting For Alignment Uncertainty in Phylogenomics. PLoS ONE 7(1): e30288. doi:10.1371/journal.pone.0030288
64. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Improving II: More Families
[Figure 1 of Sharpton et al. 2013, panels A, B, C. Boxes shown: Representative Genomes; Extract Protein Annotation; All v. All BLAST; Homology Clustering (MCL); SFams; Align & Build HMMs; HMMs; Screen for Homologs; New Genomes; Extract Protein Annotation]
65. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Kembel Combiner
... typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).
Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both a ...
FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
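The two measures described in the figure caption can be written down compactly once each branch of the tree is summarized by its length and the number of descendant sequences from each environment. The sketch below is a minimal illustration under that assumption; the tuple layout, function names, and toy tree are hypothetical, and the weighted version shown is the raw (unnormalized) form.

```python
def unweighted_unifrac(branches):
    """Fraction of branch length leading to descendants from only one
    of the two environments.

    branches: iterable of (length, n_a, n_b), where n_a and n_b count the
    descendant sequences of that branch from environments A and B.
    """
    unique = 0.0
    total = 0.0
    for length, n_a, n_b in branches:
        if n_a == 0 and n_b == 0:
            continue  # branch has no descendants in either sample
        total += length
        if (n_a == 0) != (n_b == 0):
            unique += length
    return unique / total if total else 0.0


def weighted_unifrac(branches, total_a, total_b):
    """Raw weighted UniFrac: each branch length is weighted by the absolute
    difference in relative abundance of its descendants in A and B."""
    return sum(length * abs(n_a / total_a - n_b / total_b)
               for length, n_a, n_b in branches)


# Hypothetical toy tree ((a1,b1),(a2,b2)): four tip branches of length 1
# and two internal branches of length 2 that each lead to one A and one B.
toy_branches = [
    (1.0, 1, 0), (1.0, 0, 1), (2.0, 1, 1),
    (1.0, 1, 0), (1.0, 0, 1), (2.0, 1, 1),
]
toy_unweighted = unweighted_unifrac(toy_branches)
toy_weighted = weighted_unifrac(toy_branches, total_a=2, total_b=2)
```

In the toy tree the internal branches behave like Branch 3 in the caption: they lead to equal proportions of both environments, so they add nothing to the weighted measure and count as shared in the unweighted one.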
66. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Phylosift - Mining the Global Metagenome
Jonathan Eisen, Aaron Darling, Guillaume Jospin, Holly Bik, Tiffanie Nelson, Mark Brown
Erick Matsen (FHCRC)
Todd Treangen (BNBI, NBACC)
Students and other staff:
- Eric Lowe, John Zhang, David Coil
Open source community:
- BLAST, LAST, HMMER, Infernal, pplacer, Krona, metAMOS, Bioperl, Bio::Phylo, JSON, etc.
PhyloSift is open source software:
- http://phylosift.wordpress.org
- http://github.com/gjospin/phylosift
Supported by DHS Grant
67. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Output 1: Taxonomy
Taxonomic summary plots in Krona (Ondov et al. 2011)
68. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Tree Reconciliation in PhyloSift
69. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Tree Reconciliation in PhyloSift
Environmental Sequences
Named Taxa
70. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Output 2: Phylogenetic Tree of Reads
Placement tree from 2-week-old infant gut data
71. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Output: Edge PCA
Edge PCA: Identify lineages that explain most variation among samples
Matsen and Evans 2013; Darling et al., submitted.
72. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Edge PCA vs. UniFrac PCA
QIIME and Edge PCA on 110 fecal metagenomes from Yatsunenko et al. 2012 Nature. Sequenced with 454, to about 150 Mbp/metagenome.
Edge PCA: Matsen and Evans 2013
Darling et al., submitted.
73. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Output 3: Edge PCA
• Edge PCA for exploratory data analysis (Matsen and Evans 2013)
• Given E edges and S samples:
– For each edge, calculate difference in placement mass on either side of edge
– Results in E x S matrix
– Calculate E x E covariance matrix
– Calculate eigenvectors, eigenvalues of covariance matrix
• Eigenvector: each value indicates how “important” an edge is in explaining differences among the S samples
Example calculating a matrix entry for an edge: with mass=5 on one side and mass=2 on the other, this edge gets 5-2=3
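The steps listed on this slide can be traced in a small, dependency-free sketch. Everything here is illustrative and assumed: the input layout (per sample, a pair of placement masses for each edge), the function names, and the toy numbers. The leading eigenvector is found with plain power iteration to keep the example self-contained; a numerical library's eigen-solver would normally be used instead.

```python
import math


def edge_differences(masses):
    """Build the E x S matrix: for each edge and sample, the difference in
    placement mass between the two sides of that edge.

    masses[s][e] is a (mass_one_side, mass_other_side) pair.
    """
    n_samples = len(masses)
    n_edges = len(masses[0])
    return [[masses[s][e][0] - masses[s][e][1] for s in range(n_samples)]
            for e in range(n_edges)]


def covariance(matrix):
    """E x E sample covariance between the rows (edges) of the matrix."""
    n_samples = len(matrix[0])
    centered = []
    for row in matrix:
        mean = sum(row) / n_samples
        centered.append([v - mean for v in row])
    return [[sum(a * b for a, b in zip(ri, rj)) / (n_samples - 1)
             for rj in centered] for ri in centered]


def leading_eigenvector(cov, iterations=100):
    """Power iteration; assumes the start vector is not orthogonal to the
    leading eigenvector."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical toy data: 3 samples, 2 edges. Edge 0 in sample 0 matches the
# slide's example (mass 5 vs. mass 2); edge 1 is identical in every sample.
toy_masses = [
    [(5.0, 2.0), (4.5, 2.5)],
    [(4.0, 3.0), (4.5, 2.5)],
    [(3.0, 4.0), (4.5, 2.5)],
]
toy_matrix = edge_differences(toy_masses)
toy_cov = covariance(toy_matrix)
toy_vec = leading_eigenvector(toy_cov)
```

In the toy data only edge 0 varies among samples, so the leading eigenvector puts all of its weight on edge 0, which is the sense in which an eigenvector entry flags an edge as “important”.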
74. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Phylosift/ pplacer Workflow
Aaron Darling, Guillaume Jospin, Holly Bik, Erick Matsen, Eric Lowe, and others
[Figure: workflow diagram. Input sequences; each input sequence is scanned against both workflows (rRNA workflow and protein workflow). Components shown: LAST (fast candidate search; search input against references); profile HMMs used to align candidates to reference alignment; hmmalign (multiple alignment); Infernal (multiple alignment); a split at <600 bp and >600 bp; a parallel option; pplacer (phylogenetic placement). Outputs shown: Taxonomic Summaries (Krona plots, number of reads placed for each marker gene) and Sample Analysis & Comparison (Edge PCA, tree visualization, Bayes factor tests).]
75. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Phylosift Markers
• PMPROK – Dongying Wu’s Bac/Arch markers
• Eukaryotic Orthologs – Parfrey 2011 paper
• 16S/18S rRNA
• Mitochondria – protein-coding genes
• Viral Markers – Markov clustering on genomes
• Codon Subtrees – finer scale taxonomy
• Extended Markers – plastids, gene families
76. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
More Markers
Phylogenetic group      Genome Number   Gene Number   Marker Candidate
Archaea                 62              145415        106
Actinobacteria          63              267783        136
Alphaproteobacteria     94              347287        121
Betaproteobacteria      56              266362        311
Gammaproteobacteria     126             483632        118
Deltaproteobacteria     25              102115        206
Epsilonproteobacteria   18              33416         455
Bacteroides             25              71531         286
Chlamydiae              13              13823         560
Chloroflexi             10              33577         323
Cyanobacteria           36              124080        590
Firmicutes              106             312309        87
Spirochaetes            18              38832         176
Thermi                  5               14160         974
Thermotogae             9               17037         684
Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE 8(10): e77033. doi:10.1371/journal.pone.0077033
77. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Better Reference Tree
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees and Supermatrices. PLoS ONE 8(4): e62510. doi:10.1371/journal.pone.0062510
78. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
79. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees
Dongying Wu (1), Martin Wu (1,4), Aaron Halpern (2,3), Douglas B. Rusch (2,3), Shibu Yooseph (2,3), Marvin Frazier (2,3), J. Craig Venter (2,3), Jonathan A. Eisen (1)*
1 Department of Evolution and Ecology, Department of Medical Microbiology and Immunology, University of California Davis Genome Center, University of California
Davis, Davis, California, United States of America, 2 The J. Craig Venter Institute, Rockville, Maryland, United States of America, 3 The J. Craig Venter Institute, La Jolla,
California, United States of America, 4 University of Virginia, Charlottesville, Virginia, United States of America
Abstract
Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data
associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and
culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated
directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we
argue here, in studies of very early events in the evolution of gene families and of species.
Methodology/Principal Findings: We designed and implemented new methods for analyzing metagenomic data and used
them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly
used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies.
Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in
making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel
branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these
novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences.
Conclusions/Significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from
uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third
possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which
sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree
of life, we suggest that methods such as those described herein currently offer the best way to search for them.
Citation: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and
Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Editor: Robert Fleischer, Smithsonian Institution National Zoological Park, United States of America
Received October 25, 2010; Accepted February 20, 2011; Published March 18, 2011
This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration which stipulates that, once placed in the public
domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.
Funding: The development and main work on this project was supported by the National Science Foundation via an ‘‘Assembling the Tree of Life’’ grant (number 0228651) to Jonathan A. Eisen and Naomi Ward. The final work on this project was funded by the Gordon and Betty Moore Foundation (through
80. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Figure 1. Phylogenetic tree of the RecA superfamily. All RecA sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicas. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RecA superfamily are shaded and given a name on the right. Five of the proposed subfamilies contained only GOS sequences at the time of our initial analysis (RecA-like SAR, Phage SAR1, Phage SAR2, Unknown 1 and Unknown 2) and are highlighted by colored shading. As noted on the tree and in the text, sequences from two Archaea that were released after our initial analysis group in the Unknown 2 subfamily.
doi:10.1371/journal.pone.0018011.g001
81. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Five RecA subfamilies were identified as being novel (i.e., only seen in metagenomic data) in our initial analyses. GOS metagenome assemblies that encode members of
these subfamilies were identified and the genes neighboring the novel RecAs were characterized. The neighboring gene descriptions are based on the top BLASTP hits
against the NRAA database; taxonomy assignments are based on their closest neighbor in phylogenetic trees built from the top NRAA BLASTP hits.
doi:10.1371/journal.pone.0018011.t002
Figure 2. The largest assembly from the GOS data that encodes a novel RecA subfamily member (a representative of subfamily
Unknown 2). This GOS assembly (ID 1096627390330) encodes 33 annotated genes plus 16 hypothetical proteins, including several with similarity to
known archaeal genes (e.g., DNA primase, translation initiation factor 2, Table 2). The arrow indicates a novel recA homolog from the Unknown 2
subfamily (cluster ID 9).
doi:10.1371/journal.pone.0018011.g002
82. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Methods
Identification of deeply-branching ss-rRNA sequences
these 340 sequences were extracted from the European Ribosomal
RNA database [66] and then manually curated to remove
columns with more than 90% gaps or with poor alignment
Figure 3. Phylogenetic tree of the RpoB superfamily. All RpoB sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicas. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RpoB superfamily are shaded and given a name on the right. The two novel RpoB clades that contain only GOS sequences are highlighted by the colored panels.
doi:10.1371/journal.pone.0018011.g003