2. [Figure: rooted radial tree of eukaryote diversity. Most tip labels did not survive extraction; legible groups include cercozoa, foraminiferans, diatoms, cryptophytes, dictyostelid slime molds, oxymonads, discicristates, opisthokonts and excavates.]
3. Phyloinformatics Workshop Edinburgh 2007
1: Forests of trees, and loads of kindling
2: Organising principles
3: iPhy design
4: iPhy deployment
5: Nameless taxa & endless forms
4. Phyloinformatics Workshop Edinburgh 2007
1: Forests of trees, and loads of kindling
Phylogenetics is a growth area.
The raw materials (sequences)
are being added at a startling rate.
Tree databases are also growing
(both in number and size).
So how does a lab worker bee keep up?
11. from Rod Page “Towards a Taxonomically Intelligent Phylogenetic Database”
[Figure: cumulative number of molecular phylogenies and of TreeBASE studies, by year (1980 to 2000); y-axis 0 to 7000]
13. Phyloinformatics Workshop Edinburgh 2007
Two modes of data acquisition
(a) wet lab - compute lab synergy
explicitly source the sequences needed
preformed ideas of
the best taxa to sample
the best genes to sample
[this is the source of most phylogenetic data]
14. Phyloinformatics Workshop Edinburgh 2007
Two modes of data acquisition
(a) wet lab - compute lab synergy
(b) magpie surfing / tree surgery
using phyloinformatic tools
to discover the set of available
genes AND taxa
to address a particular problem
15. Phyloinformatics Workshop Edinburgh 2007
2: Organising principles
On average …
• more data are better
more taxa
more genes
• multiple methods are better
17. while the NCBI taxonomy
isn’t the best in the world,
at least every sequence
is attached to a taxon,
and TAX_IDs are unique
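A minimal sketch of why unique taxon IDs are useful as an organising key: if every sequence carries a TAX_ID and every TAX_ID has one parent, then "all sequences under clade X" is a simple walk up the tree. The IDs, sequence names and the parent table below are made up for illustration and are not real NCBI identifiers.

```python
from typing import Dict, List

def lineage(tax_id: int, parent: Dict[int, int]) -> List[int]:
    """Return tax_id followed by its ancestors, ending at the root."""
    path = [tax_id]
    # The root is assumed to have no entry in the parent table.
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def sequences_under(clade_id: int, seq_tax: Dict[str, int],
                    parent: Dict[int, int]) -> List[str]:
    """List every sequence whose taxon sits at or below clade_id."""
    return sorted(s for s, t in seq_tax.items()
                  if clade_id in lineage(t, parent))

# Hypothetical toy taxonomy: 1 is the root, 10 and 20 are clades.
parent = {101: 10, 102: 10, 10: 1, 201: 20, 20: 1}
seq_tax = {"seqA": 101, "seqB": 102, "seqC": 201}
print(sequences_under(10, seq_tax, parent))
```

Because the lookup only needs the parent table, an alternative taxonomy could be swapped in without touching the sequence records.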
18. The Edinburgh EST analysis Pipeline
trace2dbest
  Process raw sequence traces
  Trim off vector & low quality
CLOBB
  Cluster into putative gene objects
  Predict consensus sequence
prot4EST
  Predict translation reading frame
  Generate protein translation
annot8r
  Annotate using BLAST, GOtcha, PSort, Pfam, SigPep, KEGG
PartiGene
  Collate information in relational database
19. NEMBASE3 http://www.nematodes.org/
The web portal to NEMBASE3
Mark Blaxter, James Wasmuth,
Ann Hedley & Ralf Schmid
University of Edinburgh,
Institute of Evolutionary Biology,
Edinburgh UK EH9 3JT
mark.blaxter@ed.ac.uk
20. NEMBASE3 http://www.nematodes.org/
Collectors’ curve of nematode protein families
[Figure: number of families (0 to 50000) plotted against total number of proteins (0 to 150000). Species labelled on the curve: Caenorhabditis elegans, Ancylostoma caninum, Strongyloides stercoralis, Meloidogyne incognita, Brugia malayi, Trichinella spiralis. Points marked A, B and C.]
25. Generating a slice that
• maximises taxonomic coverage
• maximises present data/minimises missing data
[Figure: gene by taxon presence matrix; the example slice keeps genes a, b, e, f, g, i for taxa 1, 3, 7, 9]
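One way to picture the trade-off between coverage and missing data is a greedy pruning of the gene by taxon matrix. The sketch below is an illustration only, not the method used by iPhy: the function name, the `min_fill` threshold and the tie-breaking rule (drop a gene before a taxon) are all assumptions.

```python
from typing import Dict, List, Set, Tuple

def make_slice(presence: Dict[str, Set[str]],
               min_fill: float = 0.75) -> Tuple[List[str], List[str]]:
    """Prune the sparsest gene or taxon until every retained row and
    column is at least min_fill complete.

    presence maps each taxon to the set of genes it has data for.
    """
    taxa = set(presence)
    genes = set().union(*presence.values())
    while taxa and genes:
        # Fraction of retained genes present for each taxon.
        taxon_fill = {t: len(presence[t] & genes) / len(genes) for t in taxa}
        # Fraction of retained taxa that have each gene.
        gene_fill = {g: sum(g in presence[t] for t in taxa) / len(taxa)
                     for g in genes}
        worst_taxon = min(sorted(taxa), key=taxon_fill.get)
        worst_gene = min(sorted(genes), key=gene_fill.get)
        if (taxon_fill[worst_taxon] >= min_fill
                and gene_fill[worst_gene] >= min_fill):
            break
        # On ties, drop the gene: this favours taxonomic coverage.
        if gene_fill[worst_gene] <= taxon_fill[worst_taxon]:
            genes.remove(worst_gene)
        else:
            taxa.remove(worst_taxon)
    return sorted(taxa), sorted(genes)

# Hypothetical toy matrix.
presence = {
    "t1": {"a", "b", "c"},
    "t2": {"a", "b"},
    "t3": {"a", "b", "c"},
    "t4": {"d"},
}
print(make_slice(presence, min_fill=0.75))
```

Greedy pruning is not guaranteed to find the largest possible slice, but it is easy to reason about and makes the two goals explicit.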
26. Phyloinformatics Workshop Edinburgh 2007
2: Organising principles
• assess all relevant taxa
• assess all relevant sequence
• store aligned sequences locally
• output ‘slices’ of data in analysis-ready formats
• store trees locally
• store alternative taxonomic systems
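27. [Figure: metazoan phylogeny (Philippe et al.) contrasting sampling from complete genome sequences with sampling that includes ESTs from neglected taxa. Taxa shown: Platyhelminthes, Annelida, Mollusca, Tardigrada, Nematoda, Arthropoda, Vertebrata, Urochordata, Cephalochordata, Echinodermata, Ctenophora, Cnidaria, Choanoflagellata, Fungi. Clades are marked L, P, E, C and D.]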
31. Inputs: sequence, alignment, TreeFam, TreeBASE, user tree, systematic
Processing of sequence and alignment inputs to
* identify relevant sequences and store locally
* associate sequences and taxa
Processing of TreeFam and TreeBASE inputs to
* identify relevant sequences and store locally
* capture tree data
* reconcile tree nodes with existing systems
Processing of user tree and systematic inputs to
* capture tree data
* reconcile tree nodes with existing systems
iPhy database
Alignment Cycle: POA, tranAlign
Orthologue Inference Engine: TreeFam, Ortho-MCL
[Figure: small example alignments (AGGCT / ACGGT / CCGGA, and a protein fragment PheTyr) are drawn at each stage of the flow]
36. Phyloinformatics Workshop Edinburgh 2007
4: iPhy deployment
version 0.1: ‘TaxMan’
TaxMan automates assembly of large
sequence datasets for chosen taxa
TaxMan automates generation of aligned
sequences sets for chosen genes
37. Phyloinformatics Workshop Edinburgh 2007
4: iPhy deployment
version 0.1: ‘TaxMan’
TaxMan simplifies selection of taxa for
analysis
e.g. given a gene set, choosing one species per family
(choosing the species with the least missing data)
e.g. given a taxon set, choosing the genes
(choosing genes with less than a given % missing data)
e.g. generating custom defined alignments
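The two selection rules above can be sketched as follows. This is a minimal illustration under assumptions, not TaxMan's actual code: the function names, the dictionaries and the toy species are all hypothetical.

```python
from typing import Dict, Iterable, List, Set

def one_species_per_family(family_of: Dict[str, str],
                           presence: Dict[str, Set[str]],
                           genes: Iterable[str]) -> Dict[str, str]:
    """For each family, keep the species with the least missing data
    (that is, the most genes present) for the given gene set."""
    wanted = set(genes)
    best: Dict[str, tuple] = {}
    for species in sorted(presence):
        family = family_of[species]
        n_present = len(presence[species] & wanted)
        if family not in best or n_present > best[family][0]:
            best[family] = (n_present, species)
    return {family: species for family, (_, species) in best.items()}

def genes_below_missing(presence: Dict[str, Set[str]],
                        taxa: List[str],
                        max_missing_pct: float) -> List[str]:
    """Keep genes missing from less than max_missing_pct of the taxa."""
    all_genes = set().union(*(presence[t] for t in taxa))
    keep = []
    for gene in sorted(all_genes):
        missing = sum(gene not in presence[t] for t in taxa)
        if 100.0 * missing / len(taxa) < max_missing_pct:
            keep.append(gene)
    return keep

# Hypothetical toy data.
family_of = {"sp1": "FamA", "sp2": "FamA", "sp3": "FamB"}
presence = {"sp1": {"18S"}, "sp2": {"18S", "28S"}, "sp3": {"18S", "actin"}}
chosen = one_species_per_family(family_of, presence, ["18S", "28S", "actin"])
print(chosen)
print(genes_below_missing(presence, sorted(chosen.values()), 50))
```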
38. Phyloinformatics Workshop Edinburgh 2007
4: iPhy deployment
version 0.1: ‘TaxMan’
TaxMan simplifies analysis by exporting
formatted alignments (NEXUS)
of nucleotides
(with codon positions and genes as defined partitions)
of amino acids
(with genes as defined partitions)
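A partitioned NEXUS export of this kind might look like the sketch below. It is an assumption-laden illustration, not TaxMan's exporter: the function name is hypothetical, each gene alignment is assumed to start in frame (column 1 is codon position 1), and taxa lacking a gene are padded with "?".

```python
from typing import Dict

def to_nexus(alignments: Dict[str, Dict[str, str]],
             datatype: str = "dna") -> str:
    """Concatenate per-gene alignments and emit a NEXUS file with one
    charset per gene and, for DNA, one per codon position per gene.

    alignments maps gene name -> {taxon: aligned sequence}.
    """
    taxa = sorted(set().union(*(aln.keys() for aln in alignments.values())))
    rows = {t: "" for t in taxa}
    charsets = []
    start = 1
    for gene, aln in alignments.items():
        length = len(next(iter(aln.values())))
        for t in taxa:
            # Pad taxa with no data for this gene.
            rows[t] += aln.get(t, "?" * length)
        end = start + length - 1
        charsets.append(f"  charset {gene} = {start}-{end};")
        if datatype == "dna":
            for pos in (1, 2, 3):
                first = start + pos - 1
                # "\3" means every third site, the usual NEXUS notation.
                charsets.append(
                    f"  charset {gene}_pos{pos} = {first}-{end}\\3;")
        start = end + 1
    nchar = start - 1
    out = ["#NEXUS", "begin data;",
           f"  dimensions ntax={len(taxa)} nchar={nchar};",
           f"  format datatype={datatype} missing=? gap=-;",
           "  matrix"]
    out += [f"  {t}  {rows[t]}" for t in taxa]
    out += ["  ;", "end;", "begin sets;"] + charsets + ["end;"]
    return "\n".join(out) + "\n"

# Hypothetical toy alignments.
text = to_nexus({"geneA": {"t1": "ATGAAA", "t2": "ATGAAG"},
                 "geneB": {"t1": "ATGCCC"}})
print(text)
```

Passing datatype="protein" skips the codon-position charsets and keeps only the per-gene partitions.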
39. Phyloinformatics Workshop Edinburgh 2007
4: iPhy deployment
version 0.1: ‘TaxMan’
TaxMan simplifies post-phylogenetic analysis
by
saving trees
(with links to the original data)
saving analytical metadata
(algorithm, parameters, settings)
saving tree statistics
(bootstraps, branch lengths)
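As a rough sketch of what saving a tree with its provenance could involve, the example below stores a Newick string (which can carry branch lengths and bootstrap values) next to a link to the source alignment and the analysis settings. The table layout, column names and function names are assumptions for illustration, not the iPhy schema.

```python
import json
import sqlite3
from typing import Any, Dict

def open_tree_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small tree store."""
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS tree (
        id INTEGER PRIMARY KEY,
        newick TEXT NOT NULL,        -- topology, branch lengths, support
        alignment_id TEXT NOT NULL,  -- link back to the original data
        algorithm TEXT NOT NULL,
        parameters TEXT NOT NULL     -- JSON-encoded settings
    )""")
    return con

def save_tree(con: sqlite3.Connection, newick: str, alignment_id: str,
              algorithm: str, parameters: Dict[str, Any]) -> int:
    """Insert one tree and return its row id."""
    cur = con.execute(
        "INSERT INTO tree (newick, alignment_id, algorithm, parameters) "
        "VALUES (?, ?, ?, ?)",
        (newick, alignment_id, algorithm,
         json.dumps(parameters, sort_keys=True)))
    con.commit()
    return cur.lastrowid

def load_tree(con: sqlite3.Connection, tree_id: int) -> Dict[str, Any]:
    """Fetch one tree with its metadata decoded."""
    row = con.execute(
        "SELECT newick, alignment_id, algorithm, parameters "
        "FROM tree WHERE id = ?", (tree_id,)).fetchone()
    return {"newick": row[0], "alignment_id": row[1],
            "algorithm": row[2], "parameters": json.loads(row[3])}

# Hypothetical usage with made-up values.
con = open_tree_store()
tid = save_tree(con, "((A:0.1,B:0.2)95:0.05,C:0.3);", "aln-001",
                "maximum likelihood", {"model": "GTR+G", "bootstraps": 100})
print(load_tree(con, tid))
```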
41. Lophotrochozoa
• 70,000 annotated sequences
• 630,000 EST sequences
• 21 genes (mt + 18S 28S actin H3 WG EF1A)
• 53,000 sequences extracted
• 17,000 aligned consensus sequences
• 8,700 species represented
• One day for data collection, one for alignment
42. Molecular Phylogenetics and Evolution 43 (2007) 583–595
www.elsevier.com/locate/ympev
The effect of model choice on phylogenetic inference using
mitochondrial sequence data: Lessons from the scorpions
Martin Jones a,*, Benjamin Gantenbein b, Victor Fet c, Mark Blaxter a
a Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK
b AO Research Institute, Clavadelerstrasse 8, Davos Platz CH-7270, Switzerland
c Department of Biological Sciences, Marshall University, Huntington, WV 25755-2510, USA
Received 25 April 2006; revised 14 November 2006; accepted 14 November 2006
Available online 29 November 2006
47. organism-size curve: Eukaryotes
[Figure: number of individuals (log scale: few, lots, squillions) against size of organism (log scale: minuscule, tiny, just visible, small, big), with regions marked POSSIBLE PREDATORS and FOOD ITEMS]
48. Sourhope farm
NERC "Soil Biodiversity and Ecosystem Function"
Programme Study Site
120 m x 75 m
of raw Scottish upland grass
13 000 000 000 nematodes
52. motu
1. to cut; to snap off
motu-á te hau, the fishing line snapped off
2. to engrave, to inscribe
letters or pictures in stone or in wood, like the motu mo rogorogo, inscriptions
for recitation in lines called kohau.
3. islet
some names of islets: Motu Motiro Hiva, Motu Nui, Motu Iti, Motu Kaokao,
Motu Tapu, Motu Marotiri, Motu Kau, Motu Tavake, Motu Tautara, Motu Ko
Hepa Ko Maihori, Motu Hava.
53. Phyloinformatics Workshop Edinburgh 2007
5: Nameless taxa & endless forms
MOTU
specimen-based surveys
CBoL Barcode of Life (CO1)
anonymous, specimen-free surveys
environmental sampling
bulk community DNA
millions of sequences
54. Phyloinformatics Workshop Edinburgh 2007
5: Nameless taxa & endless forms
~1.2 million described species
~10-100 million species in reality
Thus, most ‘species’
will never be formally named.
55. Phyloinformatics Workshop Edinburgh 2007
5: Nameless taxa & endless forms
How do we incorporate these myriad
‘nameless taxa’ into our systems?
56. Phyloinformatics Workshop Edinburgh 2007
TaxMan, iPhy & chelicerate evolution
Martin Jones
MOTU and barcoding
Robin Floyd &
Jenna Mann
PartiGene & EST analysis
Ralf Schmid,
James Wasmuth
& Ann Hedley