Course slides for computational phyloinformatics, an annual course organized by NESCent in collaboration with hosting organizations across the world. I teach the Perl section of the course; these are the slides I presented in 2010 at BGI, Shenzhen, PRC.
6. Rooted vs. unrooted trees (tree terminology) Rooted tree: has a root that denotes common ancestry. Unrooted tree: only specifies the degree of kinship among taxa, but not the evolutionary path. [Figure: a rooted tree and an unrooted tree of taxa A to F.]
7. Rooted and unrooted trees The number of rooted and unrooted trees for n species is $N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$ and $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$.
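To get a feel for how quickly these numbers grow, the two formulas can be evaluated directly. The sketch below is only an illustration; the function names `n_rooted` and `n_unrooted` are made up for this example, and the formulas are taken to count strictly bifurcating trees.

```python
from math import factorial

def n_rooted(n):
    """Number of rooted bifurcating trees for n species (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def n_unrooted(n):
    """Number of unrooted bifurcating trees for n species (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

# Print a small table of tree counts.
for n in (4, 5, 10):
    print(n, n_rooted(n), n_unrooted(n))
```

Note that an unrooted tree of n + 1 species has as many shapes as a rooted tree of n species, because the root behaves like one extra tip.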
11. Monophyletic A monophyletic group is a group of organisms which forms a clade, meaning that it consists of an ancestor and all its descendants. (Most clades on our Supertree are monophyletic.)
16. Cladograms and phylograms Phylograms: branch lengths are proportional to the amount of change that occurred on that branch (these are the gene trees before r8s). Cladograms: branch lengths are not proportional to the amount of change (this is the Supertree from Monday). [Figure: the same tree of taxa A to F drawn as a phylogram and as a cladogram.]
17. Ultrametric trees If the distance from the root represents time (not change) we can use trees to study how fast new species form. (This is our final tree after we put it all together.)
18. Types of data What evidence are phylogenetic trees based on?
19. Distance data Example: DNA-DNA hybridization. The more closely related two species are, the more similar their DNA. The more similar the DNA, the stronger the bond between the two strands, and the shorter the distance.
21. Molecular sequence data I am sure you have all heard about DNA sequencing. Amino acid sequences are often used for more distantly related species.
22. Types of data Two categories: (1) Numerical data: evolutionary distance between two species, usually derived from sequence data. (2) Character data: each character has a finite number of states, e.g. number of legs = 1, 2, 4; DNA = {A, C, T, G}.
24. Distance methods Types of data: distance matrices, either from DNA-DNA hybridization or computed from sequences. Examples: UPGMA is the oldest distance matrix method; neighbor-joining is more commonly used.
25. Distance data When using sequences, distance-based methods must first transform the sequence data into a pairwise distance matrix for use during tree inference.
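As an illustration of this transformation, the sketch below computes the simplest possible distance, the proportion of differing sites (the p-distance), for every pair of aligned sequences. The toy alignment and the function names are invented for this example; real analyses usually apply a substitution-model correction on top of this.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Map each unordered pair of taxa to its p-distance."""
    return {
        frozenset((a, b)): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }

# Hypothetical toy alignment, made up for illustration only.
alignment = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACTTAT",
}
matrix = distance_matrix(alignment)
for pair, d in matrix.items():
    print(sorted(pair), d)
```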
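26. Neighbor-joining methods (1) Maintain a pairwise distance matrix. (2) Find the pair of taxa to join: the closest two, after correcting each distance for how far both taxa are from all the others. (3) Collapse them into one row (internal node) and recompute the distance from the merged row to every other row. (4) Loop to step 2. Build the tree as you go.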
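The loop described on this slide can be sketched in a few lines. This is a minimal illustration, not the code of any particular program: it uses the standard neighbor-joining pair criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is the sum of a taxon's distances to all others, and the names `neighbor_joining`, `node0`, `node1`, ... as well as the example matrix are assumptions made for this example.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return a list of (node, parent, branch_length) edges.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    nodes, d, edges, next_id = list(names), dict(dist), [], 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: sum of distances from each node to all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair that minimizes the Q criterion.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the two joined nodes to the new internal node.
        length_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges += [(i, u, length_i), (j, u, dij - length_i)]
        # Distance from the merged row to every remaining row.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))  # connect the last two nodes
    return edges

# Hypothetical additive distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8], [5, 0, 10, 10, 9], [9, 10, 0, 8, 7], [9, 10, 8, 0, 3], [8, 9, 7, 3, 0]]
dist = {frozenset((names[i], names[j])): rows[i][j] for i in range(5) for j in range(i + 1, 5)}
edges = neighbor_joining(names, dist)
for child, parent, length in edges:
    print(child, parent, length)
```

The result is an unrooted tree given as an edge list; the first join groups `a` and `b`, the pair with the lowest Q value.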
27. Character methods Types of data: any homologized data, such as morphological data or molecular sequences. Examples: optimality-criterion methods (maximum parsimony, maximum likelihood) and Bayesian methods (MCMC).
28. What is homology? Definition: homology means any similarity between characters that is due to their shared ancestry. Anatomical structures that evolved from the same structure in some ancestor species are homologous (example: forelimbs). In genetics, homology can be observed in aligned DNA sequences.
29. What is an “optimality criterion”? An optimality criterion is simply a way to quantify, using a number, how well a tree fits the data relative to other trees. Examples are parsimony tree length (this is how the Supertree was optimized on the CIPRES cluster) and likelihood score. The posterior probability can also be seen as an optimality criterion.
30. Parsimony tree length Tree length is the minimum number of reconstructed changes needed to explain the data on a given tree. The most parsimonious tree is the tree with the fewest changes.
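One common way to count the minimum number of changes on a fixed tree is Fitch's algorithm, sketched below for unordered characters on a rooted binary tree. The nested-tuple tree encoding, the function names, and the toy alignment are assumptions made for this illustration.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one character.

    tree:   nested 2-tuples with tip names (strings) as leaves.
    states: dict mapping tip name to its observed state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no change needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, alignment):
    """Sum the per-site minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {tip: seq[i] for tip, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

# Hypothetical two-site alignment and two competing trees.
alignment = {"A": "GA", "B": "GA", "C": "TA", "D": "TC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(tree_length(tree_1, alignment), tree_length(tree_2, alignment))
```

Here `tree_1` needs fewer changes than `tree_2`, so it is the more parsimonious of the two.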
31. Finding the optimal tree Under an optimality criterion, trees need to be compared with one another to find the one with the best score (the shortest length under parsimony, the highest likelihood under ML). For MP and ML trees, this is usually done with hill-climbing algorithms.
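The idea of hill-climbing can be shown with a deliberately simplified sketch: start from some tree, try all rearrangements of one kind, move to the best one if it improves the score, and stop when nothing improves. The rearrangement used here (exchanging two tip labels) is an assumption chosen to keep the example short; real programs use branch-swapping moves such as NNI, SPR or TBR. The score is the parsimony length of a single character.

```python
from itertools import combinations

def fitch(tree, states):
    """Return (state set, change count) for one character on a rooted binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    (lset, lcost), (rset, rcost) = fitch(tree[0], states), fitch(tree[1], states)
    common = lset & rset
    return (common, lcost + rcost) if common else (lset | rset, lcost + rcost + 1)

def tips(tree):
    """List the tip labels of a nested-tuple tree, left to right."""
    return [tree] if isinstance(tree, str) else [t for child in tree for t in tips(child)]

def swap_tips(tree, a, b):
    """Return a copy of the tree with tip labels a and b exchanged."""
    if isinstance(tree, str):
        return b if tree == a else a if tree == b else tree
    return tuple(swap_tips(child, a, b) for child in tree)

def hill_climb(tree, states):
    """Greedy search: accept the best tip swap as long as it shortens the tree."""
    best = fitch(tree, states)[1]
    while True:
        candidates = [swap_tips(tree, a, b) for a, b in combinations(tips(tree), 2)]
        challenger = min(candidates, key=lambda t: fitch(t, states)[1])
        score = fitch(challenger, states)[1]
        if score >= best:  # no neighbor is better: we are on a (local) optimum
            return tree, best
        tree, best = challenger, score

# Hypothetical character and starting tree.
states = {"A": "G", "B": "G", "C": "T", "D": "T"}
start = (("A", "C"), ("B", "D"))
final_tree, final_score = hill_climb(start, states)
print(fitch(start, states)[1], final_tree, final_score)
```

Because the search only ever moves uphill, it can get stuck on a local optimum in larger tree spaces.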
32. …but this is not the whole story! Maximum parsimony assumes a very simple model of evolutionary change, namely that change is rare. Molecular evolution in particular can be modeled in more realistic ways, using substitution models. There are more complex ways to explore tree space than just hill-climbing (such as the Parsimony Ratchet). We can also sample different areas of tree space to see how optimality is distributed, using MCMC.
35. Additional parameters Gamma distribution: rates vary among sites. Invariant sites: perhaps some sites never change. Maybe specify their proportion?
36. Likelihood and the number of parameters More parameters always lead to a better fit to the data.
37. Likelihood and the number of parameters More parameters always lead to a higher value of the likelihood, whether or not the additional parameters are providing a ‘significantly’ better fit to the data.
38. Are the extra parameters justified? Likelihood ratio statistic: $2 \log \left( \frac{\text{Maximum Likelihood} \mid H_1}{\text{Maximum Likelihood} \mid H_0} \right)$. It has a chi-squared distribution with dof = number of additional parameters. (We did this with ModelTest.)
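The test on this slide can be sketched as follows. This is an illustration of the likelihood ratio test itself, not of how ModelTest is implemented; the log-likelihood values are invented for the example, and `chi2_sf` is a small helper written here (valid for integer degrees of freedom) so that the sketch needs only the standard library.

```python
import math

def chi2_sf(x, dof):
    """P(X > x) for a chi-squared variable with integer dof.

    Uses Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1) with y = x / 2,
    starting from the closed forms for dof = 1 and dof = 2.
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if dof % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))   # dof = 1
    else:
        a, q = 1.0, math.exp(-y)              # dof = 2
    while a < dof / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1))
        a += 1.0
    return q

def likelihood_ratio_test(lnl_h0, lnl_h1, extra_params):
    """Return (statistic, p-value) for nested models H0 (simple) and H1 (richer)."""
    statistic = 2.0 * (lnl_h1 - lnl_h0)
    return statistic, chi2_sf(statistic, extra_params)

# Hypothetical log-likelihoods: H1 has one more parameter than H0.
statistic, p_value = likelihood_ratio_test(-1234.5, -1230.0, extra_params=1)
print(statistic, p_value)
```

A small p-value suggests that the extra parameter improves the fit by more than chance alone would explain.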
39. How did we use the substitution models? Each substitution has an associated likelihood given a branch of a certain length and the estimated model parameters. A function is derived to represent the likelihood of the data given the tree, branch-lengths and additional parameters. Optimise the branch lengths to get the maximum likelihood estimate.
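As a concrete, deliberately tiny illustration of these steps, the sketch below uses the Jukes-Cantor (JC69) model, which is an assumption made for this example and not necessarily the model chosen in the course. It computes the likelihood of two aligned sequences separated by a single branch and then optimises that branch length by a simple grid search. The sequences and function names are made up.

```python
import math

def jc69_prob(same, t):
    """JC69 probability of observing the same (or one specific different) base
    after a branch of length t, measured in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def pair_log_likelihood(seq_a, seq_b, t):
    """Log-likelihood of two aligned sequences joined by one branch of length t."""
    # 0.25 is the equilibrium frequency of the base at one end of the branch.
    return sum(math.log(0.25 * jc69_prob(x == y, t)) for x, y in zip(seq_a, seq_b))

def best_branch_length(seq_a, seq_b, step=0.001, upper=2.0):
    """Grid search for the branch length that maximizes the likelihood."""
    grid = [step * i for i in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda t: pair_log_likelihood(seq_a, seq_b, t))

# Hypothetical sequences differing at 1 of 10 sites.
seq_a, seq_b = "ACGTACGTAC", "ACGTACGTAT"
estimate = best_branch_length(seq_a, seq_b)
analytic = -0.75 * math.log(1.0 - 4.0 * 0.1 / 3.0)  # closed-form JC69 distance
print(estimate, analytic)
```

For two sequences the optimum has a closed form, which makes a convenient check; on a full tree, all branch lengths and model parameters are optimised numerically in the same spirit.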
41. Rate smoothing r8s methods attempt to simultaneously estimate unknown divergence times and smooth the rapidity of rate change along lineages. This is done by invoking some function that penalizes rates that change too quickly from branch to neighboring branch.
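One simple form such a penalty could take is the sum of squared rate differences between each branch and its parent branch. The sketch below only illustrates that idea under this assumption; it is not r8s's actual code, and the branch names and rates are invented.

```python
def roughness_penalty(parent_of, rate):
    """Sum of squared rate differences between each branch and its parent branch.

    parent_of: dict mapping a branch name to its parent branch (None near the root).
    rate:      dict mapping a branch name to its substitution rate.
    """
    return sum(
        (rate[branch] - rate[parent]) ** 2
        for branch, parent in parent_of.items()
        if parent is not None
    )

# Hypothetical three-branch example: branch "AB" leads to tips "A" and "B".
parent_of = {"AB": None, "A": "AB", "B": "AB"}
smooth_rates = {"AB": 1.0, "A": 1.1, "B": 0.9}
jumpy_rates = {"AB": 1.0, "A": 3.0, "B": 0.2}
print(roughness_penalty(parent_of, smooth_rates))
print(roughness_penalty(parent_of, jumpy_rates))
```

A smoothing method would search for rates and divergence times that fit the branch lengths while keeping this penalty low.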
42. Supertree Given a cladogram, how do we infer the divergence dates of the true tree? The relative lengths of some branches can be obtained from genes that fit an MLK model. [Figure: a supertree of taxa A to E whose branch lengths are NOT time, and gene trees covering subsets of those taxa.]
43. “True tree” Estimates from multiple molecular sequences can subsequently be combined by calibrating the gene trees on a common node, and applying the resulting node depths to the supertree. [Figure: gene trees (labeled Simmons and Hackman) and the combined tree of taxa A to E drawn on a time axis.]
44. Where did we get the other dates? If there is no extinction and the speciation rate is constant (!), the expected waiting time from one speciation event to the next is proportional to 1/n, where n = number of lineages. This is a little more complicated if we take multiple labeled histories into account, but we can come up with expected ages this way.
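The 1/n rule can be turned into expected event times with a few lines of arithmetic. The sketch below assumes a pure-birth process with a per-lineage speciation rate (1 by default), starts the clock at the root with two lineages, and ignores the labeled-history correction mentioned on the slide. The function name is hypothetical.

```python
from fractions import Fraction

def expected_speciation_times(n_tips, rate=1):
    """Expected time (since the root) at which the k-th lineage appears.

    With k lineages present, the expected wait for the next speciation is
    1 / (k * rate). Returns a dict mapping k = 3 .. n_tips to the expected time.
    """
    times, elapsed = {}, Fraction(0)
    for k in range(2, n_tips):
        elapsed += Fraction(1) / (k * Fraction(rate))
        times[k + 1] = elapsed
    return times

times = expected_speciation_times(5)
for k, t in times.items():
    print(k, t)
```

With five tips, the waits are 1/2, 1/3 and 1/4 time units, so splits near the tips are expected to follow each other more quickly than those near the root.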
46. What is PhyloInformatics? A made-up word! We’ve seen that we have to deal with data of different types (trees, sequences, alignments, metadata). These are part of complex workflows or pipelines. We “do” phyloinformatics when we come up with repeatable ways to automate these pipelines.
47. The power of UNIX UNIX is very useful for phyloinformatics: everything is text-based; everything can be scripted and called from other programs; many programs for phylogenetics are available on UNIX platforms; everything can be piped together to create larger workflows.
48. The power of Perl Perl allows us to chain other UNIX tools together; many Perl libraries exist for dealing with biological data; it is easy to learn and quick to develop in.
49. Join us! We do a lot more phyloinformatics: hackathons, Google Summer of Code, ongoing projects. Stay in touch, we can help each other!