2. TIGRTIGR
Talk Outline
• Complete Genome Projects - history and current
status
• What have we learned about evolutionary history
and processes from recent genome projects
• Two main themes - completeness and closeness
• Coming attractions
• Why we need more genomes
3. TIGRTIGR
The Institute for Genomic
Research
• A not-for-profit institution, staff ~230
• Departments:
– Eukaryotic Genomics
– Microbial Genomics
– Functional Genomics
– Bioinformatics
– Sequencing Core
4. TIGRTIGR
General Steps in Analysis of
Complete Genomes
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Comparative genomics
6. TIGRTIGR
Limitations of Genome Analysis
• Functional predictions are PREDICTIONS
• Need to follow up all predictions with
experimental work
• Each genome sequence is a snapshot of one clone
• Genome analysis is not able to identify novel
processes
• Annotation needs to be updated
• Assembly can be wrong
• Some parts of genome may be missed (e.g., low
copy plasmids)
7. TIGRTIGR
Evolutionary Genomics I:
Selection of Species
• Phylogenetic diversity
• Relatedness to model organism
• Understanding major evolutionary
transitions
• Determining right depth
• Short branch lengths
12. TIGRTIGR
Genome sequences and evolution
• Origin of new gene function
• Gene loss
• Genome degradation
• Gene and genome duplication
• Rates and patterns of mutation,
recombination
• Gene transfer
• Species evolution
16. TIGRTIGR
Why Identify Gene Loss
• Indicates that gene is not absolutely required for
survival
• Parallel loss of same gene in different species may
indicate selective advantage of loss of that gene
• Correlated loss of genes in a pathway indicates a
conserved association among those genes
(important for phylogenetic profiles)
• Loss in organellar genomes frequently
accompanied by gain in nuclear genome
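The point about correlated loss can be illustrated with phylogenetic profiles: each gene gets a presence/absence vector across complete genomes, and genes lost in the same set of genomes are candidates for a shared pathway. The sketch below is only an illustration; the gene names, genome names, and presence/absence values are invented, and exact profile matching is one simple choice among many.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) calls across five complete genomes.
# Gene names and values are invented for illustration only.
GENOMES = ["sp1", "sp2", "sp3", "sp4", "sp5"]
PROFILES = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 0),  # absent from the same genomes as geneA
    "geneC": (1, 0, 1, 1, 1),
    "geneD": (1, 1, 1, 1, 1),  # present everywhere: uninformative
}


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def correlated_pairs(profiles, max_distance=0):
    """Return gene pairs whose profiles differ in at most max_distance genomes.

    Genes present in every genome (or in none) are skipped because they
    carry no information about correlated loss.
    """
    informative = {g: p for g, p in profiles.items() if 0 < sum(p) < len(p)}
    pairs = []
    for g1, g2 in combinations(sorted(informative), 2):
        if hamming(informative[g1], informative[g2]) <= max_distance:
            pairs.append((g1, g2))
    return pairs


if __name__ == "__main__":
    print(correlated_pairs(PROFILES))
```

A profile match is only a hint of association. As the other slides stress, absence calls are reliable only when the genomes are complete.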
17. TIGRTIGR
Duplication and Loss of Mismatch
Repair Genes
[Figure, panels A and B: species tree and MutS gene tree (MutS-I lineage with MutS1; MutS-II lineage with MutS2), with gene duplication (*) and gene loss (1-5) events marked. Species shown: E. coli, H. influenzae, N. gonorrhoeae, H. pylori, Syn. sp., B. subtilis, S. pyogenes, M. pneumoniae, M. genitalium, A. aeolicus, D. radiodurans, T. pallidum, B. burgdorferi.]
19. TIGRTIGR
Why Duplications Are Useful to Identify
• Allows division into orthologs and paralogs
• Improves functional predictions
• Helps identify mechanisms of duplication
• Can be used to study mutation processes in
different parts of a genome
• Lineage-specific duplications may indicate
species-specific adaptations
21. TIGRTIGR
C. pneumoniae Paralogs by Position
[Figure: dot plot of paralog pairs; axes: Query ORF Position and Subject ORF Position, each 0 to 1,250,000]
22. TIGRTIGR
C. pneumoniae Paralogs -
Lineage Specific
[Figure: dot plot of lineage-specific paralog pairs; axes: Query ORF Position and Subject ORF Position, each 0 to 1,250,000]
24. TIGRTIGR
X-files
Eisen et al. 2000. Genome Biology 1(6): 11.1-11.9
Also see Tillier and Collins. 2000. Nature Genetics
26(2):195-7.
25. TIGRTIGR
V. cholerae vs. E. coli
Best Matching Proteins by Location
[Figure: dot plot; axes: E. coli ORF Coordinates (0 to 5,000,000) and V. cholerae ORF Coordinates (0 to 3,000,000)]
26. TIGRTIGR
M. leprae vs. M. tuberculosis Whole
Genome Alignment
[Figure: dot plot; axes: Mycobacterium tuberculosis (0 to 4,000,000) and Mycobacterium leprae (0 to 3,000,000)]
27. TIGRTIGR
Duplication and Gene Loss Model
[Figure: schematic of the model. An ancestral gene order (A B C D E F) is duplicated, giving a second copy (A’ B’ C’ D’ E’ F’). The E. coli and V. cholerae lineages then lose different copies, leaving A C D F A’ B’ E’ in one lineage and B C D F A’ B’ D’ E’ in the other.]
28. TIGRTIGR
C. trachomatis vs C. pneumoniae Dot Plot
[Figure: dot plot with axes C. trachomatis MoPn and C. pneumoniae AR39; the origin and terminus are marked]
30. TIGRTIGR
Why are Inversions Symmetrical
Around Origin
• Genetic studies in Salmonella and E. coli
suggest that there may be strong selection
against other inversions
– Mahan, Segall, Schmid and Roth
– Liu and Sanderson
– Rebollo, Francois, and Louarn
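A small simulation can show why inversions that are symmetric around the origin produce an X-shaped dot plot. The sketch assumes a circular genome with the origin at coordinate 0, treats genes as points, and ignores strand orientation; the genome length, gene count, and number of inversions are arbitrary illustrative values, not figures from the talk.

```python
import random


def symmetric_inversion(positions, genome_length, half_width):
    """Invert the segment spanning half_width on each side of the origin (0).

    positions maps gene -> coordinate on a circular genome. A gene inside
    the inverted segment at coordinate p ends up at (genome_length - p).
    """
    updated = {}
    for gene, p in positions.items():
        inside = p <= half_width or p >= genome_length - half_width
        updated[gene] = (genome_length - p) % genome_length if inside else p
    return updated


def simulate(genome_length=1000, n_genes=200, n_inversions=5, seed=0):
    """Return (ancestral, derived) coordinate pairs after random inversions."""
    rng = random.Random(seed)
    ancestral = {g: rng.randrange(genome_length) for g in range(n_genes)}
    derived = dict(ancestral)
    for _ in range(n_inversions):
        half_width = rng.randrange(1, genome_length // 2)
        derived = symmetric_inversion(derived, genome_length, half_width)
    return [(ancestral[g], derived[g]) for g in range(n_genes)]


def on_x_pattern(pairs, genome_length):
    """True if every point lies on the diagonal or the anti-diagonal."""
    return all(y == x or y == (genome_length - x) % genome_length
               for x, y in pairs)


if __name__ == "__main__":
    print(on_x_pattern(simulate(), 1000))
```

However many symmetric inversions accumulate, each gene stays at its original distance from the origin, on one side or the other. A scatter plot of the pairs therefore falls on the two diagonals of the X.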
32. TIGRTIGR
Why Gene Transfers Are Useful to Identify
• Laterally transferred genes frequently involved in
environmental adaptations and/or pathogenicity
• Helps identify transposons, integrons, and other
vectors of gene transfer
• Helps identify species associations in the
environment
35. TIGRTIGR
How to Infer Gene Transfers
• Unusual distribution patterns
• Unusual nucleotide composition
• High sequence similarity to supposedly
distantly related species
• Unusual gene trees
• Observe transfer events
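The "unusual nucleotide composition" criterion can be sketched as a simple screen: compute the GC content of each gene and flag genes that deviate strongly from the rest. The sequences, gene names, and the 0.10 cutoff below are invented for illustration, and using the mean of per-gene GC values as the baseline is an assumption made to keep the example short.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def flag_unusual_gc(genes, threshold=0.10):
    """Return names of genes whose GC content differs from the mean GC of
    all genes by more than threshold (an arbitrary illustrative cutoff)."""
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    mean_gc = sum(gc.values()) / len(gc)
    return sorted(n for n, v in gc.items() if abs(v - mean_gc) > threshold)


# Toy sequences, invented for illustration only.
GENES = {
    "g1": "ATGCATGC",
    "g2": "AATTGGCC",
    "g3": "GATCGATC",
    "g4": "TACGTACG",
    "g5": "GCGCGCGCAT",  # much higher GC than the others
}

if __name__ == "__main__":
    print(flag_unusual_gc(GENES))
```

A composition screen like this is only one line of evidence. It cannot detect transfers from donors with similar composition, so it is best combined with the other criteria in the list, such as gene trees and distribution patterns.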
41. TIGRTIGR
Lateral Transfer Inference Based
on Complete Genome Analysis II:
Bacterial to Vertebrate Transfers
Based on Analysis of the Human
Genome
42. TIGRTIGR
Lander et al. ‘Evidence’
• Genes match bacteria but not non-vertebrate
eukaryotes
• Or, genes have stronger match to bacteria
than non-vertebrates
• A set of ~120 of these genes found in many
bacterial species
43. TIGRTIGR
Alternative explanations
• Gene loss from non-vertebrate eukaryotes
• Rapid divergence in non-vertebrate
eukaryotes
• Incomplete genomes (e.g., D.
melanogaster)
• Bad annotation/gene finding
• Contamination
47. TIGRTIGR
Birney et al., same issue of Nature
as the human genome paper (Lander et al.)
“The unfinished human genomic DNA may contain
contamination, particularly from bacteria but also
from other sources. Contaminating DNA is routinely
removed from finished sequence, but some is still
present in unfinished sequence. If the predicted gene
matches a bacterial gene more closely than any
vertebrate gene then it will almost always be a
contaminant.”
57. TIGRTIGR
Genomics does not require initial
culturing step.
• Isolate, by filtration, all bacteria in a water sample
• Extract total DNA in very large pieces
• Clone those pieces as BACs into E. coli to get enough DNA
• Sequence the BACs like a bacterial genome.
[Figure: Natural water -> Filter, concentrate -> Extract DNA -> Clone into BACs -> Sequence -> Gene list]
58. TIGRTIGR
Bacterial Rhodopsin:
a new photosynthesis system in the oceans
[Figure: BAC from SAR86, an uncultured bacterium, sequenced and analyzed -> rhodopsin found -> cloned into E. coli -> E. coli pumps protons (H+) in the light; diagram labels: light, H+, ADP, ATP]
Beja O, et al. Science 2000; 289:1902-6.
Bacterial rhodopsin: evidence for a new type of phototrophy in the sea.
63. TIGRTIGR
Wither Genomics? Not yet.
• Despite limitations, a great deal can still be
learned from genome sequence analysis.
64. TIGRTIGR
Evolutionary Diversity Still Poorly
Represented in Complete Genomes
[Figure: A. rRNA tree of Bacterial and Archaeal major groups. B. The same tree with groups that have completed genomes highlighted.]
65. TIGRTIGR
Limited Ecological and Physiological
Diversity
• All genomes from cultured species or
pathogens/symbionts
• Limited ecological diversity
– most are from pathogens or thermophiles
• Limited physiological diversity
– need whole range for particular physiologies,
not just extremes
67. TIGRTIGR
Why Completeness is Important
• Improves characterization of genome features
– Gene order, replication origins
• Better comparative genomics
– Genome duplications, inversions
• Presence and absence of particular genes can be very
important (e.g., gene loss)
• Missing sequence might be important (e.g.,
centromere)
• Allows researchers to focus on biology not sequencing
• Facilitates large scale correlation studies
68. TIGRTIGR
Acknowledgements
• Genome inversions: S. Salzberg, J. Heidelberg, O. White, A.
Stoltzfus, J. Peterson, H. Ochman
• Genome sequences and analysis: J. Heidelberg, T. Read, H.
Tettelin, K. Nelson, J. Peterson, R. Fleischmann, D. Bryant
• Horizontal transfers: K. Nelson, W. F. Doolittle
• TIGR: C. Fraser, J. Venter, M-I. Benito, S. Kaul, Seqcore
• $$$: NSF, NIH, ONR, DOE
69. TIGRTIGR
Evolutionary Studies Improve
Most Aspects of Genome Analysis
• Phylogeny of species places comparative data in perspective
• Evolution of genes and gene families
– Functional predictions
– Identification of orthologs and paralogs
– Species specific mutation patterns
• Evolution of pathways
– Convergence
– Prediction of function
• Evolution of gene order/genome rearrangements
• Phylogenetic distribution patterns
• Identification of novel features
70. TIGRTIGR
Genome Information and Analysis
Improves Studies of Evolution
• Complete genome information particularly useful
• Unbiased sampling
• More sequences of genes
• Presence/absence information needed to infer certain
events (e.g., gene loss, duplication)
• Genome wide mutation and substitution patterns (e.g.,
strand bias)
• Diversification and duplication
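Strand bias is often summarized as GC skew, (G - C) / (G + C), computed in windows along a complete genome. In many bacterial genomes the sign of the skew changes near the replication origin and terminus. The sketch below is a minimal illustration; the window size and the toy sequence are assumptions, not values from the talk.

```python
def gc_skew(seq):
    """(G - C) / (G + C) for one sequence; 0.0 if it has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windowed_skew(seq, window):
    """GC skew in consecutive non-overlapping windows of the given size.

    A trailing partial window is ignored.
    """
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


if __name__ == "__main__":
    # Toy sequence: G-rich first half, C-rich second half.
    print(windowed_skew("GGGGCCCC", 4))
```

This kind of genome-wide pattern only becomes visible with a complete, correctly assembled sequence, which is the point the slide makes about complete genome information.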
73. TIGRTIGR
Tracing Gene Loss
• Need presence and absence information of orthologous
genes from different species
• Determining absence requires a complete genome
• May still miss some homologs (e.g., due to rapid
divergence)
• Helps to have closely related species
• Use standard character state reconstruction methods to
infer gene gain and loss
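As one example of a standard character state reconstruction method, the sketch below applies Fitch parsimony to the presence or absence of a single gene on a rooted binary species tree. The tree, species names, and presence calls are hypothetical. Fitch parsimony weights gain and loss equally, and the choice to assume presence when the root is ambiguous is an assumption of this sketch; other weightings such as Dollo parsimony are possible.

```python
def fitch_sets(tree, present, sets=None):
    """Bottom-up pass: map every node to its Fitch state set.

    tree is a leaf name (str) or a (left, right) tuple of subtrees;
    present is the set of leaf names in which the gene was found.
    """
    sets = {} if sets is None else sets
    if isinstance(tree, str):
        sets[tree] = {1 if tree in present else 0}
    else:
        left, right = tree
        fitch_sets(left, present, sets)
        fitch_sets(right, present, sets)
        common = sets[left] & sets[right]
        sets[tree] = common if common else sets[left] | sets[right]
    return sets


def infer_events(tree, present):
    """Top-down pass: list (node, 'gain' or 'loss') for each branch on
    which the reconstructed state changes. Ambiguity at the root is
    resolved in favour of presence (an assumption of this sketch)."""
    sets = fitch_sets(tree, present)
    events = []

    def descend(node, parent_state):
        state = parent_state if parent_state in sets[node] else max(sets[node])
        if parent_state is not None and state != parent_state:
            events.append((node, "loss" if state == 0 else "gain"))
        if not isinstance(node, str):
            for child in node:
                descend(child, state)

    descend(tree, None)
    return events


# Hypothetical species tree and presence calls, for illustration only.
TREE = ((("A", "B"), "C"), ("D", "E"))

if __name__ == "__main__":
    print(infer_events(TREE, {"A", "C", "D"}))
```

With the gene absent from B and E only, the reconstruction places two independent losses on those branches. That is the parallel-loss pattern that complete genomes from closely related species make detectable.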