2009 11 09 UCLA Bioinformatics Talk
1. Blasting mold with the data firehose:
Comparative and evolutionary genomics of filamentous
fungi with next generation sequencing.
Jason Stajich
Plant Pathology and Microbiology
University of California, Riverside
3. Fungi have diverse forms, ecology, and associations
[Photo panel; species shown, with image credits where given]
• Cryptococcus neoformans (X. Lin)
• Coprinopsis cinerea (Ellison & Stajich)
• Aspergillus niger (N Read)
• Glomus sp. (Univ Sydney)
• Rozella allomycis (James et al)
• Puccinia graminis (J. F. Hennen)
• Batrachochytrium dendrobatidis (J. Longcore)
• Laccaria bicolor (Martin et al.)
• Neurospora crassa (Hickey & Reed)
• Phycomyces blakesleeanus (T. Ootaki)
• Ustilago maydis (Kai Hirdes)
• Amanita phalloides (M Wood)
• Xanthoria elegans (Botany POtD)
• Rhizopus stolonifer
• Blastocladiella simplex (Stajich & Taylor)
7. Tools for comparative genomics
• Need organized data - databases with integrated information and capability to grow and
add additional species or experiments
• Community interactive resources - Web-based is often the best mix of interactive and easily
available
• Genome Browsers to see genomic context information, important for visualizing high
density data like 2nd-generation sequencing (RNA-Seq, ChIP-Seq)
• Summaries of Analyses -- “Gene Pages” with detailed information for each locus
• Other things that are needed: Community annotation and collection of information to make
sense of these comparisons
• Repository of annotations and comparative analyses: synteny, orthologs, gene families
8. Genome Browser data integration - Gbrowse
[Figure: GBrowse view of Ncra_OR74A_chrIV_contig7.20, 300k to 330k, with the tracks listed here]
• DNA_GCContent (% gc)
• NCBI genes (Broad called):
NCU04433 sulfate permease II CYS-14
NCU04432 hypothetical protein
NCU04431 related to endo-1,3-beta-glucanase
NCU04430 related to aminopeptidase Y precursor; vacuolar
NCU04429 conserved hypothetical protein
NCU04428 related to spindle assembly checkpoint protein
NCU04427 conserved hypothetical protein
NCU04426 related to cyclin-suppressing protein kinase
NCU04425 putative protein
NCU04424 related to regulator of chromatin
• PASA updated NCBI/Broad genes: NCU04433, NCU04432, NCU04424, with PASA assemblies asmbl_9429 to asmbl_9446 (status:12)
• Named Genes (Radford laboratory): cys-14, gh16-3, tRNA{phe}-9
• miRNA Solexa histogram
• K4dime ChIP-Seq histogram (SOAP)
• K9met3 ChIP-Seq histogram (SOAP)
Stajich et al, unpublished; Smith, Freitag, et al unpublished
10. Fungal evolution at different time scales
• Deep divergences of fungi
• How did multicellular fungi evolve?
• What molecular changes allowed the transition from aquatic to
terrestrial life in fungi?
• Closer comparisons
• What are lineage specific changes that influenced evolution of
animal and plant pathogenic fungi?
How are
12. Human pathogen Coccidioides
• Coccidioides (Valley fever)
• Is a primary human pathogen - infects healthy people - most
human pathogenic fungi are opportunistic.
• Endemic in US Southwest, Mexico
• Requires a BSL3 laboratory and is a Select Agent
• Difficult to reliably collect from nature.
• Comparative analyses of Coccidioides spp. to learn more about
dispersal.
• Can we identify potential pathogenicity genes based on
molecular signatures?
17. Population Genomics
• 20 strains sequenced, 10 from each spp. 13 via Sanger
sequencing, 7 via Solexa/Illumina resequencing
• 680 000 filtered SNPs across genomes (~28Mb genome).
• What can we learn from these data?
• Hybridization and Migration inferred from population statistics
(FST)
• (Effective) population size (Ne)
• Testing for selective sweeps in regions of the genome
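To put the SNP count on this slide in perspective, the short sketch below converts it into an average density. It uses only the two figures given above (680 000 filtered SNPs, ~28 Mb genome); the function name is hypothetical, the genome size is approximate, and the SNPs span both species, so the result is a rough average rather than a within-species estimate.

```python
def snp_density(n_snps, genome_bp):
    """Return (SNPs per kb, average bp between SNPs) for a genome.

    Both inputs are taken from the slide; the genome size is approximate.
    """
    per_kb = n_snps / genome_bp * 1000
    bp_per_snp = genome_bp / n_snps
    return per_kb, bp_per_snp


# Figures from the slide: 680 000 filtered SNPs, ~28 Mb genome.
per_kb, spacing = snp_density(680_000, 28_000_000)
print(f"{per_kb:.1f} SNPs per kb, about one SNP every {spacing:.0f} bp")
```

This works out to roughly 24 SNPs per kb on average.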
18. Two species of Coccidioides
C. immitis
C. posadasii
EVOLUTION
Fisher et al, 2000
19. Chrom I
• FST: 1 is complete separation,
0 is no separation
• Applied to whole genome can
estimate when regions
diverged and if there has
been recent hybridization
(migration of alleles).
Neafsey, Barker, et al. In prep
[Figure: FST across the chromosomes]
Evidence for hybridization between Ci and Cp
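The FST scale described on this slide (1 is complete separation, 0 is no separation) can be illustrated with a minimal sketch. It uses the basic heterozygosity-based definition, FST = (HT - HS) / HT, for two equally weighted populations at biallelic sites. This is an assumption made for illustration and not necessarily the estimator used by Neafsey, Barker, et al.; the function names are hypothetical.

```python
import math


def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic site."""
    return 2.0 * p * (1.0 - p)


def site_components(p1, p2):
    """Return (HT - HS, HT) for one site, weighting both populations equally."""
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2.0
    h_t = heterozygosity((p1 + p2) / 2.0)
    return h_t - h_s, h_t


def fst_two_pops(p1, p2):
    """Per-site FST; NaN when the site is fixed for the same allele in both."""
    num, den = site_components(p1, p2)
    return num / den if den > 0.0 else math.nan


def windowed_fst(freq_pairs):
    """Ratio-of-sums FST over the (p1, p2) allele-frequency pairs in a window."""
    num = den = 0.0
    for p1, p2 in freq_pairs:
        n, d = site_components(p1, p2)
        num += n
        den += d
    return num / den if den > 0.0 else math.nan
```

Identical allele frequencies give 0, fixed differences give 1, and a window that mixes both kinds of site falls in between. Low-FST windows between two otherwise well separated species are the kind of signal the slide links to recent hybridization.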
20. Effective Population Size
[Figure: page from Fisher et al., PNAS April 10, 2001, vol. 98, showing a tree of Coccidioides isolates with the Ci and Cp clades labeled. Caption: "Fig. 1. Neighbor-joining tree of pairwise allele-sharing genetic distances calculated with the program MICROSAT. Tree construction was performed in the PHYLIP package (36). The isolate marked with an asterisk signifies a patient who was diagnosed in Texas but was subsequently found to ha... infection in California (42). The tree is mid-point rooted, and the scale bar signifies 0.1 changes. CA, Californian; non-CA, non-Californian."]
• Ne of 2.25 x 10^6 in C. immitis and 4.82 x 10^6 in C. posadasii - Cp has a 2.15-fold
larger effective population size.
Neafsey, Barker, et al. In prep
21. Coccidioides population genomics
• C. immitis is endemic to Central and
Southern California, mountain ranges likely
block its migration into Arizona.
• Smaller effective population size is consistent
with a smaller geographic range or perhaps the
fission of the population due to an introduced
geographic barrier.
• There is evidence of inter-species
hybridization events (introgression) and
bidirectional exchange of alleles.
• Some evidence for selective sweeps as well
based on populations, ongoing work to verify
and validate these observations.
22. Evolution of a pathogen
• Comparing sequences from two Coccidioides species, a closely
related outgroup, and many related species.
• Are there genes with signatures of positive selection that may
distinguish pathogen from non-pathogen?
• Are there differences in presence-absence of genes or sizes of
gene families that suggest differences in pathogen?
24. Gene family changes
• Another mechanism for adaptation may be changes in copy
number of a gene family
• Gene duplication is a source of novelty allowing for changes in
the function of one copy if the other maintains original function
• Expansions of copy number may also be an easy way to get
more protein for a particular process
• How important is copy number change in adaptation?
27. Keratinases in Onygenales
[Figure: protein domain diagram with SignalP and Subtilisin_N labeled]
• Onygenales are keratinophilic
• Domains: Peptidase S8, Subtilisin domains
• Large expansion of putative keratinases in Onygenales
36. Towards identifying genes underlying adaptation
• Coccidioides is found in desert soil and associated with animals - long
term animal association
• Genes under positive selection may play a role in Cocci-specific
developmental stages (Spherule and Endospore) and some (as of yet)
unknown processes
• Loss of genes involved in plant product metabolism suggests nutritional
shift in Onygenales from relatives in Eurotiales
• Expansion of a few gene families that may be involved in metabolism - none
are Coccidioides-specific though.
• Sampling of a closer non-pathogenic outgroup can help polarize recent
changes. Expression analyses may help assign function to some of the genes
with positive selection signatures
37. Neurospora genomics
• Improving the annotation and identification of functional
elements with NGS
• Transcriptional profiling and describing the transcriptome
38. Neurospora as a model for evolutionary biology tests
[Figure: Phylogenetic and Biological species. Phylogeny of Neurospora strains (scale bar: 5 changes) with clades labeled N. sitophila, N. perkinsi (PS3), N. crassa, N. tetrasperma, N. hispaniola (PS1), N. metzenbergi (PS2), N. intermedia and N. discreta. Tips are strain IDs (CV and D series) with collection locality; branch support values (0.89, 1.00, and 68 to 96) label major nodes.]
Localities shown: Anhui, Congo, Fiji, Florida, Gabon, Haiti, Hawaii, Honduras, Indonesia, Ivory Coast, Java, Karnataka, Liberia, Louisiana, Madagascar, Malaysia, Mexico, New Mexico, Nigeria, Panama, Papua New Guinea, Philippines, Queensland, Tahiti, Taiwan, Tamil Nadu, Texas, Thailand, Truk, Virginia, Yucatan, Unknown.
Dettman et al, Evolution 2003
Villalta et al, Mycologia 2009
42. Next generation sequencing in Neurospora crassa
• Solexa/Illumina libraries of 35-45 bp read length, 8-12 M reads per
library
• RNA-Seq from hyphal tip (Hall, Glass, Kasuga) and a cross (C.
Ellison) - ongoing project from R. Brem, J. Taylor, NL Glass to
generate ~100 RNA-Seq data sets in N. crassa
• Small RNA-Seq from a pooled library of cross, vegetative growth
• ChIP-Seq from methylated DNA (meDIP), Histone H3K4 & H3K9
methylation, and centromeric proteins (CenPC, CenH3) (K. Smith &
M. Freitag)
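The library sizes on this slide translate into raw sequence yield as in the sketch below. The read lengths and read counts come from the slide; the 40 Mb genome size for N. crassa is an assumption added here for illustration (it is not stated in the talk), and the function names are hypothetical. The nominal depth figure would only be meaningful if reads were spread evenly over the genome, which RNA-Seq and ChIP-Seq reads are not.

```python
def library_yield_bp(n_reads, read_len_bp):
    """Total sequenced bases in one library."""
    return n_reads * read_len_bp


def nominal_depth(n_reads, read_len_bp, genome_bp):
    """Average fold coverage if reads were spread uniformly (an idealization)."""
    return library_yield_bp(n_reads, read_len_bp) / genome_bp


ASSUMED_GENOME_BP = 40_000_000  # assumption, not from the slides

# Smallest and largest libraries described on the slide.
low = library_yield_bp(8_000_000, 35)
high = library_yield_bp(12_000_000, 45)
print(f"yield per library: {low / 1e6:.0f} to {high / 1e6:.0f} Mb")
print(f"nominal depth: {nominal_depth(8_000_000, 35, ASSUMED_GENOME_BP):.1f}x "
      f"to {nominal_depth(12_000_000, 45, ASSUMED_GENOME_BP):.1f}x")
```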
44. small RNA Sequencing
[Figure: workflow diagram with an RNA ladder gel image (size marks 14 to 30), a genome browser view, and a drawn hairpin structure]
• Extract RNA
• RNA cloning protocol
• Solexa (Illumina) Sequencing: ~5M 36bp sequences
• Map to Genome
• 1. Look for highly expressed
• Identify conserved secondary structure
Browser view: Ncra_OR74A_chrV_contig7.11, 595.4k to 596.3k. Tracks: DNA_GCContent (% gc); NCBI genes (Broad called): NCU03749, probable hydroxyacylglutathione hydrolase; PASA updated NCBI/Broad genes: [pasa:asmbl_11557,status:12]; Named Genes (Radford laboratory); miRNA Solexa histogram; miRNA predictions: "sRNAwindow sRNAClus128021_w4; StemLength 57"; N.crassa PASA cDNA: asmbl_4339.
Aligned sequences with predicted structures:
>n_crassa
CACGUGGGAUCGGGCACCCAUAAAGGGUCCGGACCCCCCGUCGUGGGCCAAAGCGGGGAACG
(((((((..((((((.((......)))))))).))((((((..((...))..)))))).)))
>n_tetrasperma_2508
CACGUGGGAUCGGGCACCCAUAAAGGGUCCGGACCCCCCGUCGUGGGCCAAAGCGGGGAACG
(((((((..((((((.((......)))))))).))((((((..((...))..)))))).)))
>n_discreta_8579
CACGUGGGAUCGGGCGCCCAAAAAAGGUCCGGGUCCCCCGUCGUGGGCCAAAGCGGGGAACG
((((.((((..((....)).....((.((((.((.((((((..((...))..)))))).)).
>consensus
CACGUGGGAUCGGGCACCCAUAAAGGGUCCGGACCCCCCGUCGUGGGCCAAAGCGGGGAACG
((((.(((.((.((.((((.....)))))).))..)))...))))...((..((((.(.(((
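The structure lines on this slide use dot-bracket notation, where a matching "(" and ")" mark two paired bases and "." marks an unpaired base. As a minimal sketch of how such a string can be read, the hypothetical helper below recovers the paired positions with a stack. It also reports unmatched brackets rather than failing, since the structure strings appear to be clipped in the extracted slide text. The example input is a toy string, not data from the talk.

```python
def parse_dot_bracket(structure):
    """Return (pairs, unmatched) for a dot-bracket string.

    pairs: sorted list of 0-based (open_index, close_index) tuples.
    unmatched: sorted list of indices of brackets with no partner.
    """
    stack = []
    pairs = []
    unmatched = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if stack:
                pairs.append((stack.pop(), i))
            else:
                unmatched.append(i)  # closing bracket with no open partner
        # any other character (e.g. ".") is treated as unpaired
    unmatched.extend(stack)  # opening brackets that were never closed
    return sorted(pairs), sorted(unmatched)


# Toy hairpin: two stacked base pairs closing a two-base loop.
toy_pairs, toy_unmatched = parse_dot_bracket("((..))")
print(toy_pairs, toy_unmatched)
```

Comparing the pair lists from different species is one simple way to check whether a predicted stem is conserved.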
49. SmallRNA seq also covers lots of genic regions
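[Figure: bar chart of the percentage of bases covered in each region class (5'UTR, CDS, 3'UTR, NONE) for four series: smallRNA-Seq Coverage 1, smallRNA-Seq Coverage 2, mRNASeq Coverage 1, mRNASeq Coverage 2. The y-axis runs from 0% to 90%. Bar labels in Mb (4.15, 1.6, 2.8, 5.9, 1.2, 2.3) cannot be matched to individual bars in the extracted text.]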
~20% of reads match tRNAs
50. Size classes of sequenced smallRNA reads
[Figure: two stacked bar charts titled "N.crassa smallRNA Solexa Reads 5' base". x-axis: Read Size (17 to 35); y-axis: Freq of reads (0.0 to 1.0); bars split by 5' base (A, C, G, T).]
• Enrichment of 20-22 with 5' T
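A tabulation like the one plotted on this slide (frequency of each 5' base within each read-size class) can be sketched as follows. The function name and the example reads are hypothetical and only illustrate the counting; they are not data from the talk.

```python
from collections import Counter, defaultdict


def five_prime_base_by_length(reads):
    """Return {read_length: {base: fraction of reads of that length}}."""
    counts = defaultdict(Counter)
    for read in reads:
        read = read.strip().upper()
        if read:
            counts[len(read)][read[0]] += 1
    table = {}
    for length in sorted(counts):
        total = sum(counts[length].values())
        table[length] = {base: n / total for base, n in counts[length].items()}
    return table


# Made-up reads, for illustration only.
example_reads = ["TACG", "TTTT", "AACG", "GGGGG"]
example_table = five_prime_base_by_length(example_reads)
for length, fractions in example_table.items():
    print(length, fractions)
```

An enrichment like the one noted on the slide would show up as a high "T" fraction in the rows for lengths 20 to 22.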
53. 3' UTR, small RNAs, and Folding
[Figure: genome browser view alongside a predicted RNA hairpin structure; the track labels and coordinates are not recoverable from the extracted text.]