This is my first lab presentation during my post-doc in Jonathan Eisen's lab. I discuss new features and changes in HMMER 3. I also discuss how I used the new version to identify Pfams in all 80 samples of the GOS metagenomic datasets, with the hope of testing whether "community profiling" may work.
2. HMMER 3 – What’s new? Much faster: 100× HMMER 2, ≈ BLAST. More sensitive.
3. What’s new? Alignment column confidence. Each residue is given a posterior probability (PP) annotation: * = 95-100%, 9 = 85-95%, 8 = 75-85%, etc. Example alignment of the fn3 model against 7LESS_DROME (the last row is the PP line):
fn3 2 saPenlsvsevtstsltlsWsppkdgggpitgYeveyqekgegeewqevtvprtttsvtltgLepgteYefrVqavngagegp 84
saP ++ + ++ l ++W p + +gpi+gY++++++++++ + e+ vp+ s+ +++L++gt+Y++ + +n++gegp
7LESS_DROME 439 SAPVIEHLMGLDDSHLAVHWHPGRFTNGPIEGYRLRLSSSEGNA-TSEQLVPAGRGSYIFSQLQAGTNYTLALSMINKQGEGP 520
78999999999*****************************9998.**********************************9997 PP
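To make the PP line concrete, here is a minimal Python sketch that decodes each PP character into a percentage range and flags low-confidence columns. The function names `pp_bounds` and `low_confidence_columns` are hypothetical. The slide only spells out `*`, `9`, and `8`; the bins for lower digits and the treatment of `.` as a gap column are assumed to continue the same pattern.

```python
def pp_bounds(ch):
    """Return (low, high) posterior probability in percent for one PP character."""
    if ch == "*":
        return (95, 100)
    if ch == ".":
        return None          # assumed: '.' marks a gap column with no residue
    digit = int(ch)
    if digit == 0:
        return (0, 5)        # assumed: '0' covers the lowest bin
    return (10 * digit - 5, 10 * digit + 5)


def low_confidence_columns(pp_line, min_percent=85):
    """0-based indices of aligned residues whose PP bin starts below min_percent."""
    flagged = []
    for i, ch in enumerate(pp_line):
        bounds = pp_bounds(ch)
        if bounds is not None and bounds[0] < min_percent:
            flagged.append(i)
    return flagged


# Last four characters of the PP line in the fn3 example.
print(low_confidence_columns("9997"))  # [3]
```

A filter like this could be used to trim the uncertain ends of a domain hit before downstream analysis; the 85% cutoff is an arbitrary choice for illustration.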
4. What’s new? Sequence scores, not alignment scores. Scoring just a single best alignment can break down when the target is a remote homolog. HMMER 3 instead scores sequences by integrating over alignment uncertainty.
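The difference between scoring one best alignment and integrating over all alignments can be sketched with a toy hidden Markov model. This is not HMMER's profile HMM or its implementation; the two-state model, its probabilities, and the function names below are made up for illustration. The same recursion gives a best-path (Viterbi-style) score when paths are combined with `max` and an all-paths (Forward-style) score when they are combined with a log-sum.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-probabilities."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def hmm_log_score(obs, start, trans, emit, combine):
    """Log score of obs under a toy HMM.

    combine=max keeps only the best path (Viterbi);
    combine=logsumexp sums over all paths (Forward).
    """
    states = list(start)
    prev = {s: math.log(start[s] * emit[s][obs[0]]) for s in states}
    for symbol in obs[1:]:
        prev = {
            s: math.log(emit[s][symbol])
            + combine([prev[r] + math.log(trans[r][s]) for r in states])
            for s in states
        }
    return combine(list(prev.values()))


# Hypothetical two-state model: "M" (match-like) and "B" (background-like).
START = {"M": 0.5, "B": 0.5}
TRANS = {"M": {"M": 0.8, "B": 0.2}, "B": {"M": 0.2, "B": 0.8}}
EMIT = {"M": {"A": 0.7, "C": 0.3}, "B": {"A": 0.4, "C": 0.6}}

viterbi = hmm_log_score("ACCA", START, TRANS, EMIT, max)
forward = hmm_log_score("ACCA", START, TRANS, EMIT, logsumexp)
print(f"Viterbi (best path only): {viterbi:.3f}")
print(f"Forward (all paths):      {forward:.3f}")
```

The Forward score is never lower than the Viterbi score, and the gap widens when no single path dominates, which is the situation the slide describes for remote homologs.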
5. Single sequence queries. phmmer ≈ BLASTP: search a sequence against a sequence database. jackhmmer ≈ PSI-BLAST: iteratively search a sequence against a sequence database. Internally, both build a profile HMM from the query sequence and then run an HMM search.
6. Small changes. hmmpfam -> hmmscan: search a sequence against a profile HMM database. hmmcalibrate -> built into hmmbuild. hmmpress: creates binary HMM files so hmmscan is faster (similar in idea to formatting BLAST databases with formatdb). New output format options: --tblout (sequence score, best domain score) and --domtblout (sequence score, all domain scores with coordinates). Both give a tabular, whitespace-delimited output without alignments, about 1/5 the file size of the regular output.
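As an illustration of why the tabular output is convenient, here is a minimal Python sketch of a `--tblout` parser. It assumes the column layout described in the HMMER 3 user guide (18 whitespace-separated columns followed by a free-text description, with `#` comment lines); check this against the version you run. The function name `parse_tblout` is hypothetical, and the example row is made up rather than real output.

```python
def parse_tblout(lines):
    """Parse per-sequence hits from --tblout text, skipping '#' comment lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Assumed layout: 18 whitespace-separated columns, then a description.
        f = line.split(None, 18)
        hits.append({
            "target": f[0],              # for hmmscan: the Pfam model
            "query": f[2],               # for hmmscan: the sequence
            "seq_evalue": float(f[4]),
            "seq_score": float(f[5]),
            "best_domain_score": float(f[8]),
        })
    return hits


# Made-up example row; the numbers are illustrative, not real output.
EXAMPLE = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "fn3  PF00041.16  7LESS_DROME  -  1.2e-20  72.1  0.1  3.4e-18  65.0  0.0"
    "  1.5  2  0  0  2  2  2  2  Fibronectin type III domain",
]
for hit in parse_tblout(EXAMPLE):
    print(hit["query"], hit["target"], hit["seq_score"], hit["best_domain_score"])
```

Splitting with a maximum of 18 splits keeps the description intact even though it contains spaces.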
8. Problems/Issues. hmmconvert is used to convert HMMER 2 profiles into HMMER 3 profiles, but it only converts the file format. Good: you get the HMMER 3 speedup. Bad: you keep HMMER 2 sensitivity/specificity. Old HMMER 2 HMMs should be rebuilt using hmmbuild.
9. Glocal vs. local alignments. Local: any portion of the HMM can align to any portion of the sequence. Glocal: the entire HMM is aligned to any portion of the sequence. HMMER 2 had both, but local was not as sensitive as glocal. In HMMER 3, local was improved to the point that glocal was thought to be unnecessary (and was not included). However, some models do very poorly: short, extremely diverse seed alignments, such as zinc finger transcription factors, may be missed.
11. Phylogenetic profiling (Wu, et al., PLOS Genetics, 2005). For C. hydrogenoformans, identified the presence or absence of homologs in all other completely sequenced genomes. Identified many hypothetical proteins that had the same profile as sporulation proteins.
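The core idea of matching presence/absence profiles can be sketched in a few lines of Python. This is a simplified illustration with made-up data, not the pipeline used in the paper; the function name `group_by_profile` and the gene names are hypothetical.

```python
from collections import defaultdict


def group_by_profile(presence):
    """Group genes whose presence/absence pattern across genomes is identical."""
    groups = defaultdict(list)
    for gene, profile in presence.items():
        groups[tuple(profile)].append(gene)
    # Keep only profiles shared by more than one gene.
    return {p: sorted(g) for p, g in groups.items() if len(g) > 1}


# Hypothetical data: 1 = homolog present in that genome, 0 = absent.
PRESENCE = {
    "known_sporulation_gene": [1, 1, 0, 0, 1],
    "hypothetical_protein_A": [1, 1, 0, 0, 1],
    "housekeeping_gene":      [1, 1, 1, 1, 1],
}
print(group_by_profile(PRESENCE))
```

A hypothetical protein that lands in the same group as a gene of known function becomes a candidate for that function.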
13. Community profiling. Look across multiple metagenomic samples. Gene families that have similar profiles may have similar functions. This is similar to using co-expression to identify genes with similar functions.
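One simple way to compare profiles across samples is a correlation between per-sample counts. The sketch below uses Pearson correlation as an assumed similarity measure (the slides do not say which measure was used), with made-up counts and hypothetical names `pearson`, `most_similar`, and `COUNTS`.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length count vectors (not all-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def most_similar(family, counts):
    """Name of the family whose profile correlates best with `family`."""
    others = [
        (pearson(counts[family], row), name)
        for name, row in counts.items()
        if name != family
    ]
    return max(others)[1]


# Made-up hit counts for three families across four samples.
COUNTS = {
    "PfamA": [10, 50, 5, 40],
    "PfamB": [12, 45, 7, 38],
    "PfamC": [40, 5, 50, 8],
}
print(most_similar("PfamA", COUNTS))
```

With real data, a family that is constant across all samples would need to be filtered out first, since its standard deviation is zero.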
14. So what have I done? Downloaded the GOS peptide file: 41M sequences, 80 samples; reduced from 43GB to 7GB by removing extra information; split into ~100 smaller files. Downloaded HMMER 3 Pfams (email request), containing 11098 Pfams. Ran hmmscan on genbeo; 4 days later, 12.5M Pfam predictions. Some sequences contain >1 Pfam. 9643 Pfams. Used “cluster” to group genes and samples.
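The preprocessing step can be sketched as follows. The slide does not say which extra information was removed, so this sketch assumes it was everything after the first token of each FASTA header; the header format in the example is hypothetical, as is the function name `trim_and_split`. It works on lists of lines in memory to stay short; a real run on a multi-gigabyte file would stream to output files instead.

```python
def trim_and_split(fasta_lines, n_chunks):
    """Trim headers to the first token and deal records round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    record_index = -1
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            record_index += 1
            line = line.split()[0]   # assumed: keep only ">ID"
        if record_index >= 0:        # ignore anything before the first header
            chunks[record_index % n_chunks].append(line)
    return chunks


# Hypothetical records with verbose headers.
LINES = [
    ">seq1 /sample=A /length=3\n", "MKV\n",
    ">seq2 /sample=B /length=6\n", "AAA\n", "CCC\n",
    ">seq3 /sample=A /length=3\n", "GGG\n",
]
for i, chunk in enumerate(trim_and_split(LINES, 2)):
    print(i, chunk)
```

Round-robin assignment keeps the chunks close to equal in record count, which helps when each chunk becomes a separate hmmscan job.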
15. Results. Heatmap of GOS metagenomic samples vs. Pfams. Red = above avg. number of Pfams; green = below avg. number of Pfams. Have not normalized for number of sequences per sample or for number of Pfams.
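The first of the missing normalizations, dividing by the number of sequences per sample, could look like the Python sketch below. The data and the names `normalize_counts` and `above_below_average` are made up, and the `+`/`-` marks stand in for the red/green colors; normalizing for the number of Pfams is not covered here.

```python
def normalize_counts(counts, seqs_per_sample):
    """Convert raw Pfam hit counts into hits per sequence for each sample."""
    return {
        pfam: [c / n for c, n in zip(row, seqs_per_sample)]
        for pfam, row in counts.items()
    }


def above_below_average(row):
    """Mark each sample '+' if above the row mean, otherwise '-'."""
    mean = sum(row) / len(row)
    return ["+" if v > mean else "-" for v in row]


# Made-up example: the second sample has four times as many sequences.
COUNTS = {"PfamA": [30, 40]}
SEQS_PER_SAMPLE = [100, 400]

raw_marks = above_below_average(COUNTS["PfamA"])
norm = normalize_counts(COUNTS, SEQS_PER_SAMPLE)
norm_marks = above_below_average(norm["PfamA"])
print("raw:       ", raw_marks)
print("normalized:", norm_marks)
```

In this toy case the larger sample looks enriched before normalization and depleted after it, which is why unnormalized colors should be read with caution.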
17. Future. Community profiling: include other (all) metagenomic samples; try to group Pfams by GO category to see how strong the correlation is between branch length and function; examine whether some functional categories are more easily predicted by this profiling strategy (i.e. HGTs). Identify novel gene families and sub-families: clustering genes, building HMMs, scanning, …repeat. Community profiling may help in the annotation of these.