  1. Bioinformatics Resources and Tools on the Web: A Primer • Joel H. Graber, Center for Advanced Biotechnology, Boston University
  2. Outline • Introduction: What is bioinformatics? • The basics – The five sites that all biologists should know • Some examples – Using the tools in a somewhat less-than-naïve manner • Questions/comments are welcome at all points • Much of this material comes from the Boston University course BF527, Bioinformatic Applications
  3. What is bioinformatics?
  4. Examples of Bioinformatics • Database interfaces – GenBank/EMBL/DDBJ, Medline, SwissProt, PDB, … • Sequence alignment – BLAST, FASTA • Multiple sequence alignment – Clustal, MultAlin, DiAlign • Gene finding – Genscan, GenomeScan, GeneMark, GRAIL • Protein domain analysis and identification – Pfam, BLOCKS, ProDom • Pattern identification/characterization – Gibbs Sampler, AlignACE, MEME • Protein folding prediction – PredictProtein, SWISS-MODEL
  5. Things to know and remember about using web server-based tools • You are using someone else’s computer • You are (probably) getting a reduced set of options or capacity • Servers are great for sporadic or proof-of-principle work, but for intensive work, the software should be obtained and run locally
  6. Five websites that all biologists should know • NCBI (The National Center for Biotechnology Information) • EBI (The European Bioinformatics Institute) • The Canadian Bioinformatics Resource • SwissProt/ExPASy (Swiss Bioinformatics Resource) • PDB (The Protein Data Bank)
  7. NCBI • Entrez interface to databases – Medline/OMIM – GenBank/GenPept/Structures • BLAST server(s) – Five-plus flavors of BLAST • Draft Human Genome • Much, much more…
  8. EBI • SRS database interface – EMBL, SwissProt, and many more • Many server-based tools – ClustalW, DALI, …
  9. SwissProt • Curation!!! – The error rate in the information is greatly reduced in comparison to most other databases • Extensive cross-linking to other data sources • SwissProt is the ‘gold standard’ by which other databases can be measured, and is the best place to start if you have a specific protein to investigate
  10. A few more resources to be aware of • Human Genome Working Draft • TIGR (The Institute for Genomic Research) • Celera • (Model) organism-specific information: – Yeast – Arabidopsis – Mouse – Fruit fly – Nematode • Nucleic Acids Research Database Issue – (First issue every year)
  11. Example 1: Searching a new genome for a specific protein • Specific problem: We want to find the closest match in C. elegans to the D. melanogaster protein NTF1, a transcription factor • First: understanding the different forms of BLAST
  12. The different versions of BLAST
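The table on this slide did not survive extraction. As a rough stand-in, the five standard BLAST programs can be summarized by the type of query they accept, the type of database they search, and what gets translated before comparison. The sketch below is only an illustrative lookup; the dictionary layout and the helper name `programs_for` are assumptions made for this example and are not part of any BLAST distribution.

```python
# Illustrative summary of the five standard BLAST programs.
# Each entry: (query type, database type, what is translated before comparison).
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide", "nothing"),
    "blastp": ("protein", "protein", "nothing"),
    "blastx": ("nucleotide", "protein", "query"),
    "tblastn": ("protein", "nucleotide", "database"),
    "tblastx": ("nucleotide", "nucleotide", "query and database"),
}


def programs_for(query_type, database_type):
    """Return the BLAST programs that accept this query/database pairing.

    `programs_for` is a hypothetical helper written for this example.
    """
    return sorted(
        name
        for name, (query, database, _translated) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    )


if __name__ == "__main__":
    # A protein query against protein or nucleotide databases,
    # as in the NTF1 example on the following slides.
    print(programs_for("protein", "protein"))
    print(programs_for("protein", "nucleotide"))
```

A protein query such as NTF1 can therefore be run with blastp against predicted proteins, or with tblastn against the raw nucleotide sequence, which is the distinction the next two slides rely on.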
  13. 1st Step: Search the proteins • blastp is used to search for C. elegans proteins that are similar to NTF1 • Two reasonable hits are found, but the hits have suspicious characteristics – besides the fact that they weren’t included in the complete genome!
  14. 2nd Step: Search the nucleotides • tblastn is used to search for translations of C. elegans nucleotide sequences that are similar to NTF1 • Now we have only one hit – How are they related?
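To make "translations of nucleotide sequences" concrete: tblastn compares the protein query against the database sequences translated in all six reading frames. The following is a minimal sketch of six-frame translation using the standard genetic code. It is not how BLAST itself is implemented, and the function names and the toy sequence are assumptions made for illustration.

```python
# Standard genetic code, built in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame (+1, +2, +3, -1, -2, -3) to its translated protein string."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence, not taken from any real genome.
    for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(frame, protein)
```

Because the search runs against the genomic sequence itself, it does not depend on any existing gene prediction, which is why it can reveal a single hit where the protein search found two.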
  15. Conclusion: Incorrect gene prediction/annotation • The two predicted proteins have essentially identical annotation • The protein-protein alignments are disjoint and consecutive on the protein • The protein-nucleotide alignment includes both protein-protein alignments in the proper order • Why/how does this happen?
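The reasoning on this slide can be written down as a simple coordinate check: the two protein-protein alignments cover non-overlapping, adjacent stretches of the query, and the single protein-nucleotide alignment covers both. The sketch below is one hedged way to express that test. The `Hsp` record, the `max_gap` threshold, and all coordinates are hypothetical and are not taken from the actual NTF1 search.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    """One alignment, described by its 1-based inclusive span on the query."""
    query_start: int
    query_end: int
    subject_name: str


def disjoint_and_consecutive(first, second, max_gap=20):
    """True if the two spans do not overlap and lie close together on the query.

    `max_gap` (unaligned query residues allowed between the spans) is an
    arbitrary illustrative threshold.
    """
    left, right = sorted([first, second], key=lambda h: h.query_start)
    gap = right.query_start - left.query_end - 1
    return 0 <= gap <= max_gap


def covers(span, hsp):
    """True if `span` contains `hsp` entirely, in query coordinates."""
    return span.query_start <= hsp.query_start and hsp.query_end <= span.query_end


def looks_like_split_gene(protein_hit_a, protein_hit_b, genomic_hit, max_gap=20):
    """Apply both conditions from the slide to one pair of protein hits."""
    return (
        disjoint_and_consecutive(protein_hit_a, protein_hit_b, max_gap)
        and covers(genomic_hit, protein_hit_a)
        and covers(genomic_hit, protein_hit_b)
    )


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    hit_a = Hsp(1, 300, "predicted_protein_A")
    hit_b = Hsp(310, 620, "predicted_protein_B")
    genomic = Hsp(1, 620, "genomic_region")
    print(looks_like_split_gene(hit_a, hit_b, genomic))
```

A positive result is only a hint that one gene may have been annotated as two; the slide's follow-up check with a gene predictor is still needed.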
  16. Final(?) Check: Gene prediction • Genscan is the best available ab initio gene predictor • Genscan’s prediction spans both protein-protein alignments, reinforcing our conclusion of a bad prediction
  17. Ab initio vs. similarity vs. hybrid models for gene finding • Ab initio: The gene looks like the average of many genes – Genscan, GeneMark, GRAIL, … • Similarity: The gene looks like a specific known gene – Procrustes, … • Hybrid: A combination of both – GenomeScan
  18. A similar example: Fruit fly homolog of mRNA localization protein VERA • Same procedure as just described – A tblastn search with BLOSUM45 produces an unexpected exon • Conclusion: Incomplete (as opposed to incorrect) annotation – We have verified the existence of the rare isoform through RT-PCR
  19. Another example: Find all genes with PDZ domains • Multiple methods are possible • The ‘best’ method will depend on many things – How much do you know about the domain? – Do you know the exact extent of the domain? – How many examples do you expect to find?
  20. Some possible methods if the domain is a known domain: • SwissProt – text search capabilities – good annotation of known domains – crosslinks to other databases (domains) • Databases of known domains: – BLOCKS – Pfam – Others (ProDom, ProSite, DOMO, …)
  21. Determination of the nature of conservation in a domain • For new domains, multiple alignment is your best option – Global: ClustalW – Local: DiAlign – Hidden Markov Model: HMMER • For known domains, this work has largely been done for you – BLOCKS – Pfam
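Once a multiple alignment exists, one simple way to look at the "nature of conservation" is to score each column by how often its most common residue occurs. The sketch below assumes the alignment is already available as equal-length strings with `-` for gaps. The scoring rule and the function name are illustrative choices, not the method used by ClustalW, DiAlign, or HMMER.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Score each alignment column by the frequency of its most common residue.

    The score is (count of most common non-gap residue) / (number of sequences),
    so gaps lower the score. Returns one float per column.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    scores = []
    for column in zip(*alignment):
        residues = [residue for residue in column if residue != gap]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


if __name__ == "__main__":
    # Toy alignment, not a real domain family.
    toy_alignment = ["ACDE", "ACD-", "AGDE"]
    for position, score in enumerate(column_conservation(toy_alignment), start=1):
        print(position, round(score, 2))
```

Columns scoring near 1.0 are candidates for the conserved core of a domain; real tools weight sequences and use substitution matrices, which this sketch deliberately omits.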
  22. If you have a protein and want to search it against known domains • Search/analysis tools – Pfam – BLOCKS – PredictProtein
  23. Different representations of conserved domains • BLOCKS – Gapless regions – Often multiple blocks for one domain • Pfam – Statistical model, based on HMM – Since gaps are allowed, most domains have only one Pfam model
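The contrast above can be illustrated with a toy alignment: if only gap-free columns are allowed, a single aligned domain breaks into several separate blocks wherever any sequence has an insertion or deletion. The sketch below finds such gapless runs. It is not the procedure used to build the BLOCKS database, and the function name, the `min_width` threshold, and the example sequences are assumptions made for illustration.

```python
def gapless_blocks(alignment, min_width=3, gap="-"):
    """Return (start, end) column ranges, end exclusive, that contain no gaps.

    Runs shorter than `min_width` columns are discarded; the threshold is an
    arbitrary illustrative choice.
    """
    n_columns = len(alignment[0])
    blocks = []
    start = None
    # Iterate one position past the end so a trailing run is closed.
    for i in range(n_columns + 1):
        has_gap = i == n_columns or any(row[i] == gap for row in alignment)
        if not has_gap and start is None:
            start = i
        elif has_gap and start is not None:
            if i - start >= min_width:
                blocks.append((start, i))
            start = None
    return blocks


if __name__ == "__main__":
    # Toy alignment of one hypothetical domain with two indel regions.
    toy_alignment = [
        "ACDEF--GHIK",
        "ACDEFLMGHIK",
        "ACD-FLMGHIK",
    ]
    for start, end in gapless_blocks(toy_alignment):
        print(start, end, [row[start:end] for row in toy_alignment])
```

A profile HMM, by contrast, models insertions and deletions explicitly, so the same region can be described by one model rather than several blocks.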
  24. Conclusions • We have only touched small parts of the elephant • Trial and error (intelligently) is often your best tool • Keep up with the main five sites, and you’ll have a pretty good idea of what is happening and available