This document outlines a talk on protein function and bioinformatics. It discusses why bioinformatics is needed due to the rapid increase in genomic data. It introduces various bioinformatics tools for tasks like sequence analysis, database searches, and structure prediction. As a case study, it examines the genome of the psychrophilic archaeon Methanococcoides burtonii, identifying cold-adaptation features like CSP-like proteins and modified tRNAs. It emphasizes that bioinformatics provides useful predictions but must be integrated with experimental data.
Protein function and bioinformatics
1. Protein function and bioinformatics
Outline of talk
● Why do we need bioinformatics?
● What tools do we need?
● Case study: The Methanococcoides burtonii genome
Neil Saunders
76-455
n.saunders@uq.edu.au
www.uq.edu.au/~uqnsaun1/
2. Protein function and bioinformatics
Why do we need bioinformatics?
● Rapid increase in data due to genomics
● Too much data to characterise genes/proteins individually
● Bioinformatics = “smart use” of information
● Ideally, computational and experimental biology are partners
3. Protein function and bioinformatics
The ideal computational – wet lab cycle
[Diagram: a cycle linking biological system, biological objects, computational objects, analyses, biological inferences and experiments]
Bioinformatics is about helping biologists solve problems
4. Protein function and bioinformatics
Introduction to genomics
● Genomes Online database: www.genomesonline.org
- Published/complete: 413
- Bacteria in progress: 977
- Eukarya in progress: 629
- Archaea in progress: 57
- Metagenomes: 56
10-50% of genes in a new genome may have no known function
5. Protein function and bioinformatics
Computational skills for genomics
"So what new skills will postdocs need to ensure that
they don't become science relics? The answer is math,
statistics, and knowledge of a scripting language for
computers."
The Scientist, "Bioinformatics Knowledge Vital to Careers"
Volume 16 | Issue 17 | 53 | Sep. 2, 2002
www.thescientist.com
6. Protein function and bioinformatics
Using WWW resources
● The best web resources provide:
- useful tools for analysis
- integrated data from many sources
Good examples
● InterPro database http://www.ebi.ac.uk/interpro/
● Expasy http://au.expasy.org
● UniProt http://www.uniprot.org/
● CBS Prediction servers http://www.cbs.dtu.dk/services/
● IMG Database http://img.jgi.doe.gov/
But...
● Web services are no good for genome-scale analyses
● There are usually limits on data input (with good reason)
Nucleic Acids Research publishes annual database and web server issues: http://nar.oxfordjournals.org/
7. Protein function and bioinformatics
Computational infrastructure for genomics
[Diagram: biological objects and analyses, linked through computational objects]
Biological objects: genome, assembly, gene sequence, protein sequence, protein structure, pathway
Analysis (limitless): sequence analysis, regulatory motifs, structural modeling, phylogeny, comparative genomics, pathway reconstruction
Key points
● Appropriate hardware: workstation v. cluster
● Linux Linux Linux!
● Freely-available, open source software is all you need
● Toolkits and libraries (e.g. BioPerl) to build your own solutions
● Philosophy of “many small tools plus glue” - scripting language
● Website + database skills - sharing
8. Protein function and bioinformatics
BioPerl: a life sciences computational toolkit
● Website: http://www.bioperl.org
● A collection of Perl modules for biology
● Handles many common tasks in sequence/structure analysis, e.g.
- read/write various sequence formats
- run BLAST and parse the output
- read/write/analyse sequence alignments
- access local or remote databases
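As an illustration of the first task in the list above (reading and writing sequence formats), here is a minimal sketch in Python using only the standard library. It is not BioPerl and does not reproduce its API; the function names `read_fasta` and `write_fasta` and the toy records are assumptions made for this example, and only the FASTA format is handled.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is taken as the first word after '>'
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def write_fasta(records, width=60):
    """Return FASTA text for (identifier, sequence) pairs, wrapped at `width`."""
    lines = []
    for identifier, sequence in records:
        lines.append(">" + identifier)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Toy input (invented for illustration only)
example = StringIO(">seq1 hypothetical protein\nMKVL\nAAGT\n>seq2\nMSTN\n")
records = list(read_fasta(example))
print(write_fasta(records, width=4))
```

A toolkit such as BioPerl covers many more formats and edge cases; the point of the sketch is only that each "small tool" can be a few lines that a scripting language glues together.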
9. Protein function and bioinformatics
Annotation (or not) using BLAST
BLAST: Basic Local Alignment Search Tool
● Is useful for finding similar sequences quickly
● Not sensitive – less useful for weakly-similar sequences
● Not much good at all for annotation
Why not?
● “Hypothetical”: the database sequence is unique
● “Conserved hypothetical”: several hits but no known function
● Multi-domain proteins: a hit may cover only one domain
● BLAST database contains incorrect annotations
● Annotation is at the whim of whoever deposited the sequence
Classic example: IMPDH, Wu et al. (2003) Comp. Biol. Chem. 27: 37-47
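The annotation problem described on this slide can be sketched in a few lines of Python. The function below imitates naive "copy the best hit" annotation from a list of hit descriptions. The function name, the keyword list and the example descriptions are all assumptions made for illustration; this is not how any particular annotation pipeline works.

```python
# Words that mark a hit description as carrying no functional information.
# This keyword list is an assumption made for the example.
UNINFORMATIVE = ("hypothetical", "unknown function", "uncharacterized", "uncharacterised")


def label_from_hits(hit_descriptions):
    """Label a query from the descriptions of its database hits, best hit first."""
    if not hit_descriptions:
        # No similar sequence in the database
        return "hypothetical"
    informative = [
        description
        for description in hit_descriptions
        if not any(word in description.lower() for word in UNINFORMATIVE)
    ]
    if not informative:
        # Several hits, but none with a known function
        return "conserved hypothetical"
    # Naive transfer: the first informative description is copied as-is,
    # whether or not the depositor's annotation was correct.
    return informative[0]


print(label_from_hits([]))
print(label_from_hits(["hypothetical protein", "conserved hypothetical protein"]))
print(label_from_hits(["hypothetical protein", "IMP dehydrogenase"]))
```

The last branch is where errors spread: the query inherits whatever text the matching entry carries, and nothing checks whether the hit covers the whole protein or just one domain.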
10. Protein function and bioinformatics
A better annotation tool: InterProScan
● IPRScan (InterProScan) is a tool for searching the InterPro database
● It uses sequence signature profiles – more sensitive than BLAST
● Integrates the search results from multiple databases
● A good first step to characterise a new sequence
● Available as a standalone package that runs on clusters
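To give a feel for what a sequence signature is, the sketch below handles the simplest kind, a PROSITE-style pattern (PROSITE is one of the InterPro member databases), by translating it into a regular expression. This is far simpler than the profile and hidden Markov model searches that InterProScan runs, and it is not InterProScan code. The function names and the toy sequence are assumptions for this example, and only the basic pattern syntax is handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):        # repeat count such as (3) or (2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # a single residue or an [ABC] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_signature(pattern, sequence):
    """Return 0-based start positions of non-overlapping pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [match.start() for match in regex.finditer(sequence)]


# N-glycosylation site pattern, searched in an invented toy sequence
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_signature("N-{P}-[ST]-{P}", "MKNGSAANPSQ"))
```

Unlike a BLAST hit, a signature match says something about a specific site or domain rather than overall similarity, which is why signature searches are a better starting point for characterising a new sequence.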
11. Protein function and bioinformatics
Structure prediction: threading and modelling
● The structure of a protein often explains how it functions
● However, structure determination is laborious, difficult and time-consuming
● Modelling can be useful where the sequence is similar to that of a known structure
Threading: fit query sequence to fold database
Homology modelling: assume similar sequence = similar structure
12. Protein function and bioinformatics
Some modelling tools and databases
● SwissModel: http://swissmodel.expasy.org/
● MODELLER: http://www.salilab.org/modeller/
● PROSPECT: http://compbio.ornl.gov/structure/prospect2/
● ModBase: http://modbase.compbio.ucsf.edu/
13. Protein function and bioinformatics
Introduction to M. burtonii
[Images: M. burtonii; Ace Lake, Vestfold Hills; the Archaea]
Methanococcoides burtonii
● Isolated from Ace Lake, Antarctica (1-2 °C)
● Grows optimally at 23 °C
● Is an archaeon
● Is a psychrophilic methanogen
14. Protein function and bioinformatics
The M. burtonii genome
What features of this genome are related to cold adaptation?
15. Protein function and bioinformatics
Discovery of CSP-like proteins in M. burtonii
● CSP = cold shock protein
● Expressed in bacteria at low temperature
● Functions as RNA chaperone to facilitate transcription at low temperature
● Present in some Archaea, including M. frigidum, but not M. burtonii
16. Protein function and bioinformatics
Discovery of CSP-like proteins in M. burtonii
Workflow: protein sequences -> PROSPECT (thread v. CSD folds) -> MODELLER (structural model)
[Figure: d1sro__ and M. burtonii YP_564958]
● Both proteins are expressed (proteomics)
● Located in a putative exosome/proteasome superoperon
● This is consistent with their proposed function
17. Protein function and bioinformatics
Integrating information: structural RNA study
[Figure: tRNA % GC (stems, all bases) plotted against OGT (°C)]
Is tRNA GC content related to OGT (optimum growth temperature)?
● tRNAscan-SE finds tRNA genes in genomes
● GC content calculated using Perl scripts
Dihydrouridine in M. burtonii
● tRNA contains > 1 dihydrouridine (hU) per tRNA
● Maintains flexibility at low temperature
● DUS gene identified using iprscan
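The GC-content step of the tRNA study is simple enough to sketch. The talk used Perl scripts that are not shown; the Python below is a stand-in written for this transcript, not the author's code. The function names `gc_percent` and `pearson` are assumptions, and the sequence in the usage line is a toy string, not real data.

```python
from math import sqrt


def gc_percent(sequence):
    """Return the percentage of G and C among the A/C/G/T(U) bases of a sequence."""
    bases = [b for b in sequence.upper().replace("U", "T") if b in "ACGT"]
    if not bases:
        return 0.0
    gc_count = sum(1 for b in bases if b in "GC")
    return 100.0 * gc_count / len(bases)


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists.

    Raises ZeroDivisionError if either list has no variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy usage: GC content of one short RNA string
print(gc_percent("GCGGAUUUAGCUC"))
```

With one `gc_percent` value per organism (for example, averaged over its predicted tRNAs) and the matching list of optimum growth temperatures, `pearson` gives a first answer to the question asked on the slide.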
18. Protein function and bioinformatics
Pyrrolysine: a problem for bioinformatics
● Proteomics used to identify expressed proteins
● One is trimethylamine methyltransferase (TMA-MT)
● It shows post-translational modification
● It also maps to 2 ORFs in the genome sequence
● The ORFs are actually one gene with a read-through UAG codon
● Pyrrolysine is incorporated at the UAG
● This is the 22nd genetically-encoded amino acid
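Why a read-through UAG confuses gene prediction can be shown with a toy example. The sketch below splits one reading frame at stop codons, first treating TAG (UAG in the mRNA) as a stop and then as a sense codon. The function name and the 21-base sequence are invented for illustration; real gene finders are considerably more involved.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})


def orf_segments(dna, stops=STANDARD_STOPS):
    """Split the frame-0 codons of `dna` into runs separated by stop codons."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    segments, current = [], []
    for codon in codons:
        if codon in stops:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(codon)
    if current:
        segments.append(current)
    return segments


# Invented toy gene: ATG GCT AAA TAG GGT CTG TAA
toy_gene = "ATGGCTAAATAGGGTCTGTAA"

# Standard rules: the in-frame TAG splits the gene into two ORFs
print(len(orf_segments(toy_gene)))

# Treating TAG as a sense codon (read-through) gives a single ORF
print(len(orf_segments(toy_gene, stops={"TAA", "TGA"})))
```

A predictor that only knows the standard stop codons will report the two fragments as separate ORFs, which is why the proteomics result was needed to recognise them as one gene.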
19. Protein function and bioinformatics
Statistical analysis of protein properties
Archaea: 27 organisms, 62 338 ORFs
Bacteria: 52 organisms, 165 192 ORFs
-> Amino acid frequency (bioperl)
-> data matrix: organisms (rows) x composition (columns)
-> PCA: principal components (R stats package)
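The pipeline on this slide (amino acid frequencies, then an organisms x composition matrix, then PCA) can be sketched in pure Python. The talk used BioPerl and the R stats package; the code below is a stand-in, not the author's scripts. It swaps in power iteration, which recovers only the first principal component, where R would return all of them. The function names and the tiny test matrix are assumptions for this example.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(proteins):
    """Return the fraction of each standard amino acid over a list of protein sequences."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for protein in proteins:
        for residue in protein.upper():
            if residue in counts:
                counts[residue] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def first_principal_component(matrix, iterations=200):
    """Approximate PC1 (loadings and row scores) of a rows x columns matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centred = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]

    def project(vector):
        return [sum(row[j] * vector[j] for j in range(n_cols)) for row in centred]

    # Start from the centred row with the largest norm. A vector of ones would
    # fail for composition data, because every centred row sums to zero.
    vector = list(max(centred, key=lambda row: sum(x * x for x in row)))
    for _ in range(iterations):
        scores = project(vector)
        # Multiply by X^T X without building the covariance matrix explicitly
        vector = [sum(centred[i][j] * scores[i] for i in range(n_rows))
                  for j in range(n_cols)]
        norm = sqrt(sum(x * x for x in vector))
        if norm == 0.0:
            raise ValueError("matrix has no variance")
        vector = [x / norm for x in vector]
    return vector, project(vector)


# Toy usage: one row per organism would be composition(list_of_its_proteins)
loadings, scores = first_principal_component([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
print(loadings, scores)
```

The sign of a principal component is arbitrary, so loadings and scores may come out negated relative to another implementation. For a real analysis, a full PCA routine is the better choice, since the talk interprets two components rather than one.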
20. Protein function and bioinformatics
Principal components analysis of composition
● 2 components explain most of the variation in amino acid composition
● PC1 correlates with genome GC content
● PC2 correlates with optimum growth temperature
● The psychrophilic archaea are distinguished by PC2 score
● Their proteins contain: more Gln, Ser, Thr, His, Asp; less Leu, Trp and Glu
21. Protein function and bioinformatics
Conclusions
● Computational biology and bioinformatics are essential to modern biology
● Many tools are available to annotate proteins: web-based and standalone
● Without experiments, bioinformatics is just predictions
● Data integration is our biggest problem
www.uq.edu.au/~uqnsaun1/