Bioinformatics t7-protein structure-v2013_wim_vancriekinge

FBW
19-11-2013

Wim Van Criekinge

The reason for “bioinformatics” to exist ?

• empirical finding: if two biological
sequences are sufficiently similar, almost
invariably they have similar biological
functions and will be descended from a
common ancestor.
• (i) function is encoded into
sequence, this means: the sequence
provides the syntax and
• (ii) there is a redundancy in the
encoding, many positions in the
sequence may be changed without
perceptible changes in the function, thus
the semantics of the encoding is robust.

Protein Structure

Introduction
Why ?
How do proteins fold ?
Levels of protein structure
0,1,2,3,4
X-ray / NMR
The Protein Database (PDB)
Protein Modeling
Bioinformatics & Proteomics
Weblems

Why protein structure ?

• Proteins perform a variety of cellular
tasks in the living cells
• Each protein adopts a particular folding
that determines its function
• The 3D structure of a protein can bring
into close proximity residues that are far
apart in the amino acid sequence
• Catalytic site: Business End of the
molecule

Rationale for understanding protein structure and function
Protein sequence
- large numbers of sequences, including whole genomes

?
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution

structure determination
structure prediction

Protein structure
- three dimensional
- complicated
- mediates function

homology
rational mutagenesis
biochemical analysis
model studies

About the use of protein models (Peitsch)

• Structure is preserved under evolution when
sequence is not
– Interpreting the impact of mutations/SNPs and conserved
residues on protein function. Potential link to disease
• Function ?
– Biochemical: the chemical interactions occurring in a protein
– Biological: role within the cell
– Phenotypic: the role in the organism

• Gene Ontology functional classification !

– Prioritisation of residues to mutate to determine protein
function
– Providing hints for protein function: catalytic mechanisms
of enzymes often require key residues to be close
together in 3D space
– (protein-ligand complexes, rational drug design, putative
interaction interfaces)

MISSENSE MUTATION
e.g. Sickle Cell Anaemia
Cause: defective haemoglobin due to a mutation in the β-globin gene
Symptoms: severe anaemia and death in homozygotes

Normal β-globin - 146 amino acids

Position         1     2     3     4     5     6     7
Normal β-globin  val - his - leu - thr - pro - glu - glu - ...
Mutant β-globin  val - his - leu - thr - pro - val - glu - ...

Codon for amino acid 6
          Normal gene   Mutant gene
DNA       CTC           CAC
mRNA      GAG           GUG
Product   Glu           Val
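The codon change can be traced in a few lines of Python. This is a minimal sketch, assuming the DNA triplets on the slide are template-strand bases paired position by position with the mRNA. The function names are illustrative, and the codon table holds only the two codons shown.

```python
# Base pairing from template DNA to mRNA (position by position, as on the slide).
TEMPLATE_TO_MRNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Only the two codons that appear on the slide; not a full genetic code table.
CODON_TO_AA = {"GAG": "Glu", "GUG": "Val"}


def transcribe_template(dna: str) -> str:
    """Return the mRNA bases that pair with a template DNA triplet."""
    return "".join(TEMPLATE_TO_MRNA[base] for base in dna)


def product(dna: str) -> str:
    """Amino acid encoded by the mRNA codon derived from a template triplet."""
    return CODON_TO_AA[transcribe_template(dna)]


normal = product("CTC")   # Glu
mutant = product("CAC")   # Val
```

A single base change (T to A in the template) is enough to replace glutamate with valine at position 6.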

Protein Conformation

• Christian Anfinsen
Studies on reversible denaturation
“Sequence specifies conformation”
• Chaperones and disulfide
interchange enzymes:
involved but not controlling final state, they
provide environment to refold if misfolded

• Structure implies function: The amino
acid sequence encodes the protein’s
structural information

How does a protein fold ?

• by itself:
– Anfinsen had developed what he called his
"thermodynamic hypothesis" of protein folding to explain
the native conformation of amino acid structures. He
theorized that the native or natural conformation occurs
because this particular shape is thermodynamically the
most stable in the intracellular environment. That is, it
takes this shape as a result of the constraints of the
peptide bonds as modified by the other chemical and
physical properties of the amino acids.
– To test this hypothesis, Anfinsen unfolded the RNase
enzyme under extreme chemical conditions and observed
that the enzyme's amino acid structure refolded
spontaneously back into its original form when he returned
the chemical environment to natural cellular conditions.
– "The native conformation is determined by the totality of
interatomic interactions and hence by the amino acid
sequence, in a given environment."

The Basics

• Proteins are linear heteropolymers: one or more
polypeptide chains
• Below about 40 residues the term peptide is frequently
used.
• A certain number of residues is necessary to perform a
particular biochemical function, and around 40-50
residues appears to be the lower limit for a functional
domain size.
• Protein sizes range from this lower limit to several
hundred residues in multi-functional proteins.
• Three-dimensional shapes (folds) adopted vary
enormously
• Experimental methods:
– X-ray crystallography
– NMR (nuclear magnetic resonance)
– Electron microscopy
– Ab initio calculations …


• Zeroth: amino acid composition
(proteomics, %cysteine, %glycine)
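The zeroth level can be computed directly from a sequence string. The sketch below is one possible way to do it; the function name `composition` and the example sequence are made up for illustration.

```python
from collections import Counter


def composition(seq: str) -> dict:
    """Percentage of each amino acid (one-letter code) in a sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


example = composition("CCGA")   # hypothetical four-residue peptide
pct_cys = example.get("C", 0.0)
pct_gly = example.get("G", 0.0)
```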

Amino Acid Residues

The basic structure of an α-amino acid is quite simple. R denotes any one of the
20 possible side chains (see table below). We notice that the Cα atom has 4
different ligands (the H is omitted in the drawing) and is thus chiral. An easy
trick to remember the correct L-form is the CORN rule: when the Cα atom is
viewed with the H in front, the residues read "CO-R-N" in a clockwise
direction.


• Primary: This is simply the order of
covalent linkages along the
polypeptide chain, i.e. the sequence
itself


• Secondary
– Local organization of the protein backbone: alpha-helix, beta-strand (which assemble into beta-sheets), turn and interconnecting loop.

A Practical Approach: Interpretation

• Residues with hydrophobic properties
conserved at i, i+2, i+4, separated by
unconserved or hydrophilic residues,
suggest surface beta-strands.
• A short run of hydrophobic amino acids
(4 residues) suggests a buried beta-strand.
• Pairs of conserved hydrophobic amino
acids separated by pairs of
unconserved or hydrophilic residues
suggest an alpha-helix with one face
packing in the protein core.
• Likewise, an i, i+3, i+4, i+7 pattern of
conserved hydrophobic residues suggests
an alpha-helix.
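These spacing rules can be tried on a single sequence with a small scan. This is a simplified sketch: the slide speaks of conserved positions in an alignment, whereas the code only looks at one sequence. The hydrophobic residue set and the function name are assumptions made for illustration.

```python
# Assumed hydrophobic set (one-letter codes); other definitions exist.
HYDROPHOBIC = set("VILMFW")

BETA_SURFACE = (0, 2, 4)        # i, i+2, i+4
HELIX_FACE = (0, 3, 4, 7)       # i, i+3, i+4, i+7


def pattern_starts(seq: str, offsets: tuple) -> list:
    """Start positions i (0-based) where every i+offset residue is hydrophobic."""
    span = max(offsets)
    return [
        i
        for i in range(len(seq) - span)
        if all(seq[i + off] in HYDROPHOBIC for off in offsets)
    ]


helix_hits = pattern_starts("LSSLLSSL", HELIX_FACE)
strand_hits = pattern_starts("VSVSV", BETA_SURFACE)
```

A hit is only a hint: the intervening positions are not checked here, and the rule is stronger when the pattern is conserved across homologous sequences.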

Secondary structure prediction ?

Secondary structure prediction:CHOU-FASMAN

• Chou, P.Y. and Fasman, G.D. (1974).
Conformational parameters for amino acids in helical, sheet, and random coil regions calculated from proteins.
Biochemistry 13, 211-221.
• Chou, P.Y. and Fasman, G.D. (1974).
Prediction of protein conformation.
Biochemistry 13, 222-245.


• Method
• Assigning a set of prediction values to a
residue, based on statistical analysis of 15
proteins
• Applying a simple algorithm to those
numbers


Calculation of preference parameters
For each of the 20 residues and each secondary structure (α-helix, β-sheet and β-turn):

$$P = \log\left(\frac{\text{observed counts}}{\text{expected counts}}\right) + 1.0$$

• Preference parameter > 1.0 → specific residue has a
preference for the specific secondary structure.
• Preference parameter = 1.0 → specific residue neither
prefers nor dislikes the specific secondary
structure.
• Preference parameter < 1.0 → specific residue dislikes the
specific secondary structure.
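As a small worked example, the formula can be evaluated as written on the slide. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the function name and the counts are invented for illustration.

```python
import math


def preference(observed: float, expected: float) -> float:
    """Preference parameter as stated on the slide: log(observed/expected) + 1.0.

    Assumption: natural logarithm (the slide does not give the base).
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("counts must be positive")
    return math.log(observed / expected) + 1.0


neutral = preference(10, 10)    # observed equals expected
favoured = preference(20, 10)   # seen twice as often as expected
avoided = preference(5, 10)     # seen half as often as expected
```

Whatever the base, the value crosses 1.0 exactly where observed and expected counts are equal, which is what the three interpretation rules rely on.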


Preference parameters

Residue  P(a)   P(b)   P(t)   f(i)    f(i+1)  f(i+2)  f(i+3)
Ala      1.45   0.97   0.57   0.049   0.049   0.034   0.029
Arg      0.79   0.90   1.00   0.051   0.127   0.025   0.101
Asn      0.73   0.65   1.68   0.101   0.086   0.216   0.065
Asp      0.98   0.80   1.26   0.137   0.088   0.069   0.059
Cys      0.77   1.30   1.17   0.089   0.022   0.111   0.089
Gln      1.17   1.23   0.56   0.050   0.089   0.030   0.089
Glu      1.53   0.26   0.44   0.011   0.032   0.053   0.021
Gly      0.53   0.81   1.68   0.104   0.090   0.158   0.113
His      1.24   0.71   0.69   0.083   0.050   0.033   0.033
Ile      1.00   1.60   0.58   0.068   0.034   0.017   0.051
Leu      1.34   1.22   0.53   0.038   0.019   0.032   0.051
Lys      1.07   0.74   1.01   0.060   0.080   0.067   0.073
Met      1.20   1.67   0.67   0.070   0.070   0.036   0.070
Phe      1.12   1.28   0.71   0.031   0.047   0.063   0.063
Pro      0.59   0.62   1.54   0.074   0.272   0.012   0.062
Ser      0.79   0.72   1.56   0.100   0.095   0.095   0.104
Thr      0.82   1.20   1.00   0.062   0.093   0.056   0.068
Trp      1.14   1.19   1.11   0.045   0.000   0.045   0.205
Tyr      0.61   1.29   1.25   0.136   0.025   0.110   0.102
Val      1.14   1.65   0.30   0.023   0.029   0.011   0.029


Applying algorithm
1. Assign parameters to each residue (the thresholds below use the table values multiplied by 100, so 100 corresponds to 1.00).
2. Identify regions where 4 out of 6 residues have P(a)>100: α-helix. Extend the
helix in both directions until four contiguous residues have an average
P(a)<100: end of α-helix. If the segment is longer than 5 residues and P(a)>P(b):
α-helix.
3. Repeat this procedure to locate all of the helical regions.
4. Identify regions where 3 out of 5 residues have P(b)>100: β-sheet. Extend the
sheet in both directions until four contiguous residues have an average
P(b)<100: end of β-sheet. If P(b)>105 and P(b)>P(a): β-sheet.
5. Rest: P(a)>P(b) → α-helix. P(b)>P(a) → β-sheet.
6. To identify a bend at residue number i, calculate the following value:
p(t) = f(i)f(i+1)f(i+2)f(i+3)
If: (1) p(t) > 0.000075; (2) average P(t)>1.00 in the tetrapeptide; and (3)
averages for the tetrapeptide obey P(a)<P(t)>P(b): β-turn.
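Two of these steps can be sketched in Python: the helix nucleation test (4 of 6 residues with P(a) above threshold) and the turn test at residue i. This is a partial illustration, not a full Chou-Fasman implementation: helix extension, sheet detection and overlap resolution are left out. Parameter values are copied from the preference table and used unscaled, so the threshold is 1.00 rather than 100. Function names are illustrative.

```python
# Per residue: P(a), P(b), P(t), f(i), f(i+1), f(i+2), f(i+3)
PARAMS = {
    "A": (1.45, 0.97, 0.57, 0.049, 0.049, 0.034, 0.029),
    "R": (0.79, 0.90, 1.00, 0.051, 0.127, 0.025, 0.101),
    "N": (0.73, 0.65, 1.68, 0.101, 0.086, 0.216, 0.065),
    "D": (0.98, 0.80, 1.26, 0.137, 0.088, 0.069, 0.059),
    "C": (0.77, 1.30, 1.17, 0.089, 0.022, 0.111, 0.089),
    "Q": (1.17, 1.23, 0.56, 0.050, 0.089, 0.030, 0.089),
    "E": (1.53, 0.26, 0.44, 0.011, 0.032, 0.053, 0.021),
    "G": (0.53, 0.81, 1.68, 0.104, 0.090, 0.158, 0.113),
    "H": (1.24, 0.71, 0.69, 0.083, 0.050, 0.033, 0.033),
    "I": (1.00, 1.60, 0.58, 0.068, 0.034, 0.017, 0.051),
    "L": (1.34, 1.22, 0.53, 0.038, 0.019, 0.032, 0.051),
    "K": (1.07, 0.74, 1.01, 0.060, 0.080, 0.067, 0.073),
    "M": (1.20, 1.67, 0.67, 0.070, 0.070, 0.036, 0.070),
    "F": (1.12, 1.28, 0.71, 0.031, 0.047, 0.063, 0.063),
    "P": (0.59, 0.62, 1.54, 0.074, 0.272, 0.012, 0.062),
    "S": (0.79, 0.72, 1.56, 0.100, 0.095, 0.095, 0.104),
    "T": (0.82, 1.20, 1.00, 0.062, 0.093, 0.056, 0.068),
    "W": (1.14, 1.19, 1.11, 0.045, 0.000, 0.045, 0.205),
    "Y": (0.61, 1.29, 1.25, 0.136, 0.025, 0.110, 0.102),
    "V": (1.14, 1.65, 0.30, 0.023, 0.029, 0.011, 0.029),
}


def helix_nuclei(seq: str, window: int = 6, needed: int = 4) -> list:
    """Start positions of windows with at least `needed` residues having P(a) > 1.00."""
    hits = []
    for start in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[start:start + window] if PARAMS[aa][0] > 1.00)
        if formers >= needed:
            hits.append(start)
    return hits


def is_turn(seq: str, i: int) -> bool:
    """Apply the three turn conditions to the tetrapeptide starting at position i."""
    tetra = seq[i:i + 4]
    if len(tetra) < 4:
        return False
    p_t = 1.0
    for offset, aa in enumerate(tetra):
        p_t *= PARAMS[aa][3 + offset]   # f(i), f(i+1), f(i+2), f(i+3)

    def average(column: int) -> float:
        return sum(PARAMS[aa][column] for aa in tetra) / 4

    avg_a, avg_b, avg_t = average(0), average(1), average(2)
    return p_t > 0.000075 and avg_t > 1.00 and avg_t > avg_a and avg_t > avg_b
```

For instance, `helix_nuclei("EEEEGGG")` reports a nucleus only at position 0, and `is_turn("NPGS", 0)` passes all three conditions while `is_turn("EEEE", 0)` fails the first.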


Successful method?
19 proteins evaluated:
• Successful in locating 88% of helical and 95% of
β regions
• Correctly predicting 80% of helical and 86% of sheet residues
• Accuracy of predicting the three conformational
states for all residues (helix, β, and coil) is 77%
Chou & Fasman: successful method
After 1974: improvement of preference parameters

Sander-Schneider: Evolution of overall structure

• Naturally occurring sequences with more than
20% sequence identity over 80 or more
residues always adopt the same basic
structure (Sander and Schneider 1991)
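The rule of thumb can be checked on a pairwise alignment. The sketch below takes the thresholds from the slide (more than 20% identity over 80 or more residues). It assumes that identity is counted over aligned, non-gap columns, which is only one of several definitions in use; the function names are illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def likely_same_fold(identity: float, aligned_length: int) -> bool:
    """Thresholds as quoted on the slide: >20% identity over >=80 residues."""
    return identity > 20.0 and aligned_length >= 80


demo_identity = percent_identity("ACDE", "ACDF")
```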

Sander-Schneider

• HSSP: Homology-derived Secondary Structure of Proteins

Structural Family Databases

• SCOP:
– Structural Classification of
Proteins

• FSSP:
– Family of Structurally Similar
Proteins

• CATH:
– Class, Architecture, Topology,
Homology


• Tertiary
– Packing of secondary structure
elements into a compact spatial unit
– Fold or domain: this is the level to
which structure prediction is currently possible

Domains

• Protein dissection into domains
• Conserved Domain Architecture
Retrieval Tool (CDART) uses
information in Pfam and SMART to
assign domains along a sequence
• (automatic when blasting)

Domains

• From the analysis of alignment of protein
families
• Conserved sequence features, usually
associated with a specific function
• PROSITE database of protein
“signatures” (large number of false positives and
false negatives)
• From alignment of homologous sequences
(PRINTS/PRODOM)
• From Hidden Markov Models (PFAM)
• Meta approach: INTERPRO

Levels of protein structure: Topology

Hydrophobicity Plot
P53_HUMAN (P04637) human cellular tumor antigen p53
Kyte-Doolittle hydrophilicity, window=19
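A plot of this kind is a sliding-window average over a per-residue scale. The sketch below uses the Kyte-Doolittle hydropathy values (positive means hydrophobic) and the window of 19 named on the slide. The function name is illustrative, and the plotting itself is left out.

```python
# Kyte-Doolittle hydropathy values (one-letter amino acid codes).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> list:
    """Mean Kyte-Doolittle value for every full window along the sequence."""
    if window > len(seq):
        return []
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


profile = hydropathy_profile("A" * 25)   # hypothetical poly-Ala stretch
```

Each value belongs to the centre of its window; long stretches with a high average are candidates for membrane-spanning segments.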

The ‘positive inside’ rule
(EMBO J. 5:3021; EJB 174:671,205:1207; FEBS lett. 282:41)

Bacterial IM
In: 16% KR out: 4% KR
Eukaryotic PM
Thylakoid membrane
Mitochondrial IM

GPCR Topology

• Membrane-bound receptors

• Transducing messages such as photons, organic odorants,
nucleotides, nucleosides, peptides, lipids and proteins.
• 6 different families
• A very large number of different domains both to
bind their ligand and to activate G proteins.
• Pharmaceutically the most important class
• Challenge: Methods to find novel GPCRs in human genome
…

GPCR Topology

GPCR Structure

• Seven transmembrane regions
• Hydrophobic/ hydrophilic domains
• Conserved residues and motifs (e.g. NPXXY)
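A motif such as NPXXY, with X standing for any residue, maps directly onto a regular expression. The sketch below reports 0-based start positions and uses a lookahead so that overlapping matches are not missed. The function name and test sequences are made up.

```python
import re


def find_motif(seq: str, pattern: str = "NP..Y") -> list:
    """0-based start positions of every (possibly overlapping) motif match."""
    return [m.start() for m in re.finditer(f"(?={pattern})", seq)]


hits = find_motif("AAANPLIYAA")   # hypothetical sequence fragment
```

A match on its own is weak evidence: short motifs occur by chance, so position relative to the predicted transmembrane regions matters.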

GPCR Topology

E.g. plot conserved residues (or multiple alignment: MSA to SSA)


• Quaternary
– Difficult to predict
– Functional units:
apoptosome, proteasome

What is X-ray Crystallography

• X-ray crystallography is an experimental
technique that exploits the fact that X-rays are
diffracted by crystals.
• X-rays have the proper wavelength (in the
Ångström range, ~10⁻⁸ cm) to be scattered by
the electron cloud of an atom of comparable
size.
• Based on the diffraction pattern obtained from
X-ray scattering off the periodic assembly of
molecules or atoms in the crystal, the electron
density can be reconstructed.
• A model is then progressively built into the
experimental electron density, refined against
the data and the result is a quite accurate
molecular structure.

NMR or Crystallography ?

• NMR uses protein in solution
– Can look at the dynamic properties of the protein structure
– Can look at the interactions between the protein and
ligands, substrates or other proteins
– Can look at protein folding
– Sample is not damaged in any way
– The maximum size of a protein for NMR structure determination is ~30
kDa. This eliminates ~50% of all proteins
– High solubility is a requirement

• X-ray crystallography uses protein crystals
– No size limit: as long as you can crystallise it
– Solubility requirement is less stringent
– Simple definition of resolution
– Direct calculation from data to electron density and back again
– Crystallisation is the process bottleneck; binary (all or nothing)
– Phase problem: relies on heavy atom soaks or SeMet incorporation

• Both techniques require large amounts of pure protein and require
expensive equipment!

Visualizing Structures

Cn3D version 4.0 (NCBI)


Ball: Van der Waals radius
Stick: length joins center

N, blue/O, red/S, yellow/C, gray (green)


From N to C


• Demonstration of Protein explorer
• PDB, install Chime
• Search helicase (select structure where
DNA is present)
• Stop spinning, hide water molecules
• Show basic residues, interact with
negatively charged backbone
• RASMOL / Cn3D

Protein Structure
Molecular Modeling:
building a 3D protein structure
from its sequence

Modeling

• Finding a structural homologue
• Blast
–versus PDB database or PSIblast (E<0.005)
–Domain coverage at least 60%

• Avoid gaps
–Prefer few gaps and
reasonable similarity scores
over lots of gaps and high
similarity scores
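The selection criteria can be written as a filter over search hits. This is a sketch under stated assumptions: the `Hit` record and its fields are hypothetical and do not correspond to any BLAST output format. The thresholds (E < 0.005, coverage of at least 60%) are the ones on the slide. Ranking by gap count first is one way to express the preference for few gaps.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one template hit (not a real BLAST record)."""
    template_id: str
    evalue: float
    coverage: float   # fraction of the query domain covered, 0.0 to 1.0
    gaps: int


def usable_templates(hits, max_evalue=0.005, min_coverage=0.60):
    """Keep hits passing both thresholds, ranked by fewest gaps, then E-value."""
    kept = [h for h in hits if h.evalue < max_evalue and h.coverage >= min_coverage]
    return sorted(kept, key=lambda h: (h.gaps, h.evalue))


candidates = [
    Hit("tmplA", 1e-30, 0.95, 12),   # strong score, many gaps
    Hit("tmplB", 1e-10, 0.80, 1),    # weaker score, few gaps
    Hit("tmplC", 0.5, 0.90, 0),      # fails the E-value cut
    Hit("tmplD", 1e-20, 0.40, 0),    # fails the coverage cut
]
ranked = [h.template_id for h in usable_templates(candidates)]
```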

Modeling
• Extract “template” sequences and align with query
• Watch out for missing data (PDB file) and complement with additional
templates
• Try to get as much information as possible, X/NMR
• Sequence alignment from structure comparison of templates (SSA) can be
different from a simple sequence alignment
• >40% identity, any alignment method is OK
• <40%, checks are essential
– Residue conservation checks in functional regions (patterns/motifs)
– Indels: combine gaps separated by few residues
– Manual editing: move gaps from secondary elements to loops
– Within loops, move gaps to loop ends, i.e. turnaround point of backbone
• Align templates structurally, extract the corresponding SSA or QTA
(Query/template alignment)

Modeling

Input for model building
• Query sequence (the one you want the 3D
model for)
• Template sequences and structures
• Query/Template(s) (structure) sequence
alignment

Modeling

• Methods (for details on these, see paper):
– WHATIF,
– SWISS-MODEL,
– MODELLER,
– ICM,
– 3D-JIGSAW,
– CPH-models,
– SDC1

Modeling

• Model evaluation (how good is the prediction,
how much can the algorithm rely on and extract from
the provided templates)
– PROCHECK
– WHATIF
– ERRAT

• CASP (Critical Assessment of Structure
Prediction)
– Best method is manual alignment editing !

Comparative modelling at CASP

              BC         CASP1     CASP2     CASP3     CASP4
alignment     excellent  poor      fair      fair      fair
side chain    ~ 80%      ~ 50%     ~ 75%     ~ 75%     ~ 75%
short loops   1.0 Å      ~ 3.0 Å   ~ 1.0 Å   ~ 1.0 Å   ~ 1.0 Å
longer loops  2.0 Å      > 5.0 Å   ~ 3.0 Å   ~ 2.5 Å   ~ 2.0 Å

CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity
T128/sodm – 1.0 Å (198 residues; 50%)
T111/eno – 1.7 Å (430 residues; 51%)
T122/trpa – 2.9 Å (241 residues; 33%)
T125/sp18 – 4.4 Å (137 residues; 24%)
T112/dhso – 4.9 Å (348 residues; 24%)
T92/yeco – 5.6 Å (104 residues; 12%)

Protein Engineering / Protein Design
