Protein sequencing and its application in bioinformatics
1. PROTEIN SEQUENCING AND
ITS APPLICATION IN
BIOINFORMATICS
BY,
ARINDAM CHAKRABORTY
M.PHARM, 2ND SEMESTER
PHARMACEUTICAL BIOTECHNOLOGY
CIPT AND AHS
2. CONTENTS
1. Introduction
2. History
3. Prepare the proteins for sequencing
4. Sequencing methods
5. N-terminal sequencing
6. C-terminal sequencing
7. DNA sequencing
8. Protein mass spectrometry
9. Bioinformatics tools
3. INTRODUCTION
1. Protein:
A polymer of amino acids.
Protein structure and function depend on the amino acid sequence.
2. Protein Sequencing:
A technique for determining the amino acid sequence of a protein.
Important for understanding cellular functions.
Important for targeting drugs to specific metabolic pathways.
4. HISTORY
1951: Fred Sanger and co-workers characterized the first amino acid sequence of a protein
chain, from insulin. Sanger later developed a separate chain-termination method for DNA
(1977). This “SANGER METHOD” was a milestone in sequencing long-strand molecules such as
DNA and was eventually used in the Human Genome Project.
1969: Analysis of tRNA sequences was used to infer residue interactions from
correlated changes in nucleotide sequence, giving rise to a model of tRNA secondary structure.
1970: Saul B. Needleman and Christian D. Wunsch published the first computer algorithm
for aligning two sequences.
1977: Publication of the first complete genome sequence of a bacteriophage (φX174).
5. PREPARE THE PROTEINS FOR SEQUENCING
1. If the protein contains more than one polypeptide chain, the chains are separated and purified.
2. Intrachain S-S (disulfide) cross-bridges between cysteine residues in the polypeptide chain are
cleaved. If these disulfides are interchain linkages, then step 2 precedes step 1.
3. The amino acid composition of each polypeptide chain is determined.
4. The N-terminal and C-terminal residues are identified.
5. Each polypeptide chain is cleaved into smaller fragments.
6. The sequence of each peptide fragment is determined.
7. The overall amino acid sequence of the protein is reconstructed from the sequences in
overlapping fragments.
8. The positions of S-S cross-bridges formed between cysteine residues are located.
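The reconstruction from overlapping fragments described above can be sketched in a few lines of Python. This is only an illustrative greedy merge, not a published assembly method: the function names `overlap` and `assemble` and the example peptide are hypothetical, and very short chance overlaps could mislead it on real data.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments):
    """Greedily merge the two fragments with the largest overlap until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # remaining fragments cannot be ordered from overlaps alone
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags


# Hypothetical fragments from two different cleavage reactions of one chain.
cleavage_a = ["MKTAYIAK", "QRQISFVK"]
cleavage_b = ["MKTAY", "IAKQRQISF", "VK"]
print(assemble(cleavage_a + cleavage_b))
```

The second set of fragments supplies the overlaps that order the first set, which is why two cleavage methods with different specificities are used.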
6. Separation of Polypeptide Chains:
Subunit associations in multimeric proteins are typically maintained solely by
noncovalent forces, and therefore most multimeric proteins can usually be
dissociated by exposure to pH extremes, 8 M urea, 6 M guanidinium hydrochloride,
or high salt concentrations.
Cleavage of Disulfide Bridges:
Oxidation of a disulfide by performic acid results in the formation of two
equivalents of cysteic acid.
9. EDMAN'S DEGRADATION
METHOD
Principle:
It sequentially removes one residue at a time from the amino end of a peptide.
Mechanism:
• Phenyl isothiocyanate reacts with the uncharged N-terminal amino group to form a phenylthiocarbamoyl
derivative.
• Under acidic conditions, the N-terminal residue is then cleaved off as a thiazolinone derivative, leaving the rest of the peptide intact.
• The thiazolinone derivative is extracted into organic solvent and treated with acid to form the more stable
phenylthiohydantoin (PTH) amino acid, which can be identified by chromatography.
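The cycle-by-cycle logic of the degradation can be pictured with a small Python sketch. It models only the bookkeeping (one N-terminal residue released per cycle), not the chemistry. The function name `edman_cycles` and the example peptide are illustrative assumptions.

```python
def edman_cycles(peptide, max_cycles=None):
    """Yield (cycle number, released residue, remaining peptide) for each cycle."""
    remaining = peptide
    cycle = 0
    while remaining and (max_cycles is None or cycle < max_cycles):
        cycle += 1
        # Each cycle releases the current N-terminal residue as its PTH derivative.
        released, remaining = remaining[0], remaining[1:]
        yield cycle, released, remaining


for cycle, residue, rest in edman_cycles("GIVEQ", max_cycles=3):
    print(f"cycle {cycle}: PTH-{residue} identified, remaining peptide {rest}")
```

Reading the released residues in cycle order gives the sequence from the N-terminus.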
10. SANGER’S METHOD
• Treat the polypeptide with DNFB (1-fluoro-2,4-dinitrobenzene, Sanger's reagent) to form a DNP derivative of the amino-terminal amino acid.
• Acid hydrolysis of the peptide bonds.
• Extraction of DNP-derivative with organic solvent.
• Identification of DNP-derivative by chromatography and comparison with
standards.
11. DANSYL CHLORIDE METHOD
• Reagent: 1-dimethylaminonaphthalene-5-sulfonyl chloride (dansyl chloride).
• The dansyl polypeptide chain is prepared.
• Acid hydrolysis liberates all the amino acids, with the N-terminal residue released as its dansyl amino acid.
• The amino acids are separated.
• The fluorescence of the dansyl amino acid is detected.
• The identity of the N-terminal amino acid is obtained by comparison with standard dansylated amino
acids.
12. C-TERMINAL SEQUENCING
Add carboxypeptidases to a solution of the protein.
Take samples at regular intervals.
Determine the C-terminal amino acid by analyzing a plot of amino
acid concentration against time: the residue released first is the C-terminal one.
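The time-course analysis above can be sketched as follows. The concentrations are made-up numbers for illustration only, not measurements, and `release_order` with its fixed threshold is a hypothetical helper, not a standard analysis routine.

```python
def release_order(times, conc, threshold):
    """Order residues by the first sampling time at which they reach `threshold`."""
    first_hit = {}
    for residue, values in conc.items():
        for t, c in zip(times, values):
            if c >= threshold:
                first_hit[residue] = t
                break
    return sorted(first_hit, key=first_hit.get)


# Hypothetical data: relative amount of each free amino acid at each sampling time (min).
times = [0, 5, 10, 20, 40]
conc = {
    "Leu": [0.0, 0.4, 0.7, 0.9, 1.0],
    "Tyr": [0.0, 0.1, 0.4, 0.7, 0.9],
    "Ser": [0.0, 0.0, 0.1, 0.4, 0.7],
}
print(release_order(times, conc, threshold=0.5))
```

In this invented example Leu is released first, so it would be read as the C-terminal residue, preceded in the chain by Tyr and then Ser.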
13. DNA SEQUENCING
• The protein sequence can also be determined indirectly from the mRNA or gene that encodes it.
• Design primers from the known amino acid sequence and amplify the gene.
• Sequence the gene and deduce the amino acid sequence of the protein.
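The last step, deducing the amino acid sequence from the sequenced gene, amounts to reading codons with the genetic code. A minimal sketch, assuming the standard genetic code, a coding strand that starts in frame, and a hypothetical function name `translate`:

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a coding DNA (or mRNA) sequence, stopping at the first stop codon."""
    dna = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGGCATTGTGGAACAGTAA"))  # illustrative sequence, not a real gene
```

The sketch ignores introns, alternative genetic codes, and post-translational processing, all of which matter when moving from a real gene to the mature protein.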
14. MASS SPECTROMETRY
It is an important method for accurate mass determination and characterization of proteins.
Basic Principle: This technique studies the effect of ionizing energy on molecules. It depends upon
chemical reactions in the gas phase in which sample molecules are consumed during the formation of ionic
and neutral species.
Components: The instrument consists of three major components:
1. Ion source: For producing gaseous ions from the substance being studied.
2. Analyzer: For resolving the ions into their characteristic mass components according to their mass-to-charge
ratio (m/z).
3. Detector system: For detecting the ions and recording the relative abundance of each of the resolved ionic
species.
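Because the analyzer separates ions by mass-to-charge ratio, it helps to see how m/z follows from a peptide sequence. The sketch below assumes an unmodified peptide ionized by protonation and uses standard monoisotopic residue masses. The function names `peptide_mass` and `mz` are illustrative.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once for the free N- and C-termini
PROTON = 1.007276   # mass of each charge-carrying proton


def peptide_mass(sequence):
    """Monoisotopic mass of the neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in sequence) + WATER


def mz(sequence, charge):
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


for z in (1, 2, 3):
    print(f"GIVEQ, charge +{z}: m/z = {mz('GIVEQ', z):.4f}")
```

Note that Leu and Ile have identical masses, so mass alone cannot distinguish them.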
15. BIOINFORMATICS TOOLS
Bioinformatics:
The collection, classification, storage and analysis of biochemical and biological
information using computers, especially as applied to molecular genetics and genomics.
It is an interdisciplinary field that develops methods and software tools for
understanding biological data.
It combines biology, computer science, information engineering, mathematics and
statistics to analyze and interpret biological data.
17. TYPES
On the basis of the number of sequences being compared, sequence alignment is of two types:
Pairwise alignment
Multiple alignment
18. PAIRWISE SEQUENCE ALIGNMENT
Pairwise sequence alignment compares only two sequences at a time.
a b a c d
a b _ c d
Optimality is based on the alignment SCORE.
A pairwise alignment consists of a series of paired bases (or amino acid residues), one from each sequence.
There are three types of pairs:
1. Matches: the same nucleotide appears in both sequences.
2. Mismatches: different nucleotides are found in the two sequences.
3. Gaps: a base in one sequence and a null base in the other.
19. ALGORITHMS used are the Needleman-Wunsch algorithm (global alignment) and the Smith-Waterman algorithm (local alignment).
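A minimal sketch of Needleman-Wunsch global alignment, applied to the two example sequences from the pairwise alignment slide. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; real tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned s, aligned t) for a global alignment of s and t."""
    n, m = len(s), len(t)
    # F[i][j] = best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("_"); i -= 1   # gap in t
        else:
            a.append("_"); b.append(t[j - 1]); j -= 1   # gap in s
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("abacd", "abcd")
print(score)
print(top)
print(bottom)
```

Smith-Waterman differs mainly in clamping cell scores at zero and starting the traceback from the highest-scoring cell, which yields the best local rather than global alignment.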
BLAST (Basic Local Alignment Search Tool)
BLAST encompasses many different implementations and enhancements to a search algorithm that finds
“high-scoring segment pairs” (HSPs), i.e. local sequence alignments, between a query and the sequences in a database.
It is a fast way to find similar sequences.
It is a heuristic, so it is not the most sensitive way to search.
It is one of the most commonly used tools in bioinformatics.
20. BLAST STEPS
Seeding: Prepare a list of short, fixed-length segments (words) from the query.
Searching: Find highly similar or exact matches in the database for each word.
Extension: Extend each match in both directions into a longer alignment.
Evaluation: Evaluate the results using E-values.
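The seeding, searching and extension steps can be illustrated with a toy seed-and-extend routine. This is not BLAST itself: it assumes exact word matches only, ungapped extension with a simple drop-off rule, and made-up scoring values, and it leaves out E-value statistics entirely. The names `extend` and `seed_and_extend` are hypothetical.

```python
def extend(query, subject, i, j, step, match, mismatch, drop):
    """Ungapped extension from (i, j) in direction `step` (+1 or -1).

    Returns (best score gained, number of positions kept)."""
    score = best = best_len = k = 0
    while 0 <= i + k * step < len(query) and 0 <= j + k * step < len(subject):
        same = query[i + k * step] == subject[j + k * step]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > drop:
            break  # score fell too far below the best seen, stop extending
    return best, best_len


def seed_and_extend(query, subject, w=3, match=1, mismatch=-1, drop=2):
    """Return sorted (score, query start, subject start, length) tuples."""
    # Seeding: index every word of length w in the query.
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    hsps = set()
    # Searching: find exact occurrences of those words in the subject.
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            # Extension: grow the hit to the right and to the left.
            right, r_len = extend(query, subject, i + w, j + w, +1, match, mismatch, drop)
            left, l_len = extend(query, subject, i - 1, j - 1, -1, match, mismatch, drop)
            hsps.add((w * match + left + right, i - l_len, j - l_len, w + l_len + r_len))
    return sorted(hsps, reverse=True)


for score, q_start, s_start, length in seed_and_extend("ACGTTGCA", "TTACGTAGCATT"):
    print(score, q_start, s_start, length)
```

Several seeds on the same diagonal extend to the same segment pair, which is why the results are collected in a set before sorting.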
21. MULTIPLE SEQUENCE ALIGNMENT
Multiple Sequence Alignment (MSA) can be seen as a generalization of Pairwise Sequence Alignment. Instead of
aligning just two sequences, three or more sequences are aligned simultaneously.
a b a c d
a b _ c d
x b a c e
MSA is used for:
a. Detection of conserved domains in a group of genes or proteins.
b. Construction of a phylogenetic tree.
c. Prediction of protein structure.
d. Determination of consensus sequences.
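Two of the uses listed above, finding conserved positions and determining a consensus sequence, can be read directly off the columns of an alignment. A minimal sketch using the three-sequence example from this slide. The function names and the simple majority rule are illustrative assumptions.

```python
from collections import Counter


def consensus(alignment, gap="_"):
    """Most common non-gap symbol in each column of the alignment."""
    result = []
    for column in zip(*alignment):
        counts = Counter(symbol for symbol in column if symbol != gap)
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


def conserved_columns(alignment):
    """Indices of columns in which every sequence has the same symbol."""
    return [i for i, column in enumerate(zip(*alignment)) if len(set(column)) == 1]


alignment = ["abacd", "ab_cd", "xbace"]
print(consensus(alignment))
print(conserved_columns(alignment))
```

Real consensus methods usually weight sequences and report ambiguity when no symbol clearly dominates a column.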
22. CLUSTAL
A popular heuristic algorithm is CLUSTAL, by Des Higgins and Paul Sharp (1988).
CLUSTAL makes a global multiple alignment using a “progressive alignment”
approach.
It first computes all pairwise alignments and calculates the sequence similarity between
pairs.
These similarities are used to build a rough guide tree.
The sequences are then aligned progressively, following the order given by the guide tree.
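The first two stages (pairwise similarities, then a rough guide tree) can be sketched as below. This is not the CLUSTAL implementation: it assumes a plain global alignment score as the similarity measure and a simple average-linkage clustering swapped in for illustration. The names `nw_score` and `guide_tree` are hypothetical.

```python
from itertools import combinations


def nw_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score of s and t (score only, no traceback)."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Build a nested-tuple guide tree from a dict of name -> sequence."""
    # Stage 1: similarity for every pair of sequences.
    sim = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]]) for p in combinations(seqs, 2)}
    # Stage 2: repeatedly join the two most similar clusters (average linkage).
    clusters = {name: [name] for name in seqs}  # tree node -> member names

    def average_similarity(a, b):
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: average_similarity(*p))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))


print(guide_tree({"s1": "abacd", "s2": "abcd", "s3": "xbace"}))
```

In a progressive aligner, the most similar pair at the tips of the tree is aligned first, and more distant sequences are added to that alignment in tree order.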
23. BASIC INFORMATION COMES
FROM SEQUENCE
One sequence: gives some information, e.g. amino acid properties.
More than one sequence: gives more information on conserved residues, fold and
function.
Multiple alignments of related sequences: can be used to build up consensus
sequences of known families, domains, motifs or sites.
Sequence alignments can give information on loops, families and
function from conserved regions.
24. APPLICATIONS OF PROTEIN
SEQUENCING
Recombinant protein synthesis.
Drug production.
Antibiotic production.
Functional genomics.
Determination of protein folding patterns.
It plays a vital role in bioinformatics and proteomics.
Used for the prediction of the final structure, function and location of a protein.
To find the location of the gene coding for that protein.
Study of genetic diseases.
Identification of sequence differences and variations such as point mutations.
Revealing the evolution and genetic diversity of sequence and organisms.