PREPARED FOR: Dr. Md. Khademul Islam (Course Teacher)
PREPARED BY: Naima Thahsin
ID: 13376001
Course: BTC 509: Genomics (Bioinformatics)
PRACTICAL NOTEBOOK ON: BIOINFORMATICS
CHAPTER CONTENTS
Chapter 1: DNA Sequence Analysis
1.1 General
1.2 Finding protein coding regions (tools: GeneMark, GENSCAN)
1.3 Prediction of Promoters (tools: SoftBerry, Promoter 2.0)
1.4 Detection of Tandem Repeat (tool: Tandem Repeats Finder)
1.5 Masking interspersed repeats (tool: RepeatMasker)
1.6 Finding UTR location (tool: UTRScan)
1.7 Searching CpG Islands (tool: CpG Islands)
1.8 Predicting Transcription Factor Binding Sites (tool: TFSEARCH)
1.9 Designing PCR Primer and Calculating Standard Properties (tools: Primer3Plus, OligoCalc)
1.10 Restriction Mapping (tool: BioTools)
Chapter 2: Phylogenetic Relation Analysis
2.1 General
2.2 Sequence alignment (tools: Clustal Omega, T-Coffee)
2.3 Constructing phylogenetic tree (tool: MEGA)
Chapter 3: Protein Sequence Analysis
3.1 General
3.2 Primary Structure Analysis (tool: ProtParam)
3.3 Finding cleavage sites (tool: PeptideCutter)
3.4 Computing profile produced by any amino acid scale (tool: ProtScale)
3.5 Predicting post-translational modifications (tool: ScanProsite)
3.6 Predicting functional domain (tool: InterProScan)
3.7 Predicting secondary structure (tool: PSIPRED)
3.8 Retrieving 3D structure of a protein from PDB (tool: Protein Data Bank (PDB))
1. DNA Sequence Analysis
1.1 General
A gene is the molecular unit of heredity of a living organism. Genes hold the information to
build and maintain an organism's cells and pass genetic traits to offspring.
Basically, a gene is a sequence of nucleic acids (DNA or, in the case of certain viruses, RNA).
The vast majority of living organisms encode their genes in long strands of DNA
(deoxyribonucleic acid). Most DNA molecules are double-stranded helices, consisting of two
long biopolymers made of simpler units called nucleotides. Each nucleotide is composed of
a nucleobase (guanine, adenine, thymine, or cytosine), recorded using the letters G, A, T,
and C, and a backbone made of alternating sugars (deoxyribose) and phosphate
groups (related to phosphoric acid), with the nucleobases (G, A, T, C) attached to the sugars.
The two strands of DNA run in opposite directions to each other and are therefore anti-
parallel (a strand running 5'-3' pairs with a complementary strand running 3'-5').
In biological systems, nucleic acids contain information which is used by a living cell to
construct specific proteins. Genes that encode proteins are composed of a series of three-
nucleotide sequences called codons, which serve as the words in the genetic language. Each
codon corresponds to a single amino acid, and there is a specific genetic code by which each
possible combination of three bases corresponds to a specific amino acid. However, a
significant portion of DNA (more than 98% for humans) is non-coding, meaning that these
sections do not serve a function of encoding proteins.
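As a minimal sketch of how codons map to amino acids, the Python snippet below builds the
standard genetic code table and translates a short DNA string codon by codon. The function
name translate and the example sequence are illustrative assumptions and are not part of any
tool used in this notebook.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA string in frame 1 until the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # ATG (Met), GCC (Ala), AAG (Lys), TAA (stop)
    print(translate("ATGGCCAAGTAA"))
```

Each of the 64 possible codons appears exactly once in the table; three of them (TAA, TAG, TGA)
are stop codons, marked here with an asterisk.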
All genes have regulatory regions in addition to regions that explicitly code for a protein or
RNA product. A regulatory region shared by almost all genes is known as the promoter,
which provides a position that is recognized by the transcription machinery when a gene is
about to be transcribed and expressed. Other possible regulatory regions include enhancers,
which can compensate for a weak promoter. Most regulatory regions are "upstream"—that
is, before or toward the 5' end of the transcription initiation site. Eukaryotic promoter
regions are much more complex and difficult to identify than prokaryotic promoters.
In bioinformatics, the term genetic sequence analysis refers to the process of subjecting a
DNA sequence to any of a wide range of analytical methods to understand its features,
function, structure, or evolution. Methodologies used include sequence alignment, searches
against biological databases, and others.
1.2 Finding protein coding regions in a DNA sequence
Protein coding genes have different structures in microbes and multicellular organisms. In
microbes, each protein is encoded by a single continuous DNA segment, running from a start
codon to a stop codon, called an open reading frame (ORF). In animal and plant genes, proteins
are encoded in several pieces called exons, separated by noncoding segments called introns.
Many websites provide tools for finding ORFs or coding regions.
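The idea of an ORF can be illustrated with a naive scan for ATG...stop stretches in all six
reading frames. This is only a sketch under simple assumptions (ATG as the only start codon,
a hypothetical min_codons length filter); it is not the statistical method used by GeneMark
or GENSCAN.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def orfs_in_strand(dna, min_codons):
    """Return (frame, start, end) of ATG...stop ORFs; 0-based, end exclusive."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:  # length includes the stop
                    found.append((frame, start, i + 3))
                start = None
    return found


def find_orfs(dna, min_codons=30):
    """Scan both strands; minus-strand coordinates refer to the reverse complement."""
    dna = dna.upper()
    plus = [("+", f, s, e) for f, s, e in orfs_in_strand(dna, min_codons)]
    minus = [("-", f, s, e)
             for f, s, e in orfs_in_strand(reverse_complement(dna), min_codons)]
    return plus + minus


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC", min_codons=3):
        print(orf)
```

Real gene finders add coding statistics on top of this kind of scan, because long ORFs also
occur by chance in non-coding DNA.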
a) GeneMark
GeneMark is a family of ab initio gene prediction programs developed at the Georgia
Institute of Technology in Atlanta. GeneMark, developed in 1993, was the first gene finding
method recognized as an efficient and accurate tool for genome projects.
The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of
protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding
DNA. Parameters of the models are estimated from training sets of sequences of known
type. The major step of the algorithm computes the a posteriori probability that a sequence
fragment carries genetic code in one of six possible frames (including three frames on the
complementary DNA strand) or is "non-coding".
Procedure
Go to the homepage of GeneMark, http://exon.gatech.edu/genemark
Click on “GeneMark” on the right panel.
Choose the appropriate model from the given options (e.g. Models for prokaryotes)
Paste the sequence to be checked, or upload the sequence file
Change the parameters if needed
Click on the “Start GeneMark” button.
Result Interpretation
The result reported the G+C content (54.78%) and 3 possible coding sequences (CDS), with
their strand, left (start) and right (stop) end positions and length, together with the
protein sequences translated from the predicted coding sequences.
b) GENSCAN
GENSCAN is a program that identifies complete gene structures in genomic DNA. It is based
on a generalized hidden Markov model (GHMM) and can be used to predict the locations of
genes and their exon-intron boundaries in genomic sequences from a variety of organisms. It
is a eukaryotic ab initio gene finder that has achieved notable success. The GENSCAN web
server is hosted at MIT.
Procedure
Go to GENSCAN home page through the link, http://genes.mit.edu/GENSCAN.html
Paste the nucleotide sequence of interest
Click on the 'Run GENSCAN' button
Result Interpretation
The GENSCAN result provided the following information on the submitted sequence:
G+C content: 41.22%
The strand, start position, end position, length, reading frame and exon score
of the initial, internal and terminal exons and of the poly-A signal
The predicted peptide sequence
The suboptimal exon cutoff value was set at 1.00. The exon scores in the result were all
above the cutoff value, so the prediction can be considered good.
1.3 Prediction of Promoters
A promoter is a region of DNA that initiates transcription of a particular gene. Promoters are
located near the genes they transcribe, on the same strand and upstream on the DNA.
Promoters can be about 100–1000 base pairs long.
For the transcription to take place, the enzyme that synthesizes RNA, known as RNA
polymerase, must attach to the DNA near a gene. Promoters contain specific DNA
sequences and response elements that provide a secure initial binding site for RNA
polymerase and for proteins called transcription factors that recruit RNA polymerase. These
transcription factors have specific activator or repressor sequences of corresponding
nucleotides that attach to specific promoters and regulate gene expression.
a) SoftBerry
With the BPROM program from SoftBerry, bacterial promoters can be recognized with about 80%
accuracy and specificity. In bacteria, the promoter contains two short sequence elements
located approximately 10 and 35 nucleotides upstream of the transcription start site (the -10
and -35 boxes).
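A rough way to see what the -10 and -35 boxes look like in a sequence is to scan for the
textbook E. coli sigma-70 consensus hexamers. The sketch below assumes the consensus TTGACA
(-35) and TATAAT (-10), a spacer of 16 to 18 bp and at most one mismatch per box; these values
and the function names are assumptions for illustration, and BPROM itself uses a more
elaborate scoring scheme.

```python
MINUS_35 = "TTGACA"  # assumed textbook consensus for the -35 box
MINUS_10 = "TATAAT"  # assumed textbook consensus for the -10 box


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_promoter_boxes(dna, max_mismatch=1, spacers=range(16, 19)):
    """Return (pos35, pos10, spacer) tuples with 0-based box start positions."""
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        if mismatches(dna[i:i + 6], MINUS_35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap  # start of the candidate -10 box
            box10 = dna[j:j + 6]
            if len(box10) == 6 and mismatches(box10, MINUS_10) <= max_mismatch:
                hits.append((i, j, gap))
    return hits


if __name__ == "__main__":
    example = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGG"
    print(scan_promoter_boxes(example))
```

Such a consensus scan produces many false positives on real genomes, which is why dedicated
tools combine box scores with other sequence features.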
Procedure
Go to SoftBerry home page, http://www.softberry.com
From left panel select ‘OPERON AND GENE FINDING IN BACTERIA’ and click on ‘BPROM’
Paste the sequence of interest
Click on the ‘PROCESS’ button
Result Interpretation
In the BPROM program the threshold level for promoters is 0.20. The scores reported for the
-10 and -35 boxes were 25 and 41, respectively, both of which were above the threshold level.
So, the prediction was quite good. The result also gave the positions of the -10 and -35
boxes as 154 and 134, respectively.
The result also listed predicted transcription factor binding sites for rpoS17, ihf, glpR,
crp and rpoD19, with the sequence, position and score of each site.
b) Promoter 2.0 Prediction Server
Promoter 2.0 predicts transcription start sites of vertebrate Pol II promoters in DNA
sequences. It has been developed as an evolution of simulated transcription factors that
interact with sequences in promoter regions. It builds on principles that are common to
neural networks and genetic algorithms.
Procedure
Go to Promoter 2.0 home page, http://www.cbs.dtu.dk/services/Promoter/
Paste the nucleotide sequence of interest
Click on the 'Submit' button
Result Interpretation
According to the result, the transcription start site was predicted at position 800.
Promoter 2.0 assigns each predicted site a score that is read against its score table. For
the provided nucleotide sequence, the score was 0.592, which corresponds to a marginal
prediction.
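The score interpretation can be written as a small lookup. The category boundaries below
(0.5, 0.8 and 1.0) are recalled from the Promoter 2.0 server documentation and should be
treated as an assumption; only the "marginal prediction" label for 0.592 is confirmed by the
result above. The function name is hypothetical.

```python
def classify_promoter_score(score):
    """Map a Promoter 2.0 score to a likelihood label (assumed thresholds)."""
    if score < 0.5:
        return "ignored"
    if score < 0.8:
        return "marginal prediction"
    if score < 1.0:
        return "medium likely prediction"
    return "highly likely prediction"


if __name__ == "__main__":
    print(classify_promoter_score(0.592))
```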
1.4 Detection of Tandem Repeat
Tandem repeats occur in DNA when a pattern of two or more nucleotides is repeated and
the repetitions are directly adjacent to each other. When between 10 and 60 nucleotides
are repeated, it is called a minisatellite. Those with fewer are known as microsatellites or
short tandem repeats. Tandem repeat describes a pattern that helps determine an
individual's inherited traits. Tandem repeats can be very useful in determining parentage.
Tandem Repeats Finder
Tandem Repeats Finder is a program to locate and display tandem repeats in DNA
sequences. In order to use the program, the user submits a sequence in FASTA format. The
program is very fast, analyzing sequences on the order of 0.5 Mb in just a few seconds.
Submitted sequences may be of arbitrary length. Repeats with pattern size in the range
from 1 to 2000 bases are detected.
Procedure
Go to Tandem Repeats Finder home page, http://tandem.bu.edu/trf/trf.html
Click on ‘Submit a Sequence for Analysis’
Select the option ‘Basic’ to use default parameters
Choose the option ‘cut and paste sequence’
Paste the sequence to the box provided
Click on the ‘Submit sequence’ button
Result Interpretation
The result indicates that:
1 repeat was found in the given nucleotide sequence.
The repeat spans positions (indices) 126-186.
The consensus size was 4 and the consensus pattern was “GATA”.
The score was 104, which was quite good.
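To make the notion of a tandem repeat concrete, the sketch below finds exact, directly
adjacent copies of a short motif such as GATA. It is a simplified illustration with assumed
parameters (min_unit, max_unit, min_copies); Tandem Repeats Finder itself also tolerates
mismatches and indels and scores repeats by alignment, which this code does not do.

```python
def find_tandem_repeats(dna, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, end, motif, copies) for exact tandem repeats, 1-based inclusive."""
    dna = dna.upper()
    hits = []
    i = 0
    while i < len(dna):
        best = None
        for unit in range(min_unit, max_unit + 1):
            motif = dna[i:i + unit]
            if len(motif) < unit:  # ran off the end of the sequence
                break
            copies = 1
            # Count how many times the motif repeats back to back.
            while dna[i + copies * unit:i + (copies + 1) * unit] == motif:
                copies += 1
            if copies >= min_copies:
                if best is None or copies * unit > best[1] * best[2]:
                    best = (motif, unit, copies)
        if best:
            motif, unit, copies = best
            hits.append((i + 1, i + unit * copies, motif, copies))
            i += unit * copies  # continue after the repeat
        else:
            i += 1
    return hits


if __name__ == "__main__":
    print(find_tandem_repeats("CC" + "GATA" * 5 + "TT"))
```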
1.5 Masking interspersed repeats in a sequence
In the mid-1960s scientists discovered that many genomes contain stretches of highly
repetitive DNA sequences. These sequences were later characterized and placed into several
categories, including Simple Repeats, Tandem Repeats, Segmental Duplications and Interspersed
Repeats. Interspersed repetitive DNA is found in all eukaryotic genomes and comprises:
Processed Pseudogenes,
Retrotranscripts,
SINEs,
DNA Transposons,
Retrovirus Retrotransposons and
Non-Retrovirus Retrotransposons (LINEs)
Currently, up to 50% of the human genome is known to be repetitive, and this number is
expected to increase as detection methods improve.
RepeatMasker
RepeatMasker is a program that screens DNA sequences for interspersed repeats and low
complexity DNA sequences. The output of the program is a detailed annotation of the
repeats that are present in the query sequence as well as a modified version of the query
sequence in which all the annotated repeats have been masked (default: replaced by Ns).
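The masking step itself is simple once the repeat coordinates are known. The sketch below
replaces given intervals with N; the function name and the 1-based inclusive coordinate
convention are assumptions for illustration, and detecting the repeats (the hard part that
RepeatMasker performs against a repeat library) is not attempted here.

```python
def mask_intervals(dna, intervals, mask_char="N"):
    """Replace each 1-based inclusive (start, end) interval with mask_char."""
    chars = list(dna)
    for start, end in intervals:
        for k in range(start - 1, min(end, len(chars))):
            chars[k] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    # Mask positions 3 to 6 of a ten-base example sequence.
    print(mask_intervals("ACGTACGTAC", [(3, 6)]))
```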
Procedure
Go to RepeatMasker home page through the link, http://www.repeatmasker.org/
Select the option 'RepeatMasking' from the left panel
Paste the nucleotide sequence to the box provided
Click on the 'Submit Sequence' button
Result Interpretation
In the analyzed nucleotide sequence only one interspersed repeat was found: a SINE element
51 base pairs long. The repetitive sequence was detected and masked.
1.6 Finding UTR location
In molecular genetics, an untranslated region (or UTR) refers to either of two sections (5'
UTR or 3'-UTR), one on each side of a coding sequence on a strand of mRNA.
The five prime untranslated region (5' UTR) (also known as a Leader Sequence or Leader
RNA) is the region of an mRNA that is directly upstream from the initiation codon. This
region is important for the regulation of translation of a transcript.
On the other hand, the three prime untranslated region (3'-UTR) is the section of messenger
RNA (mRNA) that immediately follows the translation termination codon. The 3'-UTR often
contains regulatory regions that influence post-transcriptional gene expression. Regulatory
regions within the 3'-untranslated region can influence polyadenylation, translation
efficiency, localization, and stability of the mRNA.
UTRScan
UTRscan is a pattern matcher which searches protein or nucleotide (DNA, RNA, tRNA)
sequences in order to find UTR motifs. It is able to find, in a given sequence, motifs that
characterize 3'UTR and 5'UTR sequences. Such motifs are defined in the UTRSite Database, a
collection of functional sequence patterns located in the 5'- or 3'-UTR sequences.
The UTRsite entries describe the various regulatory elements present in UTR regions
whose functional role has been established experimentally. The UTRsite database could
prove very useful for automatic annotation of anonymous sequences generated by
sequencing projects as well as for finding previously undetected signals in known gene
sequences.
Procedure
Go to UTRScan home page through the link, http://itbtools.ba.itb.cnr.it/
Paste the nucleotide sequence in FASTA format
Insert a valid email address
Click on the 'Submit' button
Result Interpretation
The UTRScan program found the following UTR motifs in the provided sequence:
IRES Internal Ribosome Entry Site
K-B K-Box
uORF Upstream Open Reading Frame
MBE Musashi binding element
A total of 9 matches for 4 signals were found in the sequence. The position and sequence of
the UTR motifs were also detected by UTRScan.
1.7 Search for CpG Islands
In genetics, CpG islands or CG islands (CGI) are genomic regions of at least 200 bp that
contain a high frequency of CpG sites. The "p" in CpG refers to the phosphodiester bond
between the cytosine and the guanine, which indicates that the C and the G are next to each
other in sequence, regardless of whether the molecule is single- or double-stranded. In a
CpG site, both C and G are found on the same strand of DNA or RNA and are connected by a
phosphodiester bond.
CpG Islands
CpG Islands reports potential CpG island regions using the method described by
Gardiner-Garden and Frommer (1987). The calculation is performed using a 200 bp window
moving across the sequence at 1 bp intervals.
CpG islands are defined as sequence ranges where the Obs/Exp value (observed over expected
number of CpG dimers) is greater than 0.6 and the GC content is greater than 50%. The
expected number of CpG dimers in a window is calculated as the number of 'C's in the window
multiplied by the number of 'G's in the window, divided by the window length.
CpG islands are often found in the 5' regions of vertebrate genes, so this program
can be used to highlight potential genes in genomic sequences.
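The window calculation described above can be sketched directly in Python. The code follows
the stated definition (200 bp window, 1 bp step, Obs/Exp greater than 0.6, GC content greater
than 50%); the function name and the output tuple layout are assumptions, and this is not
the source code of the CpG Islands tool.

```python
def cpg_windows(dna, window=200, min_obs_exp=0.6, min_gc=50.0):
    """Return (start, end, obs_exp, gc_percent) for windows passing both cutoffs.

    Positions are 1-based and inclusive.
    """
    dna = dna.upper()
    hits = []
    for start in range(0, len(dna) - window + 1):
        w = dna[start:start + window]
        c_count = w.count("C")
        g_count = w.count("G")
        observed = w.count("CG")                 # observed CpG dimers
        expected = c_count * g_count / window    # expected CpG dimers
        gc_percent = 100.0 * (c_count + g_count) / window
        if expected == 0:
            continue  # no C or no G in this window, ratio undefined
        obs_exp = observed / expected
        if obs_exp > min_obs_exp and gc_percent > min_gc:
            hits.append((start + 1, start + window,
                         round(obs_exp, 2), round(gc_percent, 1)))
    return hits


if __name__ == "__main__":
    print(cpg_windows("CG" * 100))
    print(cpg_windows("AT" * 100))
```

For the artificial sequence made of 100 CG dimers, the window holds 100 observed CpGs against
100 x 100 / 200 = 50 expected, so Obs/Exp is 2.0 and the GC content is 100%.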
Procedure
Go to CpG Islands homepage, http://www.bioinformatics.org/sms2/cpg_islands.html
Paste the sequence of interest in FASTA format
Click on 'Submit' button
Result Interpretation
The GC content of the reported windows ranged from 54.50% to 64% in the given sequence,
which is greater than the cutoff value (50%).
1.8 Prediction of Transcription Factor Binding Sites
In molecular biology and genetics, a transcription factor (sometimes called a sequence-
specific DNA-binding factor) is a protein that binds to specific DNA sequences, thereby
controlling the flow (or transcription) of genetic information from DNA to messenger RNA.
Transcription factors perform this function alone or with other proteins in a complex, by
promoting (as an activator), or blocking (as a repressor) the recruitment of RNA polymerase
(the enzyme that performs the transcription of genetic information from DNA to RNA) to
specific genes.
A defining feature of transcription factors is that they contain one or more DNA-binding
domains (DBDs), which attach to specific sequences of DNA adjacent to the genes that they
regulate.
TFSEARCH
The TFSEARCH program was written by Yutaka Akiyama (Kyoto University, currently at RWCP) in
1995. TFSEARCH searches for sequence fragments that correlate highly with the TFMATRIX
transcription factor binding site profiles in the 'TRANSFAC' databases developed at
GBF-Braunschweig, Germany.
Procedure
Go to TFSEARCH home page through the link, http://www.cbrc.jp/research/db/TFSEARCH.html
Enter any label for the sequence into top field
Paste the nucleotide sequence in FASTA format into second field
Set 'Threshold score' if necessary
Click on 'Exec' button to submit the query sequence to the server
Result Interpretation
The given sequence was analyzed for transcription factor binding sites. A total of 12
high-scoring sites were found in the sequence, all of them above the threshold score (85.0).
The maximum score was 95.4 and the minimum score was 85.3. The sequence was predicted to
contain binding sites for the following transcription factors:
HSF (heat shock factor 1)
HSF2 (heat shock factor 2)
ADR1 (alcohol dehydrogenase regulator 1)
GATA-1 (globin transcription factor 1)
GATA-2 (globin transcription factor 2)
1.9 Designing PCR Primer and Calculating Standard Properties
The polymerase chain reaction, usually referred to as PCR, is an extremely powerful
procedure that allows the amplification of a selected DNA sequence in a genome a
million-fold or more in vitro, without the use of living cells during the cloning process. In
this technique, the known part of the DNA is used to design two synthetic DNA
oligonucleotides, one complementary to each strand of the DNA double helix and lying on
opposite sides of the region to be amplified. These oligonucleotides serve as primers for in
vitro DNA synthesis, which is catalyzed by DNA polymerase. Primers are required for DNA
replication because the enzymes that catalyze this process, DNA polymerases, can only add
new nucleotides to an existing strand of DNA.
a) Primer3Plus
The website of the University of Massachusetts Medical School (biotools.umassmed.edu)
provides a link to Primer3Plus, a very complete and easy-to-use tool for primer design.
Primer3Plus picks primers for PCR reactions according to the conditions specified by the
user. It considers factors such as melting temperature, concentrations of various solutions
in the PCR reaction, primer bending and folding, and many other conditions when choosing the
optimal pair of primers for a reaction. All of these conditions are user-specifiable and can
vary from reaction to reaction.
Procedure
Go to Bio Tools home page through the link, http://biotools.umassmed.edu/cgi-bin/primer3plus/primer3plus.cgi
Select 'Primer3Plus' from the 'DNA Sequence Analysis' tools
Paste the nucleotide sequence and change the parameters as necessary
Click on 'Pick Primers' button to submit the query sequence to the server
Result Interpretation
Primer3Plus returned 5 primer pairs for the given nucleotide sequence. Each pair (left
and right primer) has a length, melting temperature and GC content that fit the
provided settings. The first pair consists of two 20-nt primers:
Left Primer 1: GCCTCCTAATTCGGGCAGAA
Right Primer 1: AAGGATGGGGTCTCCTCCTC
This primer pair amplifies a 590 bp region of the nucleotide sequence.
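Basic properties of the two primers above can be checked with a few lines of Python. The
sketch computes length, GC content and a simple melting temperature estimate, assuming the
Wallace rule for oligos shorter than 14 nt and the basic formula Tm = 64.9 + 41 x (G + C -
16.4) / N for longer ones. These formulas are an assumption about a "basic Tm" estimate and
will differ from the nearest-neighbor values that Primer3Plus reports.

```python
def primer_properties(primer):
    """Return (length, GC percent, basic Tm in degrees C) for a DNA primer."""
    p = primer.upper()
    n = len(p)
    gc = p.count("G") + p.count("C")
    gc_percent = 100.0 * gc / n
    if n < 14:
        tm = 2 * (n - gc) + 4 * gc              # Wallace rule for short oligos
    else:
        tm = 64.9 + 41.0 * (gc - 16.4) / n      # basic formula for longer oligos
    return n, round(gc_percent, 1), round(tm, 1)


if __name__ == "__main__":
    for name, seq in [("Left Primer 1", "GCCTCCTAATTCGGGCAGAA"),
                      ("Right Primer 1", "AAGGATGGGGTCTCCTCCTC")]:
        print(name, primer_properties(seq))
```

With these assumptions the left primer has 55% GC and the right primer 60% GC, and the two
basic Tm estimates lie within about 2 degrees of each other, which is desirable for a primer
pair.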
b) OligoCalc
OligoCalc is a web-accessible, client-based computational engine for reporting DNA and RNA
single-stranded and double-stranded properties, including molecular weight, solution
concentration, melting temperature, estimated absorbance coefficients, inter-molecular
self-complementarity estimation and intra-molecular hairpin loop formation. OligoCalc has a
familiar ‘calculator’ look and feel, making it readily understandable and usable.
Procedure
Go to Oligo Calc home page through the link, http://www.basic.northwestern.edu/biotools/oligocalc.html
Paste the oligonucleotide sequence of primer
Click anywhere to get the properties of the given sequence
1.10 Restriction Mapping
A restriction map is a map of known restriction sites within a sequence of DNA. Restriction
mapping requires the use of restriction enzymes. Restriction enzymes are enzymes that cut
DNA at specific recognition sequences called "sites". They probably evolved as a bacterial
defense against DNA bacteriophages: DNA invading a bacterial cell defended by these
enzymes is digested into small, non-functional pieces. The name "restriction enzyme"
comes from the enzyme's function of restricting access to the cell.
There are hundreds of restriction enzymes that have been isolated and each one recognizes
its own specific nucleotide sequence. Sites for each restriction enzyme are distributed
randomly throughout a particular DNA stretch. Digestion of DNA by restriction enzymes is
very reproducible; every time a specific piece of DNA is cut by a specific enzyme, the same
pattern of digestion will occur. Restriction enzymes are commercially available and their use
has made manipulating DNA very easy.
BioTools-Restriction mapping tool
One approach to constructing a restriction map of a DNA molecule is to sequence the whole
molecule and run the sequence through a computer program that finds the recognition sites
present for every known restriction enzyme. ‘BioTools’ provides an application, the
Restriction mapping tool, which allows the user to supply both a DNA sequence and
(optionally) their own file of restriction enzymes or other IUPAC patterns in GCG format for
restriction enzyme mapping and analysis, using Harry Mangalam's tacg 4.3 program as the
analysis engine.
Procedure
Go to BioTools home page (http://biotools.umassmed.edu/)
Select 'Restriction mapping tool' from the panel
Paste the DNA sequence in the 'Sequence Entry' box
Select restriction enzymes from the list
Change other parameters as necessary
Click on 'Submit Sequence to WWWtacg' button
Result Interpretation
The Restriction mapping tool of the ‘BioTools’ server analyzed the given nucleotide sequence
and reported 6 hits for the 3 selected restriction enzymes (EcoRI, HindIII and BamHI): 3 hits
for BamHI, 2 for HindIII and 1 for EcoRI. These enzymes recognize and cut at the following
sites of the nucleotide sequence:
Restriction Enzyme   Site     Position
BamHI                GGATCC   1240, 1865, 2085
HindIII              AAGCTT   1466, 2115
EcoRI                GAATTC   2064
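Locating recognition sites such as those in the table is a plain string search. The sketch
below looks up the three recognition sequences listed above and reports the 1-based position
of the first base of each site. The function name is hypothetical, the example sequence is
made up, and the reported position is the start of the site rather than the exact cut
position; the tacg program handles many more enzymes and degenerate patterns.

```python
# Recognition sequences taken from the result table above.
ENZYMES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC"}


def find_sites(dna, enzymes=ENZYMES):
    """Return {enzyme: [1-based start positions of its recognition site]}."""
    dna = dna.upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        pos = dna.find(site)
        while pos != -1:
            positions.append(pos + 1)
            pos = dna.find(site, pos + 1)  # allow overlapping matches
        hits[name] = positions
    return hits


if __name__ == "__main__":
    print(find_sites("AAGAATTCTTGGATCCAAGGATCC"))
```

All three recognition sequences are palindromic (each equals its own reverse complement), so
searching one strand is sufficient for these enzymes.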
2. Phylogenetic relation Analysis
2.1 General
Phylogenetics is the study of the evolutionary relationships of living organisms using treelike
diagrams to represent pedigrees of these organisms. Phylogenetics can be studied in various
ways. Molecular data that are in the form of DNA or protein sequences can provide very
useful evolutionary perspectives of existing organisms because, as organisms evolve, the
genetic materials accumulate mutations over time causing phenotypic changes. Through
comparative analysis of these biological molecules from a number of related organisms, the
evolutionary history of the genes or proteins and even the organisms can be revealed.
Usually, similarities and divergence among related biological sequences revealed by
sequence alignment are rationalized and visualized in the context of phylogenetic trees.
Therefore, the study of phylogenetic relationships generally involves sequence alignment
followed by construction of a phylogenetic tree.
2.2 Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA,
or protein to identify regions of similarity that may be a consequence of functional,
structural, or evolutionary relationships between the sequences. Aligned sequences of
nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps
are inserted between the residues so that identical or similar characters are aligned in
successive columns.
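The row-and-column view of an alignment makes it easy to pick out conserved positions. As a
small sketch, the code below takes already aligned sequences (with "-" for gaps) and returns
the columns in which every sequence carries the same residue. The function name and the toy
alignment are assumptions for illustration; producing the alignment itself is the job of
tools such as Clustal Omega or T-Coffee.

```python
def conserved_columns(alignment):
    """Return 1-based column numbers where all rows share one non-gap residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for i in range(length):
        residues = {row[i] for row in alignment}
        if len(residues) == 1 and "-" not in residues:
            columns.append(i + 1)
    return columns


if __name__ == "__main__":
    toy_alignment = ["MKT-LV",
                     "MRTALV",
                     "MKTGLV"]
    print(conserved_columns(toy_alignment))
```

Runs of consecutive conserved columns correspond to the conserved regions that alignment
viewers usually mark with an asterisk.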
Objectives:
To understand the similarities among a group of sequences
To determine conserved regions
To understand the evolutionary relationship among related sequences.
To do so, 10 protein sequences of the small membrane protein from different species of
Coronaviridae were retrieved from NCBI and analyzed with both Clustal Omega and T-Coffee.
The comparison between the results from both tools is given later.
a) Clustal Omega
Clustal Omega is a new multiple sequence alignment program that uses seeded guide trees
and HMM profile-profile techniques to generate alignments. It produces biologically
meaningful multiple sequence alignments of divergent sequences. Evolutionary
relationships can be seen via viewing Cladograms or Phylograms.
Procedure
Go to Clustal Omega home page, http://www.ebi.ac.uk/Tools/msa/clustalo/
Paste the protein sequences retrieved in multi-FASTA format
Click on the 'Submit' button to submit the sequences to the server
b) T-Coffee
T-Coffee (Tree-based Consistency Objective Function For alignment Evaluation) is a multiple
sequence alignment program that uses a progressive approach. It generates a library of
pairwise alignments to guide the multiple sequence alignment. It can also combine multiple
sequence alignments obtained previously, and in the latest versions it can use structural
information from PDB files (3D-Coffee). It has advanced features for evaluating the quality
of the alignments and some capacity for identifying the occurrence of motifs.
Procedure
Go to T-Coffee home page (http://tcoffee.vital-it.ch/apps/atcoffee/index.html)
Select 'T-Coffee' tool from the panel
Paste the protein sequences retrieved in multi-FASTA format
Click on the 'Submit' button to submit the sequences to the server
Result Interpretation and Comparison between results from Clustal Omega and T-Coffee
The sequence alignment was found to be better with T-Coffee than with Clustal Omega. Along
with the aligned sequences, T-Coffee also reports an alignment score for each input
sequence. For the given sequences the following scores were found:
gi|530341189|gb : 47
gi|530802146|gb : 43
gi|148728344|gb : 46
gi|530802593|gb : 39
gi|56807328|ref : 42
gi|126030129|re : 42
gi|211907043|gb : 32
gi|212681391|re : 32
gi|187251957|re : 32
gi|33304216|gb| : 44
cons : 44
T-Coffee exhibited 2 conserved regions, whereas Clustal Omega exhibited 1. The number of
regions with matches was also greater in T-Coffee than in Clustal Omega. However, an
advantage of Clustal Omega is that it provides a tool for building a phylogenetic tree,
which is available if Java is installed.
2.3 Constructing phylogenetic tree
A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the
inferred evolutionary relationships among various biological species or other entities —
their phylogeny — based upon similarities and differences in their physical and/or genetic
characteristics. The taxa joined together in the tree are implied to have descended from a
common ancestor.
MEGA
MEGA (Molecular Evolutionary Genetics Analysis; Windows version 5.2.2) is a software
package that provides tools for both multiple sequence alignment and phylogenetic tree
construction.
Procedure
a) MEGA was downloaded from http://www.megasoftware.net/ and installed in the
Windows 7 environment.
b) 10 protein sequences of the nucleocapsid protein from different species of Coronaviridae
were retrieved from NCBI.
c) The multi-FASTA file containing the protein sequences was opened in MEGA.
The flowchart of the procedure is as follows:
Open MEGA 5.2.2
Open a file in FASTA format
Select the option 'Align'
Select 'Muscle' from upper panel to align protein sequences
Set the parameters as default in settings window and click on 'compute'
Save session in MAS format
Click on 'Phylogeny' option from upper panel and select 'Maximum Likelihood'
Open a file containing protein sequences saved in 'mas' format
Click on the 'Compute' button
Result Interpretation
According to the inferred phylogenetic tree based on protein sequences from different
species of Coronaviridae:
Two broad subgroups (B and C) have descended from a common ancestor A.
In subgroup B, Bulbul coronavirus HKU11 and Munia coronavirus HKU13 are closely related
to each other; together they are related to Beluga whale coronavirus SW1, and all three
descend from the ancestor F. Group F is related to another group, E, which includes two
closely related virus species, Human coronavirus OC43 and Human coronavirus HKU1. Groups F
and E are descendants of D, which is descended from B. Group B also gives rise to an
outgroup, Pipistrellus bat coronavirus HKU5, which is closer to group E than to group F.
In subgroup C, Human coronavirus 229E and Human coronavirus NL63 are closely related to
each other; together they are related to Porcine epidemic diarrhea virus, and all three
descend from the ancestor H. Group H is descended from the ancestor C, which also gives
rise to an outgroup, Transmissible gastroenteritis virus.
3. Protein sequence Analysis
3.1. General
Proteins are among the fundamental components of all living cells and have a wide
range of functions in all living beings. Important functions such as DNA
replication, catalysis of metabolic reactions and transport of molecules from one location
to another are performed with the help of proteins.
The building blocks of proteins are amino acids. Each amino acid contains an amine (-NH2)
group and a carboxylic acid (-COOH) group, as well as a side chain (R group) that is specific
to that amino acid. Twenty standard amino acids occur in proteins, and they differ in their
R groups. In proteins, the amino acids are linked to each other by peptide bonds. A peptide
bond is formed when the carboxyl group of one amino acid is linked to the amino group of
another through a covalent bond.
Proteins differ from one another in their structure, primarily in their sequence of amino
acids. The structure describes the different levels of organization of a protein molecule
and is classified into primary, secondary, tertiary and quaternary structure. The linear
sequence of amino acids in the polypeptide chain is the primary structure of the protein.
Hydrogen bonding between the amide groups of the polypeptide backbone, within one chain or
between chains, forms the secondary structure. Alpha helices and beta sheets are
the two most important secondary structures in proteins. The three-dimensional structure of
a single protein molecule is its tertiary structure. The quaternary structure is formed
by several protein molecules or polypeptide chains.
3.2. Primary Structure Analysis of a Protein
Different tools are available through the ExPASy server to analyze a protein sequence.
ExPASy is the SIB Bioinformatics Resource Portal. It provides access to several scientific
databases and software tools in many areas of life sciences including proteomics, genomics,
phylogeny, systems biology, population genetics, transcriptomics etc.
ProtParam is one of the protein analysis tools available on the ExPASy server and can
be accessed through the link, http://www.expasy.org/tools/protparam.html. It is used for
calculating various physicochemical parameters of a given protein. The protein sequence
is the only input needed to calculate these parameters.
In ProtParam, the protein can be specified as -
UniProtKB/Swiss-Prot accession number,
UniProtKB/TrEMBL accession number,
ID or
Amino acid sequences.
The parameters computed by ProtParam are molecular weight, amino acid
composition, extinction coefficient, estimated half-life, theoretical pI, grand average of
hydropathicity (GRAVY), aliphatic index and instability index.
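To make three of these parameters concrete, the following is a minimal Python sketch of how amino acid composition, GRAVY and the aliphatic index can be derived from a plain one-letter sequence. It is an illustration under stated assumptions, not ProtParam's own code: the function names are hypothetical, GRAVY is taken as the mean Kyte & Doolittle hydropathy value, and the aliphatic index uses the commonly cited formula X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)) with X in mole percent.

```python
from collections import Counter

# Kyte & Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(sequence):
    """Return the mole percentage of each residue type present in the sequence."""
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy value over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def aliphatic_index(sequence):
    """Aliphatic index from the mole percentages of Ala, Val, Ile and Leu."""
    pct = composition(sequence)
    return (pct.get("A", 0.0)
            + 2.9 * pct.get("V", 0.0)
            + 3.9 * (pct.get("I", 0.0) + pct.get("L", 0.0)))
```

The functions expect an upper-case sequence of standard residues without the FASTA header line; any other letter raises a KeyError in gravy.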
Objectives
To compute the various physical and chemical parameters of a protein.
To perform primary structure analysis of proteins.
Procedure
Go to ProtParam home page, http://www.expasy.org/tools/protparam.html
Paste the FASTA sequence of protein of interest
Click on the ‘Compute parameters’ button
ProtParam home page
Paste the FASTA sequence of protein
Result Interpretation
From the result of ProtParam we found that:
The estimated half-life is 30 hours, which indicates that half of the amount of protein
in a cell disappears 30 hours after its synthesis in the cell.
The instability index of the analyzed protein is 37.87, which is less than the cutoff
value (40), so the protein is considered stable.
According to the computed aliphatic index, a large relative volume of the protein is
occupied by amino acids with aliphatic side chains.
The grand average of hydropathicity (GRAVY) of the protein is 0.327. The positive score
indicates that the protein is hydrophobic overall.
3.3 Finding cleavage sites in a given protein sequence
PeptideCutter searches a protein sequence from the SWISS-PROT and/or TrEMBL databases
or a user-entered protein sequence for protease cleavage sites. Single proteases and
chemicals, a selection or the whole list of proteases and chemicals can be used. Most of the
cleavage rules for individual enzymes were deduced from specificity data summed up by Keil
(1992).
The results are available in different output forms: tables of cleavage sites, grouped either
alphabetically by enzyme name or sequentially by amino acid
number, and a map of cleavage sites. In the map, the sequence and the cleavage
sites mapped onto it are grouped in blocks, the size of which can be chosen by the user to
provide a convenient form of print-out.
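As an illustration of what a cleavage rule looks like in practice, here is a short Python sketch for a single enzyme. It assumes the simplified textbook rule for trypsin (cut on the C-terminal side of K or R unless the next residue is P); PeptideCutter's actual trypsin rules include further exceptions, and the function names below are hypothetical.

```python
def trypsin_sites(sequence):
    """Return 1-based residue numbers after which the simplified rule cuts."""
    sites = []
    # The last residue is skipped: there is no bond after the C-terminus.
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    return sites


def digest(sequence):
    """Split the sequence into the peptides produced by a complete digest."""
    peptides = []
    start = 0
    for site in trypsin_sites(sequence):
        peptides.append(sequence[start:site])
        start = site
    peptides.append(sequence[start:])
    return peptides
```

For the sequence AKRPGK this gives a single site after residue 2, because the arginine at position 3 is followed by proline, and the digest yields the peptides AK and RPGK.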
Method
Go to PeptideCutter home page, http://web.expasy.org/peptide_cutter/
Paste the FASTA sequence of protein of interest
Select enzymes and chemicals if necessary
Click on the ‘Perform’ button
Map of cleavage sites
The cleavage sites for a single enzyme, e.g. Trypsin, mapped onto the entered protein
sequence are shown below:
Discussion
With the bioinformatics tool PeptideCutter, we can predict the potential sites at which
proteases or chemicals cleave a given protein sequence.
If we know the cleavage sites of a protein, we can use an enzyme to cut the protein in
specific ways. This can be useful if we are interested in carrying out experiments on a
portion of our protein.
PeptideCutter can also help us in the following ways:
Separating the domains in our protein
Identifying potential post-translational modifications by mass spectrometry
Removing a tag protein when we want to express a fusion protein
Making sure that the protein we are cloning is not sensitive to some endogenous
proteases
3.4 Computing profile produced by any amino acid scale
ProtScale computes and represents (in the form of a two-dimensional plot) the
profile produced by any amino acid scale on a selected protein.
An amino acid scale is defined by a numerical value assigned to each type of amino acid. The
most frequently used scales are hydrophobicity scales, most of which were derived from
experimental studies on partitioning of peptides in apolar and polar solvents, with the goal
of predicting membrane-spanning segments that are highly hydrophobic, and secondary
structure conformational parameter scales. In addition, many other scales exist which are
based on different chemical and physical properties of the amino acids.
ProtScale can be used with 50 predefined scales entered from the literature. The scale
values for the 20 amino acids, as well as a literature reference, are provided on ExPASy for
each of these scales. To generate data for a plot, the protein sequence is scanned with a
sliding window of a given size. At each position, the mean scale value of the amino acids
within the window is calculated, and that value is plotted for the midpoint of the window.
We can set several parameters that control the computation of a scale profile, such as the
window size, the weight variation model, the window edge relative weight value, and scale
normalization.
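The sliding-window calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not ProtScale's implementation: it assumes the Kyte & Doolittle scale, an odd window size, and equal weights for every position in the window (no edge weighting and no normalization). The function name and its arguments are hypothetical.

```python
# Kyte & Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(sequence, scale=KYTE_DOOLITTLE, window=19):
    """Return (position, mean scale value) pairs, one per window midpoint.

    Positions are 1-based residue numbers. The first and last window // 2
    residues get no value because a full window cannot be centred on them.
    """
    if window % 2 == 0 or window < 1 or window > len(sequence):
        raise ValueError("window must be odd and no longer than the sequence")
    half = window // 2
    profile = []
    for centre in range(half, len(sequence) - half):
        residues = sequence[centre - half:centre + half + 1]
        mean = sum(scale[aa] for aa in residues) / window
        profile.append((centre + 1, mean))
    return profile
```

Plotting the second element of each pair against the first reproduces the kind of two-dimensional profile that ProtScale draws.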
Objective
To use a hydrophobicity scale to identify the hydrophobic segments
within the protein sequence.
To predict transmembrane segments in the given protein.
Method
Go to ProtScale home page, http://web.expasy.org/protscale/
Paste the FASTA sequence of the desired protein
Choose an amino acid scale from the list (e.g., Hphob. / Kyte & Doolittle)
Set window size at 19
Normalize scale, if necessary
Click on the ‘Submit’ button
Discussion
Hydrophobicity scales are values that define the relative hydrophobicity of amino acid residues.
The more positive the value, the more hydrophobic are the amino acids located in that
region of the protein. Hydrophobic segments characterize transmembrane proteins.
The desired protein sequence was analyzed using the Kyte & Doolittle (hydrophobicity) scale.
The recommended threshold value when using Kyte and Doolittle is 1.6. From the result,
four regions of the given protein sequence were found above the threshold level. The highest
peak was found at the N-terminus of the sequence, which indicates the presence of a
transmembrane segment and predicts that the protein is secreted.
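Counting the regions above the threshold can also be done programmatically rather than by reading the plot. The sketch below is one possible way to do it, assuming the profile is available as a list of (position, score) pairs in sequence order; the function name and this data layout are assumptions made for illustration.

```python
def regions_above(profile, threshold=1.6):
    """Return (start, end) position pairs of consecutive runs scoring above threshold."""
    regions = []
    start = None      # first position of the run currently being extended
    previous = None   # last position seen inside that run
    for position, score in profile:
        if score > threshold:
            if start is None:
                start = position
            previous = position
        elif start is not None:
            regions.append((start, previous))
            start = None
    if start is not None:
        # Close a run that extends to the end of the profile.
        regions.append((start, previous))
    return regions
```

Each returned pair gives the first and last window midpoint of one candidate hydrophobic segment, so the number of pairs corresponds to the number of regions counted on the plot.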
3.5 Predicting post-translational modifications in protein
Proteins often need to be modified before they become active in the cell. These changes are called
post-translational modifications. They may involve adding sugars, modifying amino acids, or
removing pieces of the newly synthesized protein. If we are studying a new protein, we may
want to know about such modifications. This is also important if we want to clone and express a
human protein in bacteria because, in order to be active, the protein may require some
post-translational modifications that the bacterium itself cannot make.
PROSITE is a database that contains a list of short sequence motifs (sometimes also called
patterns) that experiments have associated with particular biological properties. Many of
these patterns are associated with post-translational modifications. On the ExPASy server
(www.expasy.org), we can compare our protein sequence with the collection of patterns in
PROSITE and find out which modifications our protein is likely to undergo.
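To show what comparing a sequence with a pattern involves, here is a small Python sketch that translates a PROSITE-style pattern into a regular expression and reports every match. It is an illustrative sketch rather than ScanProsite's implementation; it handles only the basic pattern syntax (x, [..], {..}, repeat counts in parentheses, and the < and > anchors), and the function names are hypothetical. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate basic PROSITE pattern syntax into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "any residue except P"; do this before handling repeats.
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "between 2 and 4 of any residue".
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- and C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based start, matched text) for every match, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]
```

For example, scanning MNGSANPTA with N-{P}-[ST]-{P} reports one hit, NGSA at position 2; the second asparagine is followed by proline and is therefore excluded. A match only marks a potential modification site, so it should be read as a prediction.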
Objective
Scan our protein of interest for matches against the PROSITE collection of motifs and
Find out post-translational modifications in that protein.
Methods
Go to ScanProsite home page, http://prosite.expasy.org/scanprosite/
Give UniProtKB accession number
Select ‘Exclude profiles from the scan’
Click on the ‘Start the scan’ button
Result Interpretation & Discussion
The result from ScanProsite shows 6 hits (by 3 distinct patterns) for 3 types of short
sequence motifs, which are predicted to be associated with post-translational modifications.
The sequence motifs are:
Multicopper oxidase 1
FA58C 1 (Coagulation factor 5/8 type C domain)
FA58C 2
PDB structure viewer shows the 3D structure of FA58C 1 and FA58C 2 associated with the A
chain of the protein structure.
3.6 Predicting functional domain in protein sequence
InterPro is a database of protein families, domains and functional sites in which identifiable
features found in known proteins can be applied to new protein sequences in order to
functionally characterize them. The contents of InterPro are based around diagnostic
signatures and the proteins that they significantly match. The signatures consist of models
(simple types, such as regular expressions or more complex ones, such as Hidden Markov
models) which describe protein families, domains or sites. Models are built from the amino
acid sequences of known families or domains and they are subsequently used to search
unknown sequences (such as those arising from novel genome sequencing) in order to
classify them.
InterProScan is a bioinformatics tool that is available in InterPro via a webserver. It provides
a one-stop-shop for automated sequence analysis of both protein and nucleic acid. It offers
the researcher the ability to identify both structural and functional regions of interest and to
quickly characterize a new or novel sequence with considerable confidence.
Objective
A protein domain is a conserved part of a given protein sequence and structure that can
evolve, function, and exist independently of the rest of the protein chain. Domains vary in
length from about 25 amino acids up to 500 amino acids. Here our
objective is to find the functional domains in a given protein sequence.
Method
InterProScan home page
Go to InterProScan home page through the link www.ebi.ac.uk/InterProScan/
Paste the sequence of protein of interest
Click on the 'Submit' button
Result
Discussion
A number of algorithms (14) available in the InterProScan tool were selected to find
functional domains in the provided protein sequence.
According to PRINTS, the protein sequence contains the LEUZIPPRFOS domain, which is a
5-element fingerprint that provides a signature for the leucine zipper and DNA-binding
domains characteristic of the fos oncogenes and fos-related proteins. PFAM, SMART,
PROSITE and PROFILE also confirmed the presence of a leucine zipper domain in the protein
sequence.
The DNA-binding region comprises a number of basic amino acids such as arginine
and lysine.
The 'leucine zipper' is a structure that is believed to mediate the function of several
eukaryotic gene regulatory proteins. The zipper consists of a periodic repetition of
leucine residues at every seventh position, and regions containing them appear to
span 8 turns of alpha-helix. The leucine side chains that extend from one helix
interact with those from a similar helix, hence facilitating dimerisation in the form of
a coiled-coil.
Proteins containing this domain are transcription factors.
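The "leucine at every seventh position" description can be checked directly on a sequence. The Python sketch below looks for four leucines spaced seven residues apart, which corresponds to the PROSITE leucine zipper pattern L-x(6)-L-x(6)-L-x(6)-L. It is only an illustration of the periodicity: it is not one of the InterProScan member algorithms, the function name is hypothetical, and a match on its own does not prove that a functional zipper is present.

```python
import re

# Four leucines with six arbitrary residues between consecutive ones.
# The lookahead lets overlapping repeats be reported as separate hits.
HEPTAD_LEUCINES = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def leucine_heptad_starts(sequence):
    """Return 1-based positions where a run of four heptad-spaced leucines begins."""
    return [m.start() + 1 for m in HEPTAD_LEUCINES.finditer(sequence)]
```

A sequence with five such leucines in a row produces two overlapping hits, one starting at the first leucine and one at the second.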
3.7 Predicting secondary structure of a protein sequence
Protein secondary structure can be described by the hydrogen-bonding pattern of the
peptide backbone of the protein. The most common secondary structures are alpha helices
and beta sheets. Other extended structures such as the polyproline helix and alpha sheet
are rare in native state proteins but are often hypothesized as important protein folding
intermediates. Tight turns and loose, flexible loops link the more "regular" secondary
structure elements. The random coil is not a true secondary structure, but is the class of
conformations that indicate an absence of regular secondary structure.
Accurate secondary-structure prediction is a key element in the prediction of tertiary
structure, in all but the simplest (homology modeling) cases. At present there are several
secondary-structure prediction methods such as PSIPRED, SAM, PORTER, PROF and SABLE.
PSIPRED is a simple and accurate secondary structure prediction method, incorporating two
feed-forward neural networks which perform an analysis on output obtained from PSI-
BLAST (Position Specific Iterated - BLAST). Using a very stringent cross validation method to
evaluate the method's performance, PSIPRED 3.2 achieves an average Q3 score of 81.6%.
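The Q3 score quoted above is the percentage of residues whose three-state assignment (helix H, strand E, coil C) is predicted correctly. A minimal Python sketch of that calculation follows; the function name and the use of plain H/E/C strings for the predicted and observed states are assumptions made for illustration.

```python
def q3_score(predicted, observed):
    """Percentage of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For a prediction HHHEECCC against an observed HHHEEECC, seven of the eight positions agree, giving a Q3 of 87.5%.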
Method
Go to the home page of PsiPred, http://bioinf.cs.ucl.ac.uk/psipred
Choose a prediction method, PSIPRED v3.3 (Predict Secondary Structure)
Paste the sequence of interest
Write the email address and short identifier for submission in the boxes provided
Click on the 'Predict' button and wait for the result
Result Interpretation
From the prediction result obtained from PsiPred it was evident that the secondary
structure of the provided protein sequence consists of alpha helices and coil structures, but
there is no beta sheet. The confidence of prediction was quite good.
3.8 Retrieving 3D structure of a protein from PDB
Protein tertiary structure refers to the three-dimensional structure of a single
protein molecule. The alpha helices and beta pleated sheets are folded into a
compact globular structure. The folding is driven by non-specific hydrophobic
interactions (the burial of hydrophobic residues from water), but the structure is stable only
when the parts of a protein domain are locked into place by specific tertiary interactions,
such as salt bridges, hydrogen bonds, the tight packing of side chains and disulfide
bonds.
The Protein Data Bank (PDB) is a repository for the three-dimensional structural data of
large biological molecules, such as proteins and nucleic acids. The data, typically obtained by
X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists
from around the world, are freely accessible on the Internet. The file formats used by the
PDB are PDB format files and PDBML (XML) files. The structure files may be viewed using
VMD, MDL Chime, Pymol, UCSF Chimera, Rasmol, Swiss-PDB Viewer, StarBiochem, Sirius,
and VisProt3DS. The PDB database is updated weekly.
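Besides being opened in a viewer, a PDB format file can be read directly, because its records use fixed column positions. The Python sketch below extracts the main fields of ATOM records according to the column layout of the PDB format (serial number in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-54). It is a minimal illustration with hypothetical function names, and it ignores HETATM records, alternate locations and insertion codes.

```python
def parse_atom_line(line):
    """Parse one ATOM record using the fixed column layout of the PDB format."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def read_atoms(path):
    """Return the parsed ATOM records of a PDB format file downloaded from the PDB."""
    with open(path) as handle:
        return [parse_atom_line(line) for line in handle if line.startswith("ATOM")]
```

The path passed to read_atoms is whatever local file name the downloaded structure was saved under.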
Procedure
Go to PDB home page, http://www.rcsb.org/pdb/home/home.do
Write the PDB ID of the desired protein sequence
Click to search the protein 3D structure
Result Interpretation
The 3D structure of the selected protein (tumor suppressor protein, TP53) is
a monomer containing alpha helices, beta strands and coils.
The protein contains several motifs and regions, such as:
Interaction with HRMT1L2
Transcription activation (acidic)
Interaction with WWOX
DNA-binding region
Required for interaction with FBXO42
Required for interaction with ZNF385A
Interaction with AXIN1
Interaction with E4F1
Interaction with CARM1
Interaction with HIPK2
Bipartite nuclear localization signal
Nuclear export signal
Oligomerization
Basic (repression of DNA-binding)
The transcription factor binding sites are also provided in the PDB search result.