PREPARED FOR : Dr. Md. Khademul Islam
(Course Teacher)
PREPARED By : Naima Thahsin
ID : 13376001
Course : BTC 509: Genomics (Bioinformatics)
PRACTICAL NOTEBOOK ON:
BIOINFORMATICS
CONTENTS

Chapter 1: DNA Sequence Analysis
1.1 General
1.2 Finding protein coding regions (tools: GeneMark, GENSCAN)
1.3 Prediction of Promoters (tools: SoftBerry, Promoter 2.0)
1.4 Detection of Tandem Repeat (tool: Tandem repeat finder)
1.5 Masking interspersed repeats (tool: RepeatMasker)
1.6 Finding UTR location (tool: UTRScan)
1.7 Searching CpG Islands (tool: CpG Islands)
1.8 Predicting Transcription Factor Binding Sites (tool: TFSEARCH)
1.9 Designing PCR Primer and Calculating Standard Properties (tools: Primer3Plus, OligoCalc)
1.10 Restriction Mapping (tool: BioTools)

Chapter 2: Phylogenetic Relation Analysis
2.1 General
2.2 Sequence alignment (tools: Clustal Omega, T-Coffee)
2.3 Constructing phylogenetic tree (tool: MEGA)

Chapter 3: Protein Sequence Analysis
3.1 General
3.2 Primary Structure Analysis (tool: ProtParam)
3.3 Finding cleavage sites (tool: PeptideCutter)
3.4 Computing profile produced by any amino acid scale (tool: ProtScale)
3.5 Predicting post-translational modifications (tool: ScanProsite)
3.6 Predicting functional domain (tool: InterProScan)
3.7 Predicting secondary structure (tool: PSIPRED)
3.8 Retrieving 3D structure of a protein from PDB (tool: Protein Data Bank (PDB))
1. DNA Sequence Analysis
1.1 General
A gene is the molecular unit of heredity of a living organism. Genes hold the information to
build and maintain an organism's cells and pass genetic traits to offspring.
Basically a gene is a sequence of nucleic acids (DNA or, in the case of certain viruses RNA).
The vast majority of living organisms encode their genes in long strands of DNA
(deoxyribonucleic acid). Most DNA molecules are double-stranded helices, consisting of two
long biopolymers made of simpler units called nucleotides—each nucleotide is composed of
a nucleobase (guanine, adenine, thymine, and cytosine), recorded using the letters G, A, T,
and C, as well as a backbone made of alternating sugars (deoxyribose) and phosphate
groups (related to phosphoric acid), with the nucleobases (G, A, T, C) attached to the sugars.
The two strands of DNA run in opposite directions to each other and are therefore anti-
parallel (a strand running 5'-3' pairs with a complementary strand running 3'-5').
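The antiparallel pairing described above can be illustrated with a minimal Python sketch. The function name reverse_complement is only an illustration and is not part of any tool used in this notebook; the sketch assumes the input contains only the letters A, T, G and C.

```python
# Watson-Crick pairing for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written in the 5'->3' direction."""
    # Reading the input backwards accounts for the antiparallel orientation.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("ATGCCG"))  # CGGCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check for the pairing table.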
In biological systems, nucleic acids contain information which is used by a living cell to
construct specific proteins. Genes that encode proteins are composed of a series of three-
nucleotide sequences called codons, which serve as the words in the genetic language. Each
codon corresponds to a single amino acid, and there is a specific genetic code by which each
possible combination of three bases corresponds to a specific amino acid. However, a
significant portion of DNA (more than 98% for humans) is non-coding, meaning that these
sections do not serve a function of encoding proteins.
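As a small illustration of how codons map to amino acids, the following Python sketch builds the standard genetic code table and translates a DNA coding sequence codon by codon. The names CODON_TABLE and translate are chosen here for illustration only; the sketch assumes an upper- or lower-case DNA sequence read in frame from its first base, and stop codons are shown as '*'.

```python
BASES = "TCAG"
# Standard genetic code: amino acids for all 64 codons, with each codon
# position cycling through the bases in the order T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna: str) -> str:
    """Translate a DNA sequence codon by codon; trailing 1-2 bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate("ATGGCCTAA"))  # MA*
```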
All genes have regulatory regions in addition to regions that explicitly code for a protein or
RNA product. A regulatory region shared by almost all genes is known as the promoter,
which provides a position that is recognized by the transcription machinery when a gene is
about to be transcribed and expressed. Other possible regulatory regions include enhancers,
which can compensate for a weak promoter. Most regulatory regions are "upstream"—that
is, before or toward the 5' end of the transcription initiation site. Eukaryotic promoter
regions are much more complex and difficult to identify than prokaryotic promoters.
In bioinformatics, the term genetic sequence analysis refers to the process of subjecting a
DNA sequence to any of a wide range of analytical methods to understand its features,
function, structure, or evolution. Methodologies used include sequence alignment, searches
against biological databases, and others.
1.2 Finding protein coding regions in a DNA sequence
Protein coding genes have different structures in microbes and multicellular organisms. In
microbes, each protein is encoded by a simple DNA segment, running from start to end, called
an open reading frame (ORF). In animal and plant genes, proteins are encoded in several pieces
called exons, separated by noncoding segments called introns. Many websites provide tools
for finding ORFs or coding regions.
a) GeneMark
GeneMark is a family of ab initio gene prediction programs developed at the Georgia
Institute of Technology in Atlanta. GeneMark, developed in 1993, was the first gene finding
method recognized as an efficient and accurate tool for genome projects.
The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of
protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding
DNA. Parameters of the models are estimated from training sets of sequences of known
type. The major step of the algorithm computes the a posteriori probability that a sequence
fragment carries genetic code in one of six possible frames (including three frames in the
complementary DNA strand) or is "non-coding".
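The idea of six reading frames can be sketched in Python as below. This is a naive ATG-to-stop scan and is not the Markov chain model that GeneMark uses; it only shows how three frames on each strand are enumerated. The function names and the min_codons parameter (which counts the stop codon) are assumptions made for this illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def orfs_in_frame(seq, frame, min_codons):
    """Return (start, end) index pairs of ATG...stop ORFs in one reading frame."""
    found = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon opens an ORF
        elif start is not None and codon in STOP_CODONS:
            if (i + 3 - start) // 3 >= min_codons:
                found.append((start, i + 3))
            start = None                   # look for the next ORF in this frame
    return found

def six_frame_orfs(seq, min_codons=3):
    """Scan three frames on each strand; return (strand, frame, orf_sequence)."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    result = []
    for strand, s in (("+", seq), ("-", reverse)):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                result.append((strand, frame, s[start:end]))
    return result

print(six_frame_orfs("CCATGAAATTTTAGGG"))  # [('+', 2, 'ATGAAATTTTAG')]
```

A real gene finder would score each candidate region with a trained statistical model rather than accept every ATG-to-stop stretch.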
Procedure
1. Go to the homepage of GeneMark, http://exon.gatech.edu/genemark
2. Click on “GeneMark” on the right panel.
3. Choose the appropriate model from the given options (e.g. models for prokaryotes).
4. Paste the sequence to be checked, or upload the sequence file.
5. Change the parameters if needed.
6. Click on the “Start GeneMark” button.
Result of GeneMark
Result Interpretation
The result reports the G+C content (54.78 %), 3 possible coding sequences (CDS) with their
strand, length, and left (start) and right (stop) end positions, and the possible protein
sequences translated from the predicted coding sequences.
b) GENSCAN
In bioinformatics GENSCAN is a program to identify complete gene structures in genomic
DNA. It is a GHMM-based program that can be used to predict the location of genes and
their exon-intron boundaries in genomic sequences from a variety of organisms. It is a
eukaryotic ab initio gene finder that has achieved notable success. The GENSCAN Web
server can be found at MIT.
Procedure
1. Go to the GENSCAN home page through the link, http://genes.mit.edu/GENSCAN.html
2. Paste the nucleotide sequence of interest.
3. Click on the 'Run GENSCAN' button.
GENSCAN result
i. Predicted exons:
ii. Predicted peptide sequences:
Result Interpretation
The GENSCAN result provided the following information on the submitted sequence:
• G+C content: 41.22%
• The strand, start position, end position, length, reading frame and exon score of the
initial, internal and terminal exons and of the poly-A signal
• Predicted peptide sequence
The suboptimal exon cutoff value was set at 1.00. All exon scores in the result were above
this cutoff, so the prediction can be considered good.
1.3 Prediction of Promoters
A promoter is a region of DNA that initiates transcription of a particular gene. Promoters are
located near the genes they transcribe, on the same strand and upstream on the DNA.
Promoters can be about 100–1000 base pairs long.
For the transcription to take place, the enzyme that synthesizes RNA, known as RNA
polymerase, must attach to the DNA near a gene. Promoters contain specific DNA
sequences and response elements that provide a secure initial binding site for RNA
polymerase and for proteins called transcription factors that recruit RNA polymerase. These
transcription factors have specific activator or repressor sequences of corresponding
nucleotides that attach to specific promoters and regulate gene expressions.
a) SoftBerry
The SoftBerry BPROM program recognizes bacterial promoters with about 80% accuracy and
specificity. In bacteria, the promoter contains two short sequence elements located
approximately 10 and 35 nucleotides upstream of the transcription start site (the -10 and
-35 boxes).
Procedure
1. Go to the SoftBerry home page, http://www.softberry.com
2. From the left panel select ‘OPERON AND GENE FINDING IN BACTERIA’ and click on ‘BPROM’.
3. Paste the sequence of interest.
4. Click on the ‘PROCESS’ button.
Result
Result Interpretation
In the BPROM program the threshold level for promoters is 0.20. The scores for the -10 and
-35 boxes were 25 and 41, respectively, both of which were above the threshold level.
So, the prediction was quite good. The result also provided the positions of the boxes, at
154 and 134.
The result also provided information about the transcription factor binding sites for
rpoS17, ihf, g1pR, crp and rpoD19: the sequences of the sites, their positions and scores.
b) Promoter 2.0 Prediction Server
Promoter2.0 predicts transcription start sites of vertebrate PolII promoters in DNA
sequences. It has been developed as an evolution of simulated transcription factors that
interact with sequences in promoter regions. It builds on principles that are common to
neural networks and genetic algorithms.
Procedure
1. Go to the Promoter 2.0 home page, http://www.cbs.dtu.dk/services/Promoter/
2. Paste the nucleotide sequence of interest.
3. Click on the 'Submit' button.
Result Interpretation
According to the result, the transcription start site was predicted to be at position 800.
[Figure: score table for Promoter 2.0]
For the provided nucleotide sequence, the score was found to be 0.592, which indicates a
marginal prediction.
1.4 Detection of Tandem Repeat
Tandem repeats occur in DNA when a pattern of two or more nucleotides is repeated and
the repetitions are directly adjacent to each other. When between 10 and 60 nucleotides
are repeated, it is called a minisatellite. Those with fewer are known as microsatellites or
short tandem repeats. Tandem repeat describes a pattern that helps determine an
individual's inherited traits. Tandem repeats can be very useful in determining parentage.
Tandem repeat finder
Tandem Repeats Finder is a program to locate and display tandem repeats in DNA
sequences. To use the program, the user submits a sequence in FASTA format. The
program is very fast, analyzing sequences on the order of 0.5 Mb in just a few seconds.
Submitted sequences may be of arbitrary length. Repeats with pattern sizes in the range
from 1 to 2000 bases are detected.
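To make the notion of a tandem repeat concrete, here is a minimal Python sketch that finds exact, directly adjacent copies of a short unit using a regular expression back-reference. It is far simpler than Tandem Repeats Finder, which also tolerates mismatches and insertions or deletions; the function name and its parameters (min_unit, max_unit, min_copies) are assumptions made for this illustration.

```python
import re

def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, end, unit, copies) for exact tandem repeats.

    Positions are 1-based and inclusive. Only perfect copies are detected.
    """
    # ([ACGT]{a,b}?) captures the shortest candidate unit; \1{n,} then requires
    # at least n further adjacent copies of that same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start() + 1, match.end(), unit, copies))
    return hits

example = "CC" + "GATA" * 4 + "TT"
print(find_tandem_repeats(example))  # [(3, 18, 'GATA', 4)]
```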
Procedure
1. Go to the Tandem repeat finder home page, http://tandem.bu.edu/trf/trf.html
2. Click on ‘Submit a Sequence for Analysis’.
3. Select the option ‘Basic’ to use default parameters.
4. Choose the option ‘cut and paste sequence’.
5. Paste the sequence into the box provided.
6. Click on the ‘Submit sequence’ button.
Result
Result Interpretation
The result indicates that:
• 1 repeat was found in the given nucleotide sequence.
• The indices were within 126-186.
• The consensus size was 4 and the pattern was “GATA”.
• The score was 104, which was quite good.
1.5 Masking interspersed repeats in a sequence
In the mid-1960s scientists discovered that many genomes contain stretches of highly
repetitive DNA sequences. These sequences were later characterized and placed into several
categories, including Simple Repeats, Tandem Repeats, Segmental Duplications and
Interspersed Repeats. Interspersed repetitive DNA is found in all eukaryotic genomes and
comprises:
• Processed Pseudogenes,
• Retrotranscripts,
• SINEs,
• DNA Transposons,
• Retrovirus Retrotransposons and
• Non-Retrovirus Retrotransposons (LINEs)
Currently up to 50% of the human genome is known to be repetitive in nature, and as
improvements are made in detection methods this number is expected to increase.
RepeatMasker
RepeatMasker is a program that screens DNA sequences for interspersed repeats and low
complexity DNA sequences. The output of the program is a detailed annotation of the
repeats that are present in the query sequence as well as a modified version of the query
sequence in which all the annotated repeats have been masked (default: replaced by Ns).
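The masking step itself is simple once the repeat coordinates are known. The Python sketch below replaces annotated intervals with N; it does not detect repeats, which RepeatMasker does by comparing the query against a library of known repeats. The function name and the example coordinates are hypothetical.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace each 1-based, inclusive (start, end) interval with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        # Clamp the end so an interval running past the sequence is tolerated.
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)

print(mask_intervals("ACGTACGTAC", [(3, 6)]))  # ACNNNNGTAC
```

The masked sequence keeps its original length, so positions reported by later analyses still refer to the unmasked coordinates.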
Procedure
1. Go to the RepeatMasker home page through the link, http://www.repeatmasker.org/
2. Select the option 'RepeatMasking' from the left panel.
3. Paste the nucleotide sequence into the box provided.
4. Click on the 'Submit Sequence' button (the 'Reset' button clears the form).
Result
Result Interpretation
In the analyzed nucleotide sequence only one interspersed repetitive sequence was found.
The sequence was SINE that contained 51 base pairs. The repetitive sequence was detected
and masked.
1.6 Finding UTR location
In molecular genetics, an untranslated region (or UTR) refers to either of two sections (5'
UTR or 3'-UTR), one on each side of a coding sequence on a strand of mRNA.
The five prime untranslated region (5' UTR) (also known as a Leader Sequence or Leader
RNA) is the region of an mRNA that is directly upstream from the initiation codon. This
region is important for the regulation of translation of a transcript.
On the other hand, the three prime untranslated region (3'-UTR) is the section of messenger
RNA (mRNA) that immediately follows the translation termination codon. The 3'-UTR often
contains regulatory regions that influence post-transcriptional gene expression. Regulatory
regions within the 3'-untranslated region can influence polyadenylation, translation
efficiency, localization, and stability of the mRNA.
UTRScan
UTRscan is a pattern matcher which searches protein or nucleotide (DNA, RNA, tRNA)
sequences in order to find UTR motifs. It is able to find, in a given sequence, motifs that
characterize 3'UTR and 5'UTR sequences. Such motifs are defined in the UTRSite Database, a
collection of functional sequence patterns located in the 5'- or 3'-UTR sequences.
The UTRsite entries describe the various regulatory elements present in UTR regions
whose functional role has been established experimentally. The UTRsite database can
prove very useful for automatic annotation of anonymous sequences generated by
sequencing projects as well as for finding previously undetected signals in known gene
sequences.
Procedure
1. Go to the UTRScan home page through the link, http://itbtools.ba.itb.cnr.it/
2. Paste the nucleotide sequence in FASTA format.
3. Insert a valid email address.
4. Click on the 'Submit' button.
Result
a) List of UTR motifs defined in the UTRSite Database
b) Status of provided sequence
Result Interpretation
The UTRScan program found the following UTR motifs in the provided sequence:
Signal   Name
IRES     Internal Ribosome Entry Site
K-B      K-Box
uORF     Upstream Open Reading Frame
MBE      Musashi binding element
A total of 9 matches for 4 signals were found in the sequence. The position and sequence of
the UTR motifs were also reported by UTRScan.
1.7 Search for CpG Islands
In genetics, CpG islands or CG islands (CGI) are genomic regions with at least 200 bp that
contain a high frequency of CpG sites. The "p" in CpG refers to the phosphodiester bond
between the cytosine and the guanine, which indicates that the C and the G are next to each
other in sequence, regardless of being single- or double- stranded. In a CpG site, both C and
G are found on the same strand of DNA or RNA and are connected by a phosphodiester
bond.
CpG Islands
CpG Islands reports potential CpG island regions using the method described by Gardiner-
Garden and Frommer (1987). The calculation is performed using a 200 bp window moving
across the sequence at 1 bp intervals.
CpG islands are defined as sequence ranges where the Obs/Exp value is greater than 0.6 and
the GC content is greater than 50%. The expected number of CpG dimers in a window is
calculated as the number of 'C's in the window multiplied by the number of 'G's in the
window, divided by the window length.
CpG islands are often found in the 5' regions of vertebrate genes, therefore this program
can be used to highlight potential genes in genomic sequences.
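The window calculation described above can be sketched in Python as follows. The thresholds (Obs/Exp above 0.6, GC content above 50%, 200 bp window, 1 bp step) come from the description; the function names are assumptions made for this illustration, and the sketch is not the code of the CpG Islands tool itself.

```python
def cpg_windows(seq, window=200, step=1):
    """Yield (start, end, obs_exp, gc_percent) for each window (1-based, inclusive)."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        c_count = w.count("C")
        g_count = w.count("G")
        observed = w.count("CG")                 # CpG dimers seen in the window
        expected = c_count * g_count / window    # (number of C * number of G) / length
        obs_exp = observed / expected if expected else 0.0
        gc_percent = 100.0 * (c_count + g_count) / window
        yield start + 1, start + window, obs_exp, gc_percent

def is_island(obs_exp, gc_percent):
    """Apply the two cutoffs stated in the text."""
    return obs_exp > 0.6 and gc_percent > 50.0

for start, end, obs_exp, gc in cpg_windows("CG" * 100):
    print(start, end, obs_exp, gc, is_island(obs_exp, gc))  # 1 200 2.0 100.0 True
```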
Procedure
1. Go to the CpG Islands homepage, http://www.bioinformatics.org/sms2/cpg_islands.html
2. Paste the sequence of interest in FASTA format.
3. Click on the 'Submit' button.
Result Interpretation
The range of GC content was found to be 54.50-64 % in the given sequence which was
greater than the cutoff value (50%).
1.8 Prediction of Transcription Factor Binding Sites
In molecular biology and genetics, a transcription factor (sometimes called a sequence-
specific DNA-binding factor) is a protein that binds to specific DNA sequences, thereby
controlling the flow (or transcription) of genetic information from DNA to messenger RNA.
Transcription factors perform this function alone or with other proteins in a complex, by
promoting (as an activator), or blocking (as a repressor) the recruitment of RNA polymerase
(the enzyme that performs the transcription of genetic information from DNA to RNA) to
specific genes.
A defining feature of transcription factors is that they contain one or more DNA-binding
domains (DBDs), which attach to specific sequences of DNA adjacent to the genes that they
regulate.
TFSEARCH
TFSEARCH program was written by Yutaka Akiyama (Kyoto University, currently at RWCP) in
1995. TFSEARCH searches highly correlated sequence fragments versus TFMATRIX
transcription factor binding site profile database in 'TRANSFAC' databases developed at GBF-
Braunschweig, Germany.
Procedure
1. Go to the TFSEARCH home page through the link, http://www.cbrc.jp/research/db/TFSEARCH.html
2. Enter any label for the sequence into the top field.
3. Paste the nucleotide sequence in FASTA format into the second field.
4. Set the 'Threshold score' if necessary.
5. Click on the 'Exec' button to submit the query sequence to the server.
Result
Result Interpretation
The given sequence was analyzed for transcription factor binding sites. A total of 12 high-
scoring sites were found in the sequence. All of them were above the threshold level (85.0).
The maximum score was 95.4 and the minimum score was 85.3. The sequence was predicted to
be associated with the following transcription factors:
• HSF (Heat shock factor 1)
• HSF2 (Heat shock factor 2)
• ADR1 (alcohol dehydrogenase 1)
• GATA 1 (globin transcription factor 1)
• GATA 2 (globin transcription factor 2)
1.9 Designing PCR Primer and Calculating Standard Properties
The polymerase chain reaction, usually referred to as PCR, is an extremely powerful
procedure that allows the amplification of a selected DNA sequence in a genome a
million-fold or more in vitro, without the use of living cells during the cloning process. In this
technique, the known part of the DNA is used to design two synthetic DNA oligonucleotides,
one complementary to each strand of the DNA double helix and lying on opposite sides of
the region to be amplified. These oligonucleotides serve as primers for in vitro DNA
synthesis, which is catalyzed by DNA polymerase. Primers are required for DNA replication
because the enzymes that catalyze this process, DNA polymerases, can only add new
nucleotides to an existing strand of DNA.
a) Primer3Plus
The Internet site of the University of Massachusetts Medical School
(biotools.umassmed.edu) provides a link to a very complete and easy-to-use tool for
primer design, Primer3Plus. Primer3Plus picks primers for PCR reactions
according to the conditions specified by the user. Primer3Plus considers things like melting
temperature, concentrations of various solutions in PCR reactions, primer bending and
folding, and many other conditions when attempting to choose the optimal pair of primers
for a reaction. All of these conditions are user-specifiable and can vary from reaction to
reaction.
Procedure
1. Go to the Bio Tools home page through the link, http://biotools.umassmed.edu/cgi-bin/primer3plus/primer3plus.cgi
2. Select 'Primer3Plus' from the 'DNA Sequence Analysis' tools.
3. Paste the nucleotide sequence and change the parameters as necessary.
4. Click on the 'Pick Primers' button to submit the query sequence to the server.
Result
Result Interpretation
Primer3Plus provided 5 pairs of primers for the given nucleotide sequence. Each pair (left
and right primers) has suitable features, such as length, melting temperature and GC content,
that fit the provided settings. The first pair contains primers 20 nucleotides long:
Left Primer 1: GCCTCCTAATTCGGGCAGAA
Right Primer 1: AAGGATGGGGTCTCCTCCTC
This primer pair is capable of amplifying a 590 bp product from the nucleotide sequence.
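Two of the primer properties mentioned above, length and GC content, can be checked by hand for the first primer pair. The Python sketch below also estimates a melting temperature with the simple Wallace rule, Tm = 2(A+T) + 4(G+C). This rule is only a rough approximation and is an assumption of this sketch; Primer3Plus computes its reported Tm with a more detailed model, so its values are expected to differ.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

primers = [("Left Primer 1", "GCCTCCTAATTCGGGCAGAA"),
           ("Right Primer 1", "AAGGATGGGGTCTCCTCCTC")]
for name, primer in primers:
    print(name, len(primer), gc_percent(primer), wallace_tm(primer))
# Left Primer 1 20 55.0 62
# Right Primer 1 20 60.0 64
```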
b) OligoCalc
OligoCalc is a web-accessible, client-based computational engine for reporting DNA and RNA
single-stranded and double-stranded properties, including molecular weight, solution
concentration, melting temperature, estimated absorbance coefficients, inter-molecular
self-complementarity estimation and intra-molecular hairpin loop formation. OligoCalc has a
familiar ‘calculator’ look and feel, making it readily understandable and usable.
Method
1. Go to the OligoCalc home page through the link, http://www.basic.northwestern.edu/biotools/oligocalc.html
2. Paste the oligonucleotide sequence of the primer.
3. Click anywhere to get the properties of the given sequence.
Result
1.10 Restriction Mapping
A restriction map is a map of known restriction sites within a sequence of DNA. Restriction
mapping requires the use of restriction enzymes. Restriction enzymes are enzymes that cut
DNA at specific recognition sequences called "sites." They probably evolved as a bacterial
defense against DNA bacteriophage. DNA invading a bacterial cell defended by these
enzymes will be digested into small, non-functional pieces. The name "restriction enzyme"
comes from the enzyme's function of restricting access to the cell.
There are hundreds of restriction enzymes that have been isolated and each one recognizes
its own specific nucleotide sequence. Sites for each restriction enzyme are distributed
randomly throughout a particular DNA stretch. Digestion of DNA by restriction enzymes is
very reproducible; every time a specific piece of DNA is cut by a specific enzyme, the same
pattern of digestion will occur. Restriction enzymes are commercially available and their use
has made manipulating DNA very easy.
BioTools-Restriction mapping tool
One approach in constructing a restriction map of a DNA molecule is to sequence the whole
molecule and to run the sequence through a computer program that will find the
recognition sites that are present for every restriction enzyme known. ‘BioTools’ provides an
application, Restriction mapping tool, which allows the user to supply both DNA sequence
and (optionally) his own file of Restriction Enzymes or other IUPAC patterns in GCG for
Restriction Enzyme Mapping and Analysis, using Harry Mangalam's tacg 4.3 program as the
analysis engine.
Procedure
1. Go to the BioTools home page (http://biotools.umassmed.edu/).
2. Select 'Restriction mapping tool' from the panel.
3. Paste the DNA sequence in the 'Sequence Entry' box.
4. Select restriction enzymes from the list.
5. Change other parameters as necessary.
6. Click on the 'Submit Sequence to WWWtacg' button.
Result
Result Interpretation
The Restriction mapping tool of the ‘BioTools’ server analyzed the given nucleotide sequence
and reported 6 hits for the 3 selected restriction enzymes, EcoRI, HindIII and BamHI: 3 hits
for BamHI, 2 for HindIII and 1 for EcoRI. These enzymes recognize and cut at the
following sites of the nucleotide sequence:
Restriction Enzyme   Site     Position
BamHI                GGATCC   1240, 1865, 2085
HindIII              AAGCTT   1466, 2115
EcoRI                GAATTC   2064
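Finding recognition sites is essentially a string search. The Python sketch below looks for the three recognition sequences listed above and reports the 1-based position where each site starts. It is not the tacg program, and the positions it reports are site start positions, which may differ from the cut positions a mapping tool reports. The function name and the example sequence are made up for illustration.

```python
# Recognition sequences of the three enzymes used in this exercise.
SITES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC"}

def find_sites(seq, sites=SITES):
    """Return {enzyme: [1-based start positions of each recognition site]}."""
    seq = seq.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        i = seq.find(site)
        while i != -1:
            positions.append(i + 1)
            i = seq.find(site, i + 1)   # continue searching after this hit
        hits[enzyme] = positions
    return hits

print(find_sites("AAGGATCCTTGAATTCAAGCTTGGATCC"))
# {'EcoRI': [11], 'HindIII': [17], 'BamHI': [3, 23]}
```

All three recognition sequences are palindromic (each equals its own reverse complement), so scanning one strand is sufficient for these enzymes.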
2. Phylogenetic relation Analysis
2.1 General
Phylogenetics is the study of the evolutionary relationships of living organisms using treelike
diagrams to represent pedigrees of these organisms. Phylogenetics can be studied in various
ways. Molecular data that are in the form of DNA or protein sequences can provide very
useful evolutionary perspectives of existing organisms because, as organisms evolve, the
genetic materials accumulate mutations over time causing phenotypic changes. Through
comparative analysis of these biological molecules from a number of related organisms, the
evolutionary history of the genes or proteins and even the organisms can be revealed.
Usually, similarities and divergence among related biological sequences revealed by
sequence alignment are rationalized and visualized in the context of phylogenetic trees.
Therefore the study of phylogenetic relationships generally involves sequence alignment
and construction of a phylogenetic tree.
2.2 Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA,
or protein to identify regions of similarity that may be a consequence of functional,
structural, or evolutionary relationships between the sequences.[1] Aligned sequences of
nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps
are inserted between the residues so that identical or similar characters are aligned in
successive columns.
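The row-and-column view of an alignment can be illustrated with a short Python sketch that reports which columns are fully conserved, that is, where every sequence carries the same residue and none has a gap. The function name and the three toy rows are invented for this illustration; they are not output from Clustal Omega or T-Coffee.

```python
def conserved_columns(rows):
    """Return 1-based indices of columns with one identical residue and no gap."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return [
        index + 1
        for index, column in enumerate(zip(*rows))   # zip(*rows) walks the columns
        if len(set(column)) == 1 and column[0] != "-"
    ]

alignment = ["MK-LV",
             "MKALV",
             "MR-LV"]
print(conserved_columns(alignment))  # [1, 4, 5]
```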
Objectives:
• To understand the similarities among a group of sequences
• To determine conserved regions
• To understand the evolutionary relationship among related sequences.
To do so 10 protein sequences of Small Membrane Protein for different species of
Coronaviridae were retrieved from NCBI and analyzed through both Clustal Omega and T-
Coffee. The comparison between the results from both tools is given later.
a) Clustal Omega
Clustal Omega is a new multiple sequence alignment program that uses seeded guide trees
and HMM profile-profile techniques to generate alignments. It produces biologically
meaningful multiple sequence alignments of divergent sequences. Evolutionary
relationships can be seen via viewing Cladograms or Phylograms.
Procedure
1. Go to the Clustal Omega home page, http://www.ebi.ac.uk/Tools/msa/clustalo/
2. Paste the retrieved protein sequences in multi-FASTA format.
3. Click on the 'Submit' button to submit the sequences to the server.
Result
b) T-Coffee
T-Coffee (Tree-based Consistency Objective Function For alignment Evaluation) is a multiple
sequence alignment program that uses a progressive approach. It generates a library of
pairwise alignments to guide the multiple sequence alignment. It can also combine multiple
sequence alignments obtained previously, and in the latest versions it can use structural
information from PDB files (3D-Coffee). It has advanced features to evaluate the quality of
the alignments and some capacity for identifying the occurrence of motifs.
Procedure
1. Go to the T-Coffee home page (http://tcoffee.vital-it.ch/apps/atcoffee/index.html).
2. Select the 'T-Coffee' tool from the panel.
3. Paste the retrieved protein sequences in multi-FASTA format.
4. Click on the 'Submit' button to submit the sequences to the server.
Result
Result Interpretation and Comparison between results from Clustal Omega
and T-Coffee
The sequence alignment was found to be better with T-Coffee than Clustal Omega. Along
with aligned sequences, T-Coffee also provides the user alignment score for the input
sequences. For the given sequences following scores were found-
gi|530341189|gb : 47
gi|530802146|gb : 43
gi|148728344|gb : 46
gi|530802593|gb : 39
gi|56807328|ref : 42
gi|126030129|re : 42
gi|211907043|gb : 32
gi|212681391|re : 32
gi|187251957|re : 32
gi|33304216|gb| : 44
cons : 44
T-Coffee showed 2 conserved regions, whereas 1 was found with Clustal Omega. The number of
regions with matches was also greater with T-Coffee than with Clustal Omega. However, an
advantage of Clustal Omega is that it provides a tool for building a phylogenetic tree, which
is available if Java is installed.
2.3 Constructing phylogenetic tree
A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the
inferred evolutionary relationships among various biological species or other entities —
their phylogeny — based upon similarities and differences in their physical and/or genetic
characteristics. The taxa joined together in the tree are implied to have descended from a
common ancestor.
MEGA
MEGA (Molecular Evolutionary Genetics Analysis; Windows v5.2.2 was used here) is a software
package that provides tools for both multiple sequence alignment and construction of
phylogenetic trees.
Procedure
a) MEGA was downloaded from http://www.megasoftware.net/ and installed in the
Windows 7 environment.
b) 10 protein sequences of Nucleocapsid protein for different species of Coronaviridae
were retrieved from NCBI
c) The Multifasta file containing protein sequences was run by MEGA.
The flowchart of the procedure is as follows:
1. Open MEGA 5.2.2.
2. Open a file in FASTA format.
3. Select the option 'Align'.
4. Select 'Muscle' from the upper panel to align the protein sequences.
5. Set the parameters to default in the settings window and click on 'Compute'.
6. Save the session in MAS format.
7. Click on the 'Phylogeny' option in the upper panel and select 'Maximum Likelihood'.
8. Open the file containing the protein sequences saved in 'mas' format.
9. Click on the 'Compute' button.
Result
Result Interpretation
According to the inferred phylogenetic tree based on protein sequences from different
species of Coronaviridae:
• Two broad subgroups (B and C) have descended from a common ancestor A.
• In subgroup B, Bulbul coronavirus HKU11 and Munia coronavirus HKU13 are
closely related; together they are related to Beluga whale coronavirus SW1, and all three
are descended from the ancestor F. Group F is related to another group, E, that
includes two closely related virus species, Human coronavirus OC43 and Human
coronavirus HKU1. Groups F and E are descendants of D, which is descended from
B. Group B also gives rise to an outgroup, Pipistrellus bat coronavirus HKU5, which is
closer to group E than to F.
• In subgroup C, Human coronavirus 229E and Human coronavirus NL63 are
closely related; together they are related to Porcine epidemic diarrhea virus, and all three
are descended from the ancestor H. Group H is descended from the ancestor C,
which also gives rise to an outgroup, Transmissible gastroenteritis virus.
3. Protein sequence Analysis
3.1. General
Proteins are one of the important fundamental units of all living cells. Proteins have a wide
range of functions within all the living beings. Some of the important functions such as DNA
replication, catalysis of metabolic reactions, transportation of molecules from one location
to another etc. are performed with the help of proteins.
The building blocks of proteins are amino acids. Each amino acid contains an amine (-NH2)
and a carboxylic acid (-COOH) functional group as well as a side chain (R group) that is
specific to that amino acid. Twenty standard amino acids, differing in their R groups, are
found in the proteins of the human body. In proteins, the amino acids are linked to each other
by peptide bonds. A peptide bond is formed when the carboxyl group of one amino acid is
linked to the amino group of another through a covalent bond.
Proteins differ from one another in their structure, primarily in their sequence of amino
acids. Protein structure is described at four levels of organization: primary, secondary,
tertiary and quaternary. The primary structure is the linear sequence of amino acids in the
polypeptide chain. Hydrogen bonding between the amide groups of the polypeptide backbone
gives rise to the secondary structure; alpha helices and beta sheets are the two most
important secondary structures in proteins. The tertiary structure is the three-dimensional
structure of a single protein molecule. The quaternary structure is formed by several
protein molecules or polypeptide chains.
3.2. Primary Structure Analysis of a Protein
Different tools are available through the ExPASy server to analyze a protein sequence.
ExPASy is the SIB Bioinformatics Resource Portal. It provides access to several scientific
databases and software tools in many areas of life sciences including proteomics, genomics,
phylogeny, systems biology, population genetics, transcriptomics etc.
ProtParam is one of the protein analysis tools available on the ExPASy server and can
be accessed through the link, http://www.expasy.org/tools/protparam.html. It is used for
calculating various physicochemical parameters of a given protein. The protein sequence
is the only input needed to calculate these parameters.
In ProtParam, the protein can be specified as:
• a UniProtKB/Swiss-Prot accession number,
• a UniProtKB/TrEMBL accession number,
• a sequence ID, or
• an amino acid sequence.
The parameters computed by ProtParam are molecular weight, amino acid composition,
extinction coefficient, estimated half-life, theoretical pI, grand average of
hydropathicity (GRAVY), aliphatic index, and instability index.
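As a rough illustration of how two of these parameters are obtained, the sketch below computes the amino acid composition and the GRAVY score in Python. GRAVY is the sum of the Kyte & Doolittle hydropathy values of all residues divided by the sequence length. The function names and the toy peptide are assumptions made for this example; they are not part of ProtParam, and the sketch handles only the 20 standard one-letter codes.

```python
from collections import Counter

# Kyte & Doolittle (1982) hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Return the percentage of each residue type present in the sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def gravy(seq):
    """GRAVY: sum of the hydropathy values divided by the sequence length."""
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Hypothetical toy peptide, not the protein analysed in this notebook.
toy = "MKVLAAGIV"
print(round(gravy(toy), 3))
print(composition(toy))
```

A positive GRAVY value points to an overall hydrophobic sequence and a negative value to an overall hydrophilic one.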
Objectives
• To compute the various physical and chemical parameters of a protein.
• To perform primary structure analysis of proteins.
Procedure
1. Go to the ProtParam home page, http://www.expasy.org/tools/protparam.html
2. Paste the FASTA sequence of the protein of interest
3. Click on the ‘Compute parameters’ button
[Screenshot: ProtParam home page]
[Screenshot: FASTA sequence of the protein pasted into the input box]
Result
[Screenshot: ProtParam output]
Result Interpretation
From the ProtParam result we found that:
• The estimated half-life is 30 hours, which means that half of the protein present in a
cell is predicted to be degraded within 30 hours of its synthesis.
• The instability index of the analyzed protein is 37.87, which is below the cut-off
value of 40, so the protein is classified as stable.
• The computed aliphatic index reflects the relative volume of the protein occupied by
aliphatic side chains (alanine, valine, isoleucine, and leucine); the protein has a high
proportion of such residues.
• The grand average of hydropathicity (GRAVY) of the protein is 0.327. The positive score
indicates that the protein is hydrophobic overall.
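The aliphatic index and the interpretation rules used above can be sketched in a few lines of Python. The formula is the one published by Ikai (1980): the mole percentage of alanine, plus 2.9 times that of valine, plus 3.9 times the sum of those of isoleucine and leucine. The function names are hypothetical; the cut-off of 40 for the instability index and the sign rule for GRAVY are the ones stated in the interpretation.

```python
def aliphatic_index(seq):
    """Aliphatic index (Ikai, 1980) from the mole percentages of A, V, I and L."""
    seq = seq.upper()

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def interpret(instability_index, gravy_score):
    """Apply the two rules of thumb used in the result interpretation."""
    stability = "stable" if instability_index < 40 else "unstable"
    hydropathy = "hydrophobic" if gravy_score > 0 else "hydrophilic"
    return stability, hydropathy


# Values reported by ProtParam for the protein analysed in this section.
print(interpret(37.87, 0.327))
```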
3.3 Finding cleavage sites in a given protein sequence
PeptideCutter searches a protein sequence, either taken from the SWISS-PROT and/or TrEMBL
databases or entered by the user, for protease cleavage sites. The search can use a single
protease or chemical, a selection of them, or the whole list. Most of the cleavage rules
for individual enzymes were deduced from specificity data summarized by Keil (1992).
The results are available in different forms. Cleavage sites can be listed in tables,
grouped either alphabetically by enzyme name or sequentially by amino acid number. A third
option is a map of cleavage sites, in which the sequence and the cleavage sites mapped onto
it are grouped in blocks whose size the user can choose to provide a convenient form of
print-out.
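To make the idea of a cleavage rule concrete, here is a minimal Python sketch of the commonly quoted simplified rule for trypsin: cleavage on the C-terminal side of lysine (K) or arginine (R), unless the next residue is proline (P). This is an assumption made for illustration only. PeptideCutter applies more detailed rules with additional exceptions, and the function names below are hypothetical.

```python
def trypsin_sites(seq):
    """Return the 1-based positions after which the simplified rule cleaves."""
    seq = seq.upper()
    sites = []
    for i, aa in enumerate(seq[:-1]):  # no cleavage after the last residue
        if aa in "KR" and seq[i + 1] != "P":
            sites.append(i + 1)
    return sites


def digest(seq, sites):
    """Split the sequence into peptides at the given 1-based positions."""
    seq = seq.upper()
    bounds = [0] + list(sites) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical toy peptide, not the protein analysed in this notebook.
toy = "AKRPGRA"
sites = trypsin_sites(toy)
print(sites, digest(toy, sites))
```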
Method
1. Go to the PeptideCutter home page, http://web.expasy.org/peptide_cutter/
2. Paste the FASTA sequence of the protein of interest
3. Select enzymes and chemicals if necessary
4. Click on the ‘Perform’ button
[Screenshot: PeptideCutter home page]
[Screenshot: Selection of parameters]
Result
[Screenshot: PeptideCutter output]
Map of cleavage sites
The cleavage sites for a single enzyme, e.g. trypsin, mapped onto the entered protein
sequence are shown below:
[Screenshot: map of cleavage sites]
Discussion
With the bioinformatic tool PeptideCutter we can predict the potential sites at which
proteases or chemicals cleave a given protein sequence.
If we know the cleavage sites of a protein, we can choose an enzyme that cuts the protein
in a specific way. This is useful if we are interested in carrying out experiments on a
portion of our protein.
PeptideCutter can also help us to:
• separate the domains in our protein
• identify potential post-translational modifications by mass spectrometry
• remove a tag protein when we express a fusion protein
• make sure that the protein we are cloning is not sensitive to endogenous proteases
3.4 Computing profile produced by any amino acid scale
ProtScale computes and represents (in the form of a two-dimensional plot) the
profile produced by any amino acid scale on a selected protein.
An amino acid scale is defined by a numerical value assigned to each type of amino acid.
The most frequently used scales are hydrophobicity scales and secondary structure
conformational parameter scales. Most hydrophobicity scales were derived from experimental
studies on the partitioning of peptides in apolar and polar solvents, with the goal of
predicting membrane-spanning segments, which are highly hydrophobic. In addition, many
other scales exist which are based on different chemical and physical properties of the
amino acids.
ProtScale can be used with 50 predefined scales taken from the literature. The scale
values for the 20 amino acids, as well as a literature reference, are provided on ExPASy
for each of these scales. To generate data for a plot, the protein sequence is scanned
with a sliding window of a given size. At each position, the mean scale value of the amino
acids within the window is calculated, and that value is plotted for the midpoint of the
window. We can set several parameters that control the computation of a scale profile,
such as the window size, the weight variation model, the window edge relative weight
value, and scale normalization.
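The sliding-window calculation described above can be sketched in Python as follows. The sketch assumes uniform weights inside the window (that is, a window edge relative weight of 100%) and no normalization, and it reports one value per full window at the window midpoint. The function name and the toy sequence are assumptions made for this example, not part of ProtScale.

```python
# Kyte & Doolittle (1982) hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(seq, scale, window=19):
    """Return (midpoint position, mean scale value) for every full window."""
    seq = seq.upper()
    if window % 2 == 0 or window > len(seq):
        raise ValueError("window must be odd and no longer than the sequence")
    half = window // 2
    profile = []
    for start in range(len(seq) - window + 1):
        values = [scale[aa] for aa in seq[start:start + window]]
        # Positions are 1-based, as in the ProtScale plot.
        profile.append((start + half + 1, sum(values) / window))
    return profile


# Hypothetical toy sequence, not the protein analysed in this notebook.
toy = "MKTLLLTLVVVTIVCLDLGYTRSDEKPRQ"
for position, score in scale_profile(toy, KYTE_DOOLITTLE, window=19):
    print(position, round(score, 3))
```

Because each value is assigned to the window midpoint, the first and last (window - 1) / 2 residues of the sequence receive no score in this sketch.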
Objective
• To use a hydrophobicity scale to identify hydrophobic segments within the protein
sequence.
• To predict transmembrane segments in the given protein.
Method
1. Go to the ProtScale home page, http://web.expasy.org/protscale/
2. Paste the FASTA sequence of the desired protein
3. Choose an amino acid scale from the list (e.g., Hphob. / Kyte & Doolittle)
4. Set the window size to 19
5. Normalize the scale, if necessary
6. Click on the ‘Submit’ button
[Screenshot: ProtScale home page]
[Screenshot: Selection of parameters]
Result
[Plot: ProtScale profile using the Hphob. / Kyte & Doolittle scale]
[Plot: ProtScale profile with normalized scale]
Discussion
Hydrophobicity scales assign values that define the relative hydrophobicity of amino acid
residues. The more positive the value, the more hydrophobic the amino acids located in
that region of the protein. Hydrophobic segments are characteristic of transmembrane
proteins.
The desired protein sequence was analyzed using the Kyte & Doolittle hydrophobicity scale.
The recommended threshold value when using Kyte & Doolittle with a window size of 19 is
1.6. In the result, four regions of the given protein sequence were found above the
threshold. The highest peak was found at the N-terminus of the sequence, which suggests
the presence of a hydrophobic transmembrane segment or signal peptide there and is
consistent with the protein being secreted.
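Reading regions above the threshold off the plot can also be done programmatically. The sketch below groups consecutive window midpoints whose score exceeds 1.6 into segments. It assumes a profile given as a list of (position, score) pairs with consecutive positions; the function name and the example profile values are made up for illustration and are not the values obtained for the analyzed protein.

```python
def segments_above(profile, threshold=1.6):
    """Group consecutive profile positions whose score exceeds the threshold."""
    segments = []
    start = None
    previous = None
    for position, score in profile:
        if score > threshold:
            if start is None:
                start = position  # a new segment begins here
            previous = position
        elif start is not None:
            segments.append((start, previous))
            start = None
    if start is not None:  # close a segment that runs to the end
        segments.append((start, previous))
    return segments


# Made-up profile values, for illustration only.
example = [(10, 1.0), (11, 1.7), (12, 2.0), (13, 1.5), (14, 1.9)]
print(segments_above(example))
```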
3.5 Predicting post-translational modifications in protein
Proteins often need to be modified before they become active in the cell. Such changes are
called post-translational modifications. They may involve adding sugars, modifying amino
acids, or removing pieces of the newly synthesized protein. When studying a new protein,
we may want to know which modifications it undergoes. This is also important if we want to
clone and express a human protein in bacteria, because the protein may need
post-translational modifications that the bacterium cannot make in order to be active.
PROSITE is a database of short sequence motifs (also called patterns) that experiments
have associated with particular biological properties. Many of these patterns are
associated with post-translational modifications. On the ExPASy server (www.expasy.org),
we can compare our protein sequence with the collection of patterns in PROSITE and find
out which modifications our protein is likely to undergo.
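A PROSITE pattern is essentially a compact regular expression, so the comparison can be illustrated with a short Python sketch. The converter below covers only the basic syntax: "-" separates positions, "x" means any residue, "[...]" lists allowed residues, "{...}" lists forbidden residues, "(n)" or "(n,m)" gives repetitions, and "<" and ">" anchor the pattern to the N- or C-terminus. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern (PS00001). The function names and the toy sequence are assumptions made for this example; this is not how ScanProsite itself is implemented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")  # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")   # repetition counts
    regex = regex.replace("x", ".")                     # any residue
    return regex.replace("<", "^").replace(">", "$")    # terminal anchors


def scan(seq, pattern):
    """Return (1-based start, matched text) for all matches, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(seq.upper())]


# Hypothetical toy sequence, not the protein analysed in this notebook.
print(scan("MNGTANPSA", "N-{P}-[ST]-{P}"))
```

A match only shows that the sequence fits the pattern; short patterns such as this one occur frequently by chance, so a hit is a prediction and not proof of modification.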
Objective
• To scan our protein of interest for matches against the PROSITE collection of motifs.
• To find out the post-translational modifications in that protein.
Methods
1. Go to the ScanProsite home page, http://prosite.expasy.org/scanprosite/
2. Enter the UniProtKB accession number
3. Select ‘Exclude profiles from the scan’
4. Click on the ‘Start the scan’ button
[Screenshot: ScanProsite home page]
[Screenshot: Selecting parameters]
Result
[Screenshot: ScanProsite output]
Result Interpretation & Discussion
The ScanProsite result shows 6 hits by 3 distinct patterns, that is, 3 types of short
sequence motifs which may be associated with post-translational modifications.
The sequence motifs are:
• Multicopper oxidase 1
• FA58C 1 (Coagulation factor 5/8 type C domain)
• FA58C 2
The PDB structure viewer shows the 3D structure of FA58C 1 and FA58C 2 associated with the
A chain of the protein structure.
3.6 Predicting functional domain in protein sequence
InterPro is a database of protein families, domains and functional sites in which identifiable
features found in known proteins can be applied to new protein sequences in order to
functionally characterize them. The contents of InterPro are based around diagnostic
signatures and the proteins that they significantly match. The signatures consist of models
(simple types, such as regular expressions or more complex ones, such as Hidden Markov
models) which describe protein families, domains or sites. Models are built from the amino
acid sequences of known families or domains and they are subsequently used to search
unknown sequences (such as those arising from novel genome sequencing) in order to
classify them.
InterProScan is a bioinformatics tool that is available in InterPro via a webserver. It provides
a one-stop-shop for automated sequence analysis of both protein and nucleic acid. It offers
the researcher the ability to identify both structural and functional regions of interest and to
quickly characterize a new or novel sequence with considerable confidence.
Objective
A protein domain is a conserved part of a given protein sequence and structure that can
evolve, function, and exist independently of the rest of the protein chain. Domains vary
in length from about 25 to about 500 amino acids. Our objective here is to find the
functional domains in a given protein sequence.
Method
1. Go to the InterProScan home page through the link www.ebi.ac.uk/InterProScan/
2. Paste the sequence of the protein of interest
3. Click on the 'Submit' button
[Screenshot: InterProScan home page]
Result
[Screenshot: InterProScan output]
Discussion
A number of algorithms (14) available in the InterProScan tool were selected to find
functional domains in the provided protein sequence.
According to PRINTS, the protein sequence contains the LEUZIPPRFOS domain, a 5-element
fingerprint that provides a signature for the leucine zipper and DNA-binding domains
characteristic of the fos oncogenes and fos-related proteins. PFAM, SMART, PROSITE and
PROFILE also confirmed the presence of a leucine zipper domain in the protein sequence.
• The DNA-binding region comprises a number of basic amino acids such as arginine
and lysine.
• The 'leucine zipper' is a structure that is believed to mediate the function of several
eukaryotic gene regulatory proteins. The zipper consists of a periodic repetition of
leucine residues at every seventh position, and regions containing them appear to
span 8 turns of alpha helix. The leucine side chains that extend from one helix
interact with those from a similar helix, facilitating dimerisation in the form of
a coiled coil.
Proteins containing this domain are transcription factors.
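The periodic repetition of leucine at every seventh position can be searched for directly. The sketch below looks for four leucines spaced seven residues apart, which corresponds to the PROSITE leucine zipper pattern L-x(6)-L-x(6)-L-x(6)-L (PS00029). This is a simple pattern search written for illustration, with a hypothetical function name and toy sequence; it is not the fingerprint method used by PRINTS, and a match on its own does not show that a region forms a coiled coil.

```python
import re

# Four leucines, each separated by six arbitrary residues (a heptad spacing).
# The lookahead lets overlapping candidate regions be reported as well.
HEPTAD_LEUCINES = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def leucine_heptads(seq):
    """Return (1-based start, matched region) for every heptad leucine repeat."""
    return [(m.start() + 1, m.group(1))
            for m in HEPTAD_LEUCINES.finditer(seq.upper())]


# Hypothetical toy sequence, not the protein analysed in this notebook.
toy = "AA" + "LAAAAAA" * 3 + "L" + "GG"
print(leucine_heptads(toy))
```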
3.7 Predicting secondary structure of a protein sequence
Protein secondary structure can be described by the hydrogen-bonding pattern of the
peptide backbone of the protein. The most common secondary structures are alpha helices
and beta sheets. Other extended structures such as the polyproline helix and alpha sheet
are rare in native state proteins but are often hypothesized as important protein folding
intermediates. Tight turns and loose, flexible loops link the more "regular" secondary
structure elements. The random coil is not a true secondary structure, but is the class of
conformations that indicate an absence of regular secondary structure.
Accurate secondary-structure prediction is a key element in the prediction of tertiary
structure, in all but the simplest (homology modeling) cases. At present there are several
secondary-structure prediction methods such as PSIPRED, SAM, PORTER, PROF and SABLE.
PSIPRED is a simple and accurate secondary structure prediction method, incorporating two
feed-forward neural networks which perform an analysis on output obtained from PSI-
BLAST (Position Specific Iterated - BLAST). Using a very stringent cross validation method to
evaluate the method's performance, PSIPRED 3.2 achieves an average Q3 score of 81.6%.
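The Q3 score quoted above is the percentage of residues whose three-state label (helix H, strand E, or coil C) is predicted correctly. A minimal Python sketch of the calculation is given below; the function name and the two example strings are made up for illustration and do not come from a PSIPRED run.

```python
def q3_score(predicted, observed):
    """Percentage of residues whose three-state label is predicted correctly."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(predicted)


# Made-up prediction and reference assignment, for illustration only.
print(q3_score("HHHHCCEE", "HHHCCCEE"))
```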
Method
1. Go to the home page of PSIPRED, http://bioinf.cs.ucl.ac.uk/psipred
2. Choose a prediction method: PSIPRED v3.3 (Predict Secondary Structure)
3. Paste the sequence of interest
4. Enter the email address and a short identifier for the submission in the boxes provided
5. Click on the 'Predict' button and wait for the result
Result
[Screenshot: PSIPRED output]
Result Interpretation
The prediction obtained from PSIPRED shows that the secondary structure of the provided
protein sequence consists of alpha helices and coil regions, with no beta strands. The
confidence of the prediction was good.
3.8 Retrieving 3D structure of a protein from PDB
Protein tertiary structure refers to the three-dimensional structure of a single protein
molecule. The alpha helices and beta pleated sheets are folded into a compact globular
structure. The folding is driven by non-specific hydrophobic interactions (the burial of
hydrophobic residues away from water), but the structure is stable only when the parts of
a protein domain are locked into place by specific tertiary interactions, such as salt
bridges, hydrogen bonds, the tight packing of side chains, and disulfide bonds.
The Protein Data Bank (PDB) is a repository for the three-dimensional structural data of
large biological molecules, such as proteins and nucleic acids. The data, typically
obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and
biochemists from around the world, are freely accessible on the Internet. The file formats
used by the PDB are PDB format files and PDBML (XML) files. The structure files may be
viewed using VMD, MDL Chime, PyMOL, UCSF Chimera, RasMol, Swiss-PDB Viewer, StarBiochem,
Sirius, and VisProt3DS. The PDB is updated weekly.
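PDB format files are plain text with fixed-width columns, so the coordinates can be read with simple string slicing. The sketch below extracts a few fields from ATOM records using the column positions of the PDB format (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-54). The function name is hypothetical and the two sample records contain made-up coordinates; they are not taken from the structure retrieved in this section.

```python
def parse_atom_line(line):
    """Extract selected fixed-width fields from a PDB ATOM or HETATM record."""
    return {
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Made-up sample records laid out in PDB column format, for illustration only.
SAMPLE = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
]

atoms = [parse_atom_line(record) for record in SAMPLE
         if record.startswith(("ATOM", "HETATM"))]
for atom in atoms:
    print(atom["chain"], atom["residue"], atom["atom"], atom["x"], atom["y"], atom["z"])
```

For real work, an established parser is usually preferable to hand-written slicing, since PDB files contain many other record types and edge cases.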
Procedure
1. Go to the PDB home page, http://www.rcsb.org/pdb/home/home.do
2. Enter the PDB ID of the desired protein
3. Click to search for the protein 3D structure
Result
[Screenshot: PDB search result]
Result Interpretation
• The 3D structure of the selected protein (tumor suppressor protein, TP53) is
composed of a monomer containing alpha helices, beta strands and coils.
• The protein contains several annotated regions and motifs, such as:
  - Interaction with HRMT1L2
  - Transcription activation (acidic)
  - Interaction with WWOX
  - DNA-binding region
  - Required for interaction with FBXO42
  - Required for interaction with ZNF385A
  - Interaction with AXIN1
  - Interaction with E4F1
  - Interaction with CARM1
  - Interaction with HIPK2
  - Bipartite nuclear localization signal
  - Nuclear export signal
  - Oligomerization
  - Basic (repression of DNA-binding)
The transcription factor binding sites are also provided in the PDB search result.

Bioinformatics.Practical Notebook

  • 1. PREPARED FOR : Dr. Md. Khademul Islam (Course Teacher) PREPARED By : Naima Thahsin ID : 13376001 Course : BTC 509: Genomics (Bioinformatics) PRACTICAL NOTEBOOK ON: BIOINFORMATICS
  • 2. Page | 2 CHAPTER CONTENTS TOOLS PAGE NO. Chapter-1 DNA sequence analysis 1.1 General 03 1.2 Finding protein coding regions GeneMark 04-07 GENSCAN 07-09 1.3 Prediction of Promoters SoftBerry 10-12 Promoter 2.0 12-14 1.4 Detection of Tandem Repeat Tandem repeat finder 14-18 1.5 Masking interspersed repeats RepeatMasker 18-22 1.6 Finding UTR location UTRScan 22-25 1.7 Searching CpG Islands CpG Islands 25-27 1.8 Predictioning Transcription Factor Binding Sites TFSEARCH 28-31 1.9 Designing PCR Primer and Calculating Standard Properties Primer3Plus 31-35 OligoCalc 36-37 1.10 Restriction Mapping BioTools 38-43 Chapter-2 Phylogenetic relation Analysis 2.1 General 44 2.2 Sequence alignment Clustal Omega 44-47 T-Coffee 48-50 2.3 Constructing phylogenetic tree MEGA 51-57 Chapter-3 Protein Sequence Analysis 3.1 General 58 3.2 Primary Structure Analysis ProtParam 58-63 3.3 Finding cleavage sites PeptideCutter 63-68 3.4 Computing profile produced by any amino acid scale ProtScale 68-74 3.5 Predicting post-translational modifications ScanProsite 74-79 3.6 Predicting functional domain InterProScan 79-81 3.7 Predicting secondary structure PSIPRED 83-86 3.8 Retrieving 3D structure of a protein from PDB Protein Data Bank (PDB) 86-89
  • 3. Page | 3 1. DNA Sequence Analysis 1.1 General A gene is the molecular unit of heredity of a living organism. Genes hold the information to build and maintain an organism's cells and pass genetic traits to offspring. Basically a gene is a sequence of nucleic acids (DNA or, in the case of certain viruses RNA). The vast majority of living organisms encode their genes in long strands of DNA (deoxyribonucleic acid). Most DNA molecules are double-stranded helices, consisting of two long biopolymers made of simpler units called nucleotides—each nucleotide is composed of a nucleobase (guanine, adenine, thymine, and cytosine), recorded using the letters G, A, T, and C, as well as a backbone made of alternating sugars (deoxyribose) and phosphate groups (related to phosphoric acid), with the nucleobases (G, A, T, C) attached to the sugars. The two strands of DNA run in opposite directions to each other and are therefore anti- parallel (a strand running 5'-3' pairs with a complementary strand running 3'-5'). In biological systems, nucleic acids contain information which is used by a living cell to construct specific proteins. Genes that encode proteins are composed of a series of three- nucleotide sequences called codons, which serve as the words in the genetic language. Each codon corresponds to a single amino acid, and there is a specific genetic code by which each possible combination of three bases corresponds to a specific amino acid. However, a significant portion of DNA (more than 98% for humans) is non-coding, meaning that these sections do not serve a function of encoding proteins. All genes have regulatory regions in addition to regions that explicitly code for a protein or RNA product. A regulatory region shared by almost all genes is known as the promoter, which provides a position that is recognized by the transcription machinery when a gene is about to be transcribed and expressed. 
Other possible regulatory regions include enhancers, which can compensate for a weak promoter. Most regulatory regions are "upstream"—that is, before or toward the 5' end of the transcription initiation site. Eukaryotic promoter regions are much more complex and difficult to identify than prokaryotic promoters. In bioinformatics, the term genetic sequence analysis refers to the process of subjecting a DNA sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological databases, and others.
  • 4. Page | 4 1.2 Finding protein coding regions in a DNA sequence Protein coding genes have different structures in microbes and multicellular organisms. In microbes, each protein is encoded by a simple DNA segment-from start to end-called open readings frame (ORF). In animal and plant genes, proteins are encoded in several pieces called exons, separated by noncoding segments called introns. There are many sites which provide tools for finding ORF or coding regions. a) GeneMark GeneMark is a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. GeneMark developed in 1993 was the first gene finding method recognized as an efficient and accurate tool for genome projects. The GeneMark algorithm uses species specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non- coding DNA. Parameters of the models are estimated from training sets of sequences of known type. The major step of the algorithm computes a posteriory probability of a sequence fragment to carry on a genetic code in one of six possible frames (including three frames in complementary DNA strand) or to be "non-coding". Procedure Go to the homepage of GeneMark, http://exon.gatech. edu/genemark Click on “GeneMark” on the right panel. Choose appropriate model from given options (e.g. Models for prokaryotes) Paste the sequence to be checked or the sequences can be uploaded Change the parameters if it is needed Click on “Start GeneMark” button.
  • 6. Page | 6
Result of GeneMark
  • 7. Page | 7
Result Interpretation
The result provided the G+C content (54.78 %) and 3 possible coding sequences (CDS), together with their strand, length, left (start) and right (stop) ends, and the protein sequences translated from them.
b) GENSCAN
GENSCAN is a program that identifies complete gene structures in genomic DNA. It is a generalized hidden Markov model (GHMM)-based program that can be used to predict the location of genes and their exon-intron boundaries in genomic sequences from a variety of organisms. It is a eukaryotic ab initio gene finder that has achieved notable success. The GENSCAN Web server can be found at MIT.
Procedure
1) Go to the GENSCAN home page through the link http://genes.mit.edu/GENSCAN.html
2) Paste the nucleotide sequence of interest.
3) Click on the 'Run GENSCAN' button.
  • 8. Page | 8 GENSCAN result i. Predicted exons:
  • 9. Page | 9
ii. Predicted peptide sequences:
Result Interpretation
The result from GENSCAN provided the following information on the submitted sequence:
- G+C content: 41.22%
- The strand, start position, end position, length, reading frame and exon score of the initial, internal and terminal exons and of the poly-A signal
- The predicted peptide sequence
The suboptimal exon cutoff value was set at 1.00. All exon scores in the result were above this cutoff, so the prediction can be considered good.
1.3 Prediction of Promoters
A promoter is a region of DNA that initiates transcription of a particular gene. Promoters are located near the genes they transcribe, on the same strand and upstream on the DNA. Promoters can be about 100–1000 base pairs long. For transcription to take place, the enzyme that synthesizes RNA, known as RNA polymerase, must attach to the DNA near a gene. Promoters contain specific DNA sequences and response elements that provide a secure initial binding site for RNA polymerase and for proteins called transcription factors that recruit RNA polymerase. These transcription factors recognize specific activator or repressor sequences in particular promoters and thereby regulate gene expression.
  • 10. Page | 10
a) SoftBerry
The SoftBerry program BPROM recognizes bacterial promoters with about 80% accuracy and specificity. In bacteria, the promoter contains two short sequence elements located approximately 10 and 35 nucleotides upstream of the transcription start site (the -10 and -35 boxes).
Procedure
1) Go to the SoftBerry home page, http://www.softberry.com
2) From the left panel select ‘OPERON AND GENE FINDING IN BACTERIA’ and click on ‘BPROM’.
3) Paste the sequence of interest.
4) Click on the ‘PROCESS’ button.
  • 12. Page | 12
Result Interpretation
In the BPROM program the threshold level for promoters is 0.20. The scores for the -10 and -35 boxes were 25 and 41, respectively, both of which were above the threshold level, so the prediction was quite good. The result also gave the positions of the boxes, 154 and 134, respectively, and listed transcription factor binding sites for rpoS17, ihf, glpR, crp and rpoD19, with the sequences of the sites, their positions and scores.
b) Promoter 2.0 Prediction Server
Promoter 2.0 predicts transcription start sites of vertebrate Pol II promoters in DNA sequences. It has been developed as an evolution of simulated transcription factors that interact with sequences in promoter regions. It builds on principles that are common to neural networks and genetic algorithms.
Procedure
1) Go to the Promoter 2.0 home page, http://www.cbs.dtu.dk/services/Promoter/
2) Paste the nucleotide sequence of interest.
3) Click on the 'Submit' button.
  • 14. Page | 14
Result Interpretation
According to the result, the transcription start site was predicted at position 800. For the provided nucleotide sequence, the score was 0.592, which corresponds to a marginal prediction in the Promoter 2.0 score table.
1.4 Detection of Tandem Repeat
Tandem repeats occur in DNA when a pattern of two or more nucleotides is repeated and the repetitions are directly adjacent to each other. When the repeated unit is between 10 and 60 nucleotides long, it is called a minisatellite. Repeats with shorter units are known as microsatellites or short tandem repeats. Tandem repeat patterns vary between individuals and are inherited, so they can be very useful in determining parentage.
Tandem repeat finder
Tandem Repeats Finder is a program to locate and display tandem repeats in DNA sequences. To use the program, the user submits a sequence in FASTA format. The program is very fast, analyzing sequences on the order of 0.5 Mb in just a few seconds. Submitted sequences may be of arbitrary length. Repeats with pattern size in the range from 1 to 2000 bases are detected.
Procedure
1) Go to the Tandem repeat finder home page, http://tandem.bu.edu/trf/trf.html
2) Click on ‘Submit a Sequence for Analysis’.
3) Select the option ‘Basic’ to use default parameters.
4) Choose the option ‘cut and paste sequence’.
5) Paste the sequence into the box provided.
6) Click on the ‘Submit sequence’ button.
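What a tandem repeat looks like at the sequence level can be shown with a small regular-expression sketch. It finds only exact repeats of one fixed unit length, whereas Tandem Repeats Finder uses a statistical model and tolerates mismatches and indels, so this is an illustration rather than a substitute. The function name and its parameters are hypothetical.

```python
import re


def find_tandem_repeats(seq, unit_len=4, min_copies=3):
    """Return (start, end, unit, copies) for exact tandem repeats.

    Positions are 0-based and end-exclusive. Only perfect copies of a
    unit of exactly unit_len bases are reported.
    """
    # ([ACGT]{n}) captures one unit; \1{m,} requires at least m more copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        length = match.end() - match.start()
        hits.append((match.start(), match.end(), match.group(1), length // unit_len))
    return hits
```

For example, a sequence containing four adjacent copies of GATA is reported as a single hit with unit "GATA" and 4 copies.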
  • 18. Page | 18
Result Interpretation
The result indicates that:
- 1 repeat was found in the given nucleotide sequence.
- The repeat spans indices 126-186.
- The consensus size was 4 and the pattern was “GATA”.
- The score was 104, which was quite good.
1.5 Masking interspersed repeats in a sequence
In the mid-1960s scientists discovered that many genomes contain stretches of highly repetitive DNA sequences. These sequences were later characterized and placed into several categories, including Simple Repeats, Tandem Repeats, Segmental Duplications and Interspersed Repeats. Interspersed repetitive DNA is found in all eukaryotic genomes and comprises:
- Processed pseudogenes,
- Retrotranscripts,
- SINEs,
- DNA transposons,
- Retrovirus retrotransposons and
- Non-retrovirus retrotransposons (LINEs)
Currently up to 50% of the human genome is known to be repetitive, and this number is expected to increase as detection methods improve.
RepeatMasker
RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns).
  • 19. Page | 19
Procedure
1) Go to the RepeatMasker home page through the link http://www.repeatmasker.org/
2) Select the option 'RepeatMasking' from the left panel.
3) Paste the nucleotide sequence into the box provided.
4) Click on the 'Submit Sequence' button.
  • 22. Page | 22
Result Interpretation
In the analyzed nucleotide sequence only one interspersed repetitive sequence was found. It was a SINE of 51 base pairs. The repetitive sequence was detected and masked.
1.6 Finding UTR location
In molecular genetics, an untranslated region (UTR) refers to either of two sections (5'-UTR or 3'-UTR), one on each side of a coding sequence on a strand of mRNA. The five prime untranslated region (5'-UTR), also known as the leader sequence or leader RNA, is the region of an mRNA that is directly upstream of the initiation codon. This region is important for the regulation of translation of a transcript. The three prime untranslated region (3'-UTR) is the section of messenger RNA (mRNA) that immediately follows the translation termination codon. The 3'-UTR often contains regulatory regions that influence post-transcriptional gene expression. Regulatory regions within the 3'-UTR can influence polyadenylation, translation efficiency, localization, and stability of the mRNA.
UTRScan
UTRScan is a pattern matcher which searches protein or nucleotide (DNA, RNA, tRNA) sequences in order to find UTR motifs. It is able to find, in a given sequence, motifs that characterize 3'-UTR and 5'-UTR sequences. Such motifs are defined in the UTRSite database, a collection of functional sequence patterns located in the 5'- or 3'-UTR sequences. The UTRSite entries describe the various regulatory elements present in UTR regions whose functional role has been established experimentally. The UTRSite database can prove very useful for automatic annotation of anonymous sequences generated by sequencing projects as well as for finding previously undetected signals in known gene sequences.
  • 23. Page | 23 Procedure Go to UTRScan home page through the link, http://itbtools.ba.itb.cnr.it/ Paste the nucleotide sequence in FASTA format Insert a valid email address Click on the 'Submit' button
  • 24. Page | 24 Result a) List of UTR motifs defined in the UTRSite Database
  • 25. Page | 25
b) Status of provided sequence
Result Interpretation
The UTRScan program found the following UTR motifs in the provided sequence:
- IRES: Iron Responsive Element
- K-B: K-Box
- uORF: Upstream Open Reading Frame
- MBE: Musashi binding element
A total of 9 matches for 4 signals were found in the sequence. The position and sequence of the UTR motifs were also reported by UTRScan.
1.7 Search for CpG Islands
In genetics, CpG islands or CG islands (CGI) are genomic regions of at least 200 bp that contain a high frequency of CpG sites. The "p" in CpG refers to the phosphodiester bond between the cytosine and the guanine, which indicates that the C and the G are next to each other in sequence, regardless of being single- or double-stranded. In a CpG site, both C and G are found on the same strand of DNA or RNA and are connected by a phosphodiester bond.
  • 26. Page | 26
CpG Islands
CpG Islands reports potential CpG island regions using the method described by Gardiner-Garden and Frommer (1987). The calculation is performed using a 200 bp window moving across the sequence at 1 bp intervals. CpG islands are defined as sequence ranges where the Obs/Exp value is greater than 0.6 and the GC content is greater than 50%. The expected number of CpG dimers in a window is calculated as the number of 'C's in the window multiplied by the number of 'G's in the window, divided by the window length. CpG islands are often found in the 5' regions of vertebrate genes, therefore this program can be used to highlight potential genes in genomic sequences.
Procedure
1) Go to the CpG Islands homepage, http://www.bioinformatics.org/sms2/cpg_islands.html
2) Paste the sequence of interest in FASTA format.
3) Click on the 'Submit' button.
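The window calculation described above can be written out as a short sketch. It follows the stated definition (Obs/Exp greater than 0.6, GC content greater than 50%, 200 bp window, 1 bp step); the function names are hypothetical, and the real tool may differ in details such as how overlapping windows are merged into reported ranges.

```python
def cpg_window_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    observed = window.count("CG")   # number of CpG dimers in the window
    expected = c * g / n            # (number of C x number of G) / window length
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio


def cpg_island_windows(seq, window=200, min_gc=0.5, min_ratio=0.6):
    """Return 0-based start positions of windows that meet both criteria."""
    starts = []
    for i in range(len(seq) - window + 1):
        gc, ratio = cpg_window_stats(seq[i:i + window])
        if gc > min_gc and ratio > min_ratio:
            starts.append(i)
    return starts
```

Consecutive start positions in the returned list indicate a candidate island region spanning those windows.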
  • 27. Page | 27 Result Interpretation The range of GC content was found to be 54.50-64 % in the given sequence which was greater than the cutoff value (50%).
  • 28. Page | 28
1.8 Prediction of Transcription Factor Binding Sites
In molecular biology and genetics, a transcription factor (sometimes called a sequence-specific DNA-binding factor) is a protein that binds to specific DNA sequences, thereby controlling the flow (or transcription) of genetic information from DNA to messenger RNA. Transcription factors perform this function alone or with other proteins in a complex, by promoting (as an activator) or blocking (as a repressor) the recruitment of RNA polymerase (the enzyme that performs the transcription of genetic information from DNA to RNA) to specific genes. A defining feature of transcription factors is that they contain one or more DNA-binding domains (DBDs), which attach to specific sequences of DNA adjacent to the genes that they regulate.
TFSEARCH
The TFSEARCH program was written by Yutaka Akiyama (Kyoto University, currently at RWCP) in 1995. TFSEARCH searches for sequence fragments that correlate highly with the TFMATRIX transcription factor binding site profile database in the 'TRANSFAC' databases developed at GBF-Braunschweig, Germany.
Procedure
1) Go to the TFSEARCH home page through the link http://www.cbrc.jp/research/db/TFSEARCH.html
2) Enter any label for the sequence into the top field.
3) Paste the nucleotide sequence in FASTA format into the second field.
4) Set the 'Threshold score' if necessary.
5) Click on the 'Exec' button to submit the query sequence to the server.
  • 31. Page | 31
Result Interpretation
The given sequence was analyzed for transcription factor binding sites. A total of 12 high-scoring sites were found in the sequence. All of them were above the threshold level (85.0). The maximum score was 95.4 and the minimum score was 85.3. The sequence was predicted to be associated with the following transcription factors:
- HSF (heat shock factor 1)
- HSF2 (heat shock factor 2)
- ADR1 (alcohol dehydrogenase gene regulator 1)
- GATA-1 (globin transcription factor 1)
- GATA-2 (globin transcription factor 2)
1.9 Designing PCR Primer and Calculating Standard Properties
The polymerase chain reaction, usually referred to as PCR, is an extremely powerful procedure that allows the amplification of a selected DNA sequence in a genome a million-fold or more in vitro, without the use of living cells during the cloning process. In this technique, the known part of the DNA is used to design two synthetic DNA oligonucleotides, one complementary to each strand of the DNA double helix and lying on opposite sides of the region to be amplified. These oligonucleotides serve as primers for in vitro DNA synthesis, which is catalyzed by DNA polymerase. Primers are required for DNA replication because the enzymes that catalyze this process, DNA polymerases, can only add new nucleotides to an existing strand of DNA.
a) Primer3Plus
The Internet site of the University of Massachusetts Medical School (biotools.umassmed.edu) provides a link to a very complete and easy-to-use tool for primer design, Primer3Plus. Primer3Plus picks primers for PCR reactions according to the conditions specified by the user. It considers factors such as melting temperature, concentrations of various solutions in PCR reactions, primer bending and folding, and many other conditions when attempting to choose the optimal pair of primers for a reaction. All of these conditions are user-specifiable and can vary from reaction to reaction.
  • 32. Page | 32
Procedure
1) Go to the Bio Tools home page through the link http://biotools.umassmed.edu/cgi-bin/primer3plus/primer3plus.cgi
2) Select 'Primer3Plus' from the 'DNA Sequence Analysis' tools.
3) Paste the nucleotide sequence and change the parameters as necessary.
4) Click on the 'Pick Primers' button to submit the query sequence to the server.
  • 35. Page | 35
Result Interpretation
Primer3Plus provided 5 pairs of primers for the given nucleotide sequence. Each pair (left and right primers) has a length, melting temperature and GC content that fit the provided settings. The first pair contains 20 bp long primers:
Left Primer 1: GCCTCCTAATTCGGGCAGAA
Right Primer 1: AAGGATGGGGTCTCCTCCTC
This primer pair is capable of amplifying 590 bp of the nucleotide sequence.
  • 36. Page | 36
b) OligoCalc
OligoCalc is a web-accessible, client-based computational engine for reporting DNA and RNA single-stranded and double-stranded properties, including molecular weight, solution concentration, melting temperature, estimated absorbance coefficients, inter-molecular self-complementarity estimation and intra-molecular hairpin loop formation. OligoCalc has a familiar ‘calculator’ look and feel, making it readily understandable and usable.
Method
1) Go to the OligoCalc home page through the link http://www.basic.northwestern.edu/biotools/oligocalc.html
2) Paste the oligonucleotide sequence of the primer.
3) Click anywhere to get the properties of the given sequence.
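Two of the simplest primer properties, GC content and a rule-of-thumb melting temperature, can be checked by hand with a short sketch. The Wallace rule used here (2 °C per A/T, 4 °C per G/C) is only a rough estimate intended for short oligonucleotides; it is an assumption of this example and not necessarily the formula OligoCalc or Primer3Plus applies, so the values may differ from those reported by the tools. The function names are hypothetical.

```python
def gc_percent(oligo):
    """Return the GC content of an oligonucleotide in percent."""
    oligo = oligo.upper()
    return 100.0 * (oligo.count("G") + oligo.count("C")) / len(oligo)


def wallace_tm(oligo):
    """Rough melting temperature in deg C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


# Primer pair 1 as reported by Primer3Plus in this notebook.
left_primer = "GCCTCCTAATTCGGGCAGAA"
right_primer = "AAGGATGGGGTCTCCTCCTC"
for name, primer in (("left", left_primer), ("right", right_primer)):
    print(name, len(primer), gc_percent(primer), wallace_tm(primer))
```

For this pair the sketch gives 55% and 60% GC, with Wallace estimates of 62 and 64 °C, so the two primers are reasonably well matched.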
  • 38. Page | 38
1.10 Restriction Mapping
A restriction map is a map of known restriction sites within a sequence of DNA. Restriction mapping requires the use of restriction enzymes. Restriction enzymes are enzymes that cut DNA at specific recognition sequences called "sites". They probably evolved as a bacterial defense against DNA bacteriophages. DNA invading a bacterial cell defended by these enzymes will be digested into small, non-functional pieces. The name "restriction enzyme" comes from the enzyme's function of restricting access to the cell. Hundreds of restriction enzymes have been isolated, and each one recognizes its own specific nucleotide sequence. Sites for each restriction enzyme are distributed randomly throughout a particular DNA stretch. Digestion of DNA by restriction enzymes is very reproducible; every time a specific piece of DNA is cut by a specific enzyme, the same pattern of digestion will occur. Restriction enzymes are commercially available and their use has made manipulating DNA very easy.
BioTools-Restriction mapping tool
One approach to constructing a restriction map of a DNA molecule is to sequence the whole molecule and to run the sequence through a computer program that finds the recognition sites present for every known restriction enzyme. ‘BioTools’ provides an application, Restriction mapping tool, which allows the user to supply both a DNA sequence and (optionally) their own file of restriction enzymes or other IUPAC patterns in GCG format for restriction enzyme mapping and analysis, using Harry Mangalam's tacg 4.3 program as the analysis engine.
Procedure
1) Go to the BioTools home page (http://biotools.umassmed.edu/).
2) Select 'Restriction mapping tool' from the panel.
3) Paste the DNA sequence in the 'Sequence Entry' box.
4) Select restriction enzymes from the list.
5) Change other parameters as necessary.
6) Click on the 'Submit Sequence to WWWtacg' button.
  • 43. Page | 43
Result Interpretation
The Restriction Enzyme Tool of the ‘Bio Tools’ server analyzed the given nucleotide sequence and reported 6 hits for the 3 selected restriction enzymes, EcoRI, HindIII and BamHI: 3 hits for BamHI, 2 for HindIII and 1 for EcoRI. These enzymes recognize and cut at the following sites of the nucleotide sequence:
Restriction Enzyme    Site      Position
BamHI                 GGATCC    1240, 1865, 2085
HindIII               AAGCTT    1466, 2115
EcoRI                 GAATTC    2064
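The site search behind such a table can be sketched in a few lines. The example below only locates the recognition sequences of the three enzymes used here; it is not the tacg program, and it reports the 1-based start of each recognition site, which may differ by a few bases from the cut positions a mapping tool prints. The names `SITES` and `find_sites` are hypothetical.

```python
# Recognition sequences of the three enzymes used in this exercise.
# All three are palindromic, so scanning one strand is sufficient.
SITES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC"}


def find_sites(seq, sites=SITES):
    """Return {enzyme: [1-based start positions of its recognition site]}."""
    seq = seq.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        i = seq.find(site)
        while i != -1:
            positions.append(i + 1)       # convert to 1-based coordinates
            i = seq.find(site, i + 1)     # continue after the previous hit
        hits[enzyme] = positions
    return hits
```

The differences between neighbouring positions give the fragment lengths expected after digestion of a linear molecule.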
  • 44. Page | 44
2. Phylogenetic relation Analysis
2.1 General
Phylogenetics is the study of the evolutionary relationships of living organisms using treelike diagrams to represent pedigrees of these organisms. Phylogenetics can be studied in various ways. Molecular data in the form of DNA or protein sequences can provide very useful evolutionary perspectives on existing organisms because, as organisms evolve, the genetic material accumulates mutations over time, causing phenotypic changes. Through comparative analysis of these biological molecules from a number of related organisms, the evolutionary history of the genes or proteins, and even of the organisms, can be revealed. Usually, similarities and divergence among related biological sequences revealed by sequence alignment are rationalized and visualized in the context of phylogenetic trees. Therefore the study of phylogenetic relationships generally involves sequence alignment followed by construction of a phylogenetic tree.
2.2 Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
Objectives:
- To understand the similarities among a group of sequences
- To determine conserved regions
- To understand the evolutionary relationship among related sequences.
For this purpose, 10 protein sequences of the small membrane protein from different species of Coronaviridae were retrieved from NCBI and analyzed with both Clustal Omega and T-Coffee. The comparison between the results from the two tools is given later.
a) Clustal Omega Clustal Omega is a new multiple sequence alignment program that uses seeded guide trees and HMM profile-profile techniques to generate alignments. It produces biologically meaningful multiple sequence alignments of divergent sequences. Evolutionary relationships can be seen via viewing Cladograms or Phylograms. Procedure
  • 45. Page | 45
1) Go to the Clustal Omega home page, http://www.ebi.ac.uk/Tools/msa/clustalo/
2) Paste the retrieved protein sequences in multi-FASTA format.
3) Click on the 'Submit' button to submit the sequences to the server.
Result
  • 47. Page | 47
b) T-Coffee
T-Coffee (Tree-based Consistency Objective Function For alignment Evaluation) is multiple sequence alignment software that uses a progressive approach. It generates a library of pairwise alignments to guide the multiple sequence alignment. It can also combine multiple sequence alignments obtained previously, and in the latest versions can use structural information from PDB files (3D-Coffee). It has advanced features to evaluate the quality of the alignments and some capacity for identifying the occurrence of motifs.
Procedure
1) Go to the T-Coffee home page (http://tcoffee.vital-it.ch/apps/atcoffee/index.html).
2) Select the 'T-Coffee' tool from the panel.
3) Paste the retrieved protein sequences in multi-FASTA format.
4) Click on the 'Submit' button to submit the sequences to the server.
  • 50. Page | 50
Result Interpretation and Comparison between results from Clustal Omega and T-Coffee
The sequence alignment was found to be better with T-Coffee than with Clustal Omega. Along with the aligned sequences, T-Coffee also provides an alignment score for each input sequence. For the given sequences the following scores were found:
gi|530341189|gb : 47
gi|530802146|gb : 43
gi|148728344|gb : 46
gi|530802593|gb : 39
gi|56807328|ref : 42
gi|126030129|re : 42
gi|211907043|gb : 32
gi|212681391|re : 32
gi|187251957|re : 32
gi|33304216|gb| : 44
cons : 44
T-Coffee exhibited 2 conserved regions, whereas 1 was found with Clustal Omega. The number of regions with matches was also greater with T-Coffee than with Clustal Omega. However, the advantage of Clustal Omega is that it provides a tool for building a phylogenetic tree, which is available if Java is installed.
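Counting conserved regions by eye can be cross-checked with a small sketch that works on the aligned rows exported from either tool. It treats a column as conserved only when every sequence carries the same residue and no gap, which is stricter than the similarity symbols shown in the alignment viewers. The function names and the `min_len` parameter are hypothetical.

```python
def conserved_columns(aligned):
    """Return 0-based indices of gap-free columns where all rows agree."""
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned sequences must have equal length")
    return [i for i in range(length)
            if aligned[0][i] != "-"
            and all(row[i] == aligned[0][i] for row in aligned)]


def conserved_runs(columns, min_len=3):
    """Group consecutive conserved columns into (start, end) runs, end inclusive."""
    runs, start, prev = [], None, None
    for col in columns:
        if start is None:
            start = prev = col
        elif col == prev + 1:
            prev = col
        else:
            if prev - start + 1 >= min_len:
                runs.append((start, prev))
            start = prev = col
    if start is not None and prev - start + 1 >= min_len:
        runs.append((start, prev))
    return runs
```

Applying both functions to the Clustal Omega and T-Coffee alignments of the same sequences gives a like-for-like count of fully conserved stretches.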
  • 51. Page | 51
2.3 Constructing phylogenetic tree
A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships (the phylogeny) among various biological species or other entities, based upon similarities and differences in their physical and/or genetic characteristics. The taxa joined together in the tree are implied to have descended from a common ancestor.
MEGA
MEGA (Molecular Evolutionary Genetics Analysis; Windows v5.2.2) is software that provides tools for both multiple sequence alignment and construction of phylogenetic trees.
Procedure
a) MEGA was downloaded from http://www.megasoftware.net/ and installed in the Windows 7 environment.
b) 10 protein sequences of the nucleocapsid protein from different species of Coronaviridae were retrieved from NCBI.
c) The multi-FASTA file containing the protein sequences was run in MEGA. The flowchart of the procedure is as follows:
1) Open MEGA 5.2.2.
2) Open a file in FASTA format.
3) Select the option 'Align'.
4) Select 'Muscle' from the upper panel to align the protein sequences.
5) Keep the default parameters in the settings window and click on 'Compute'.
6) Save the session in MAS format.
7) Click on the 'Phylogeny' option in the upper panel and select 'Maximum Likelihood'.
8) Open the file containing the protein sequences saved in 'mas' format.
9) Click on the 'Compute' button.
  • 56. Page | 56
Result
Result Interpretation
According to the inferred phylogenetic tree based on protein sequences from different species of Coronaviridae:
- Two broad subgroups (B and C) have descended from a common ancestor A.
- In subgroup B, Bulbul coronavirus HKU11 and Munia coronavirus HKU13 are closely related to each other and, more distantly, to Beluga whale coronavirus SW1; all three are descended from the ancestor F. The group F is related to another group E that includes two closely related virus species, Human coronavirus OC43 and Human coronavirus HKU1. The groups F and E are descendants of D, which is descended from B. The group B also gives rise to an outgroup, Pipistrellus bat coronavirus HKU5, which is closer to group E than to F.
  • 57. Page | 57
- In subgroup C, Human coronavirus 229E and Human coronavirus NL63 are closely related to each other and, more distantly, to Porcine epidemic diarrhea virus; all three are descended from the ancestor H. The group H is descended from the ancestor C, which also gives rise to an outgroup, Transmissible gastroenteritis virus.
  • 58. Page | 58
3. Protein sequence Analysis
3.1. General
Proteins are among the fundamental units of all living cells and have a wide range of functions in all living beings. Important functions such as DNA replication, catalysis of metabolic reactions and transport of molecules from one location to another are performed with the help of proteins. The building blocks of proteins are amino acids. Amino acids contain an amine (-NH2) and a carboxylic acid (-COOH) functional group as well as a side chain (R group) which is specific to each amino acid. There are 20 standard amino acids in the human body, which differ in their R groups. In proteins, the amino acids are linked to each other by peptide bonds. A peptide bond is formed when the carboxyl group of one amino acid is linked to the amino group of another through a covalent bond. Proteins differ from one another in their structure, primarily in their sequence of amino acids. The structure describes the different levels of organization of a protein molecule and is classified into primary, secondary, tertiary, and quaternary. The linear sequence of amino acids in the polypeptide chain is the primary structure of a protein. Intermolecular and intramolecular hydrogen bonding between the amide groups of the primary structure forms the secondary structure. Alpha helices and beta sheets are the two important secondary structures in proteins. The three-dimensional structure of a single protein molecule is the tertiary structure. The quaternary structure is formed by several protein molecules or polypeptide chains.
3.2. Primary Structure Analysis of a Protein
Different tools are available through the ExPASy server to analyze a protein sequence. ExPASy is the SIB Bioinformatics Resource Portal.
It provides access to several scientific databases and software tools in many areas of the life sciences, including proteomics, genomics, phylogeny, systems biology, population genetics and transcriptomics. ProtParam is one of the protein analysis tools available on the ExPASy server and can be accessed through the link http://www.expasy.org/tools/protparam.html. It is used for calculating various physicochemical parameters of a provided protein. The protein sequence is the only input needed to calculate these parameters. In ProtParam, the protein can be specified as:
- UniProtKB/Swiss-Prot accession number,
  • 59. Page | 59
- UniProtKB/TrEMBL accession number,
- ID or
- Amino acid sequence.
The parameters computed by ProtParam are molecular weight, amino acid composition, extinction coefficient, estimated half-life, theoretical pI, grand average of hydropathicity (GRAVY), aliphatic index and instability index.
Objectives
- To compute the various physical and chemical parameters of a protein.
- To perform primary structure analysis of proteins.
Procedure
1) Go to the ProtParam home page, http://www.expasy.org/tools/protparam.html
2) Paste the FASTA sequence of the protein of interest.
3) Click on the ‘Compute parameters’ button.
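Two of the parameters listed above, amino acid composition and aliphatic index, are simple enough to reproduce by hand. The sketch below uses the commonly cited aliphatic index formula, X(Ala) + 2.9·X(Val) + 3.9·(X(Ile) + X(Leu)), where X is the mole percent of each residue; that this matches ProtParam's output to the last decimal is an assumption. The function names are hypothetical.

```python
from collections import Counter


def mole_percent(protein):
    """Return {residue: mole percent} for a one-letter protein sequence."""
    protein = protein.upper()
    counts = Counter(protein)
    return {aa: 100.0 * n / len(protein) for aa, n in counts.items()}


def aliphatic_index(protein):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    x = mole_percent(protein)
    return (x.get("A", 0.0)
            + 2.9 * x.get("V", 0.0)
            + 3.9 * (x.get("I", 0.0) + x.get("L", 0.0)))
```

A higher value means that a larger relative volume of the protein is occupied by aliphatic side chains, which is usually read as a sign of greater thermostability.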
  • 60. Page | 60
- ProtParam home page
- Paste the FASTA sequence of protein
  • 63. Page | 63
Result Interpretation
From the result of ProtParam we found that:
- The estimated half-life is 30 hours, which indicates that half of the amount of protein in a cell disappears 30 hours after its synthesis in the cell.
- The instability index of the analyzed protein is 37.87, which is less than the cutoff value (40), so the protein is considered stable.
- According to the computed aliphatic index, a large relative volume of the protein is occupied by aliphatic side chains.
- The grand average of hydropathicity of the protein is 0.327. The positive score indicates that the protein is hydrophobic overall.
3.3 Finding cleavage sites in a given protein sequence
PeptideCutter searches a protein sequence from the SWISS-PROT and/or TrEMBL databases, or a user-entered protein sequence, for protease cleavage sites. Single proteases and chemicals, a selection, or the whole list of proteases and chemicals can be used. Most of the cleavage rules for individual enzymes were deduced from specificity data summed up by Keil (1992). Different forms of output are available: tables of cleavage sites grouped either alphabetically by enzyme name or sequentially by amino acid number. A third output option is a map of cleavage sites. The sequence and the cleavage sites mapped onto it are grouped in blocks, the size of which can be chosen by the user to provide a convenient form of print-out.
Method
1) Go to the PeptideCutter home page, http://web.expasy.org/peptide_cutter/
2) Paste the FASTA sequence of the protein of interest.
3) Select enzymes and chemicals if necessary.
4) Click on the ‘Perform’ button.
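How a cleavage rule is applied to a sequence can be illustrated with the textbook rule for trypsin: cut on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P). This simplified rule is an assumption of the sketch; PeptideCutter applies a more detailed model with further exceptions, so its output may contain fewer or different sites. The function names are hypothetical.

```python
def trypsin_sites(protein):
    """Return 1-based positions of residues after which cleavage is predicted.

    Simplified rule: cleave after K or R, except when followed by P.
    The last residue is never reported because nothing follows it.
    """
    protein = protein.upper()
    return [i + 1 for i in range(len(protein) - 1)
            if protein[i] in "KR" and protein[i + 1] != "P"]


def digest(protein, sites):
    """Split the protein into fragments at the given 1-based cleavage positions."""
    fragments, prev = [], 0
    for site in sites:
        fragments.append(protein[prev:site])
        prev = site
    fragments.append(protein[prev:])
    return fragments
```

Calling `digest(sequence, trypsin_sites(sequence))` lists the peptides expected from a complete digest, which can then be compared with the PeptideCutter table.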
  • 64. Page | 64 PeptideCutter home page Selection of parameters
  • 67. Page | 67 Map of cleavage sites The cleavage sites for a single enzyme, e.g. Trypsin, mapped onto the entered protein sequence are shown below:
  • 68. Page | 68
Discussion
With the bioinformatic tool PeptideCutter we can predict the potential sites in a given protein sequence that are cleaved by proteases or chemicals. If we know the cleavage sites of a protein, we can use an enzyme to cut the protein in specific ways. This can be useful if we are interested in carrying out experiments on a portion of our protein. PeptideCutter can also help us in the following situations:
- If we want to separate the domains in our protein
- If we want to identify potential post-translational modifications by mass spectrometry
- If we want to remove a tag protein when expressing a fusion protein
- If we want to make sure that the protein we are cloning is not sensitive to some endogenous proteases
3.4 Computing profile produced by any amino acid scale
ProtScale allows the user to compute and represent (in the form of a two-dimensional plot) the profile produced by any amino acid scale on a selected protein. An amino acid scale is defined by a numerical value assigned to each type of amino acid. The most frequently used scales are hydrophobicity scales, most of which were derived from experimental studies on partitioning of peptides in apolar and polar solvents, with the goal of predicting membrane-spanning segments that are highly hydrophobic, and secondary structure conformational parameter scales. In addition, many other scales exist which are based on different chemical and physical properties of the amino acids. ProtScale can be used with 50 predefined scales taken from the literature. The scale values for the 20 amino acids, as well as a literature reference, are provided on ExPASy for
  • 69. Page | 69
each of these scales. To generate data for a plot, the protein sequence is scanned with a sliding window of a given size. At each position, the mean scale value of the amino acids within the window is calculated, and that value is plotted for the midpoint of the window. We can set several parameters that control the computation of a scale profile, such as the window size, the weight variation model, the window edge relative weight value, and scale normalization.
Objective
- To use the hydrophobicity scale to identify hydrophobic segments within the protein sequence.
- To predict transmembrane segments in the given protein.
Method
1) Go to the ProtScale home page, http://web.expasy.org/protscale/
2) Paste the FASTA sequence of the desired protein.
3) Choose an amino acid scale from the list (e.g., Hphob. / Kyte & Doolittle).
4) Set the window size to 19.
5) Normalize the scale, if necessary.
6) Click on the ‘Submit’ button.
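The sliding-window calculation described above can be sketched directly. The example uses the published Kyte & Doolittle hydropathy values, a window of 19 and uniform weights; ProtScale's optional edge weighting and normalization are not reproduced, and the function names are hypothetical. The default threshold of 1.6 is the value commonly recommended for this scale with a window of 19.

```python
# Kyte & Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(protein, window=19, scale=KYTE_DOOLITTLE):
    """Return [(1-based window midpoint, mean scale value)] with uniform weights."""
    values = [scale[aa] for aa in protein.upper()]
    half = window // 2
    return [(i + half + 1, sum(values[i:i + window]) / window)
            for i in range(len(values) - window + 1)]


def positions_above(profile, threshold=1.6):
    """Return the midpoints whose window mean exceeds the threshold."""
    return [pos for pos, score in profile if score > threshold]
```

Runs of consecutive positions returned by `positions_above` correspond to the peaks above the threshold line in the ProtScale plot. The mean of the scale values over the whole sequence, rather than over a window, is the GRAVY value reported by ProtParam.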
  • 70. Page | 70 ProtScale home page Selection of parameters
  • 72. Page | 72 Result Using Hphob. / Kyte & Doolittle scale
  • 73. Page | 73 With normalized scale
  • 74. Page | 74 Discussion Hydrophobicity scales are values that define the relative hydrophobicity of amino acid residues. The more positive the value, the more hydrophobic the amino acids located in that region of the protein; strongly hydrophobic segments are characteristic of transmembrane proteins. The desired protein sequence was analyzed using the Kyte & Doolittle (hydrophobicity) scale. The recommended threshold value when using Kyte & Doolittle is 1.6. In the result, four regions of the given protein sequence were found above the threshold level. The highest peak was found at the N-terminus of the sequence, which indicates the presence of a transmembrane segment and predicts that the protein is secreted (a hydrophobic N-terminal stretch is typical of a signal peptide). 3.5 Predicting post-translational modifications in protein Proteins often need to be modified before they become active in the cell. These changes are called post-translational modifications. They may involve adding sugars, modifying amino acids, or removing pieces of the newly synthesized protein. If we are studying a new protein, we may want to know about such modifications. They also matter if we want to clone and express a human protein in bacteria because, in order to be active, the protein may require some post-translational modifications that the bacterium itself cannot make. PROSITE is a database that contains a list of short sequence motifs (some of them named patterns) that experiments have associated with particular biological properties. Many of these patterns are associated with post-translational modifications. On the ExPASy server (www.expasy.org), we can compare our protein sequence with the collection of patterns in PROSITE and find out which modifications our protein is likely to undergo. Objective (1) Scan our protein of interest for matches against the PROSITE collection of motifs. (2) Find out the post-translational modifications in that protein.
Methods  Go to ScanProsite home page, http://prosite.expasy.org/sca nprosite/ Give UniProtKB accession number Select ‘Exclude profiles from the scan’ Click on the ‘Start the scan’ button
  • 75. Page | 75 ScanProsite home page Selecting Parameters
  • 79. Page | 79 Result Interpretation & Discussion The ScanProsite result shows 6 hits (by 3 distinct patterns) for 3 types of short sequence motifs which are predicted to be associated with post-translational modifications. The sequence motifs are: Multicopper oxidase 1; FA58C 1 (Coagulation factor 5/8 type C domain); FA58C 2. The PDB structure viewer shows the 3D structure of FA58C 1 and FA58C 2 associated with the A chain of the protein structure. 3.6 Predicting functional domain in protein sequence InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterize them. The contents of InterPro are based around diagnostic signatures and the proteins that they significantly match. The signatures consist of models (simple types, such as regular expressions, or more complex ones, such as hidden Markov models) which describe protein families, domains or sites. Models are built from the amino acid sequences of known families or domains and are subsequently used to search unknown sequences (such as those arising from novel genome sequencing) in order to classify them. InterProScan is a bioinformatics tool that is available in InterPro via a web server. It provides a one-stop shop for automated sequence analysis of both protein and nucleic acid sequences. It lets the researcher identify both structural and functional regions of interest and quickly characterize a new sequence with considerable confidence. Objective A protein domain is a conserved part of a given protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain. Domains vary in length from about 25 to about 500 amino acids. Here our objective is to find the functional domains in a given protein sequence.
  • 80. Page | 80 Method (1) Go to the InterProScan home page through the link www.ebi.ac.uk/InterProScan/ (2) Paste the sequence of the protein of interest (3) Click on the 'Submit' button. InterProScan home page
  • 81. Page | 81 Result Discussion A number of algorithms (14) available in the InterProScan tool were selected to find functional domains in the provided protein sequence. According to PRINTS, the protein sequence contains the LEUZIPPRFOS domain, a 5-element fingerprint that provides a signature for the leucine zipper and DNA-binding domains characteristic of the fos oncogenes and fos-related proteins. PFAM, SMART, PROSITE and PROFILE also confirmed the presence of a leucine zipper domain in the protein sequence. (1) The DNA-binding region comprises a number of basic amino acids such as arginine and lysine. (2) The 'leucine zipper' is a structure that is believed to mediate the function of several eukaryotic gene regulatory proteins. The zipper consists of a periodic repetition of leucine residues at every seventh position, and regions containing them appear to span 8 turns of alpha-helix. The leucine side chains that extend from one helix interact with those from a similar helix, hence facilitating dimerisation in the form of a coiled-coil. Proteins containing this domain are transcription factors.
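The "leucine at every seventh position" rule mentioned above can be checked directly on a sequence. The short Python sketch below is only an illustration and is far cruder than the signature methods used by InterProScan: it assumes that four leucines with heptad spacing (the PROSITE-style pattern L-x(6)-L-x(6)-L-x(6)-L) are enough to flag a candidate region, and the function name `leucine_heptad_hits` is made up for this example.

```python
import re

# Four leucines, each separated by six arbitrary residues (heptad spacing).
# The lookahead lets overlapping candidate regions be reported as well.
HEPTAD_LEUCINES = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def leucine_heptad_hits(sequence):
    """Return 1-based start positions where leucines recur at every seventh residue."""
    return [m.start() + 1 for m in HEPTAD_LEUCINES.finditer(sequence.upper())]
```

A hit is only suggestive, since this spacing also occurs by chance in proteins that do not form a zipper; that is why agreement between several InterProScan methods is more convincing than a single pattern match.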
  • 82. Page | 82 3.7 Predicting secondary structure of a protein sequence Protein secondary structure can be described by the hydrogen-bonding pattern of the peptide backbone of the protein. The most common secondary structures are alpha helices and beta sheets. Other extended structures such as the polyproline helix and alpha sheet are rare in native-state proteins but are often hypothesized to be important protein folding intermediates. Tight turns and loose, flexible loops link the more "regular" secondary structure elements. The random coil is not a true secondary structure, but is the class of conformations that indicate an absence of regular secondary structure. Accurate secondary-structure prediction is a key element in the prediction of tertiary structure, in all but the simplest (homology modeling) cases. At present there are several secondary-structure prediction methods such as PSIPRED, SAM, PORTER, PROF and SABLE. PSIPRED is a simple and accurate secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST (Position-Specific Iterated BLAST). Using a very stringent cross-validation method to evaluate the method's performance, PSIPRED 3.2 achieves an average Q3 score of 81.6%. Method (1) Go to the home page of PSIPRED, http://bioinf.cs.ucl.ac.uk/psipred (2) Choose a prediction method, PSIPRED v3.3 (Predict Secondary Structure) (3) Paste the sequence of interest (4) Write the email address and a short identifier for the submission in the boxes provided (5) Click on the 'Predict' button and wait for the result
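The Q3 score quoted above is the percentage of residues whose predicted state (helix H, strand E or coil C) agrees with the observed state. A minimal Python sketch of that calculation follows; the function name `q3_score` and the example strings are made up for illustration and are not PSIPRED output.

```python
def q3_score(predicted, observed):
    """Percentage of positions where the predicted state equals the observed state.

    Both arguments are strings over the three states H (helix),
    E (strand) and C (coil), one character per residue.
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For instance, `q3_score("HHHCCEEE", "HHHCCCEE")` compares eight residues, seven of which agree, so the score is 87.5.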
  • 86. Page | 86 Result Interpretation The prediction result obtained from PSIPRED shows that the secondary structure of the provided protein sequence consists of alpha helices and coil structures, with no beta sheet. The confidence of the prediction was quite good. 3.8 Retrieving 3D structure of a protein from PDB Protein tertiary structure refers to the three-dimensional structure of a single protein molecule (one polypeptide chain). The alpha helices and beta pleated sheets are folded into a compact globular structure. The folding is driven by non-specific hydrophobic interactions (the burial of hydrophobic residues from water), but the structure is stable only when the parts of a protein domain are locked into place by specific tertiary interactions, such as salt bridges, hydrogen bonds, the tight packing of side chains, and disulfide bonds. The Protein Data Bank (PDB) is a repository for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet. The file formats used by the PDB are PDB format files and PDBML (XML) files. The structure files may be viewed using VMD, MDL Chime, PyMOL, UCSF Chimera, RasMol, Swiss-PdbViewer, StarBiochem, Sirius, and VisProt3DS. The PDB database is updated weekly. Procedure (1) Go to the PDB home page, http://www.rcsb.org/pdb/home/home.do (2) Write the PDB ID of the desired protein (3) Click to search for the protein 3D structure
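A PDB format file is plain text in which each atom occupies one fixed-width line, so the coordinates can be read by column position. The Python sketch below shows how one ATOM line could be parsed; it is a simplified illustration that reads only a few of the fields, the function name `parse_atom_record` is made up, and the sample record is invented for demonstration rather than taken from a real entry.

```python
def parse_atom_record(line):
    """Parse a few fields from one fixed-width ATOM/HETATM line of a PDB format file."""
    if line[:6].strip() not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),            # columns 7-11
        "atom_name": line[12:16].strip(),     # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],                 # column 22
        "residue_number": int(line[22:26]),   # columns 23-26
        "x": float(line[30:38]),              # columns 31-38
        "y": float(line[38:46]),              # columns 39-46
        "z": float(line[46:54]),              # columns 47-54
    }


# An invented record, assembled column by column for illustration only.
SAMPLE = (
    "ATOM  "      # columns 1-6: record name
    + "    1"     # 7-11: atom serial number
    + " "         # 12: blank
    + " N  "      # 13-16: atom name
    + " "         # 17: alternate location indicator
    + "MET"       # 18-20: residue name
    + " "         # 21: blank
    + "A"         # 22: chain identifier
    + "   1"      # 23-26: residue sequence number
    + "    "      # 27-30: insertion code and blanks
    + "  27.340"  # 31-38: x coordinate
    + "  24.430"  # 39-46: y coordinate
    + "   2.614"  # 47-54: z coordinate
)
```

`parse_atom_record(SAMPLE)` returns the backbone nitrogen of residue MET 1 in chain A with its three coordinates. For real work an established parser is a safer choice than hand-written slicing, because real files contain many other record types and optional fields.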
  • 89. Page | 89 Result Interpretation (1) The 3D structure of the selected protein (tumor suppressor protein, TP53) is composed of a monomer containing alpha helices, beta strands and coils. (2) The protein contains several motifs, such as: Interaction with HRMT1L2; Transcription activation (acidic); Interaction with WWOX; DNA-binding region; Required for interaction with FBXO42; Required for interaction with ZNF385A; Interaction with AXIN1; Interaction with E4F1; Interaction with CARM1; Interaction with HIPK2; Bipartite nuclear localization signal; Nuclear export signal; Oligomerization; Basic (repression of DNA-binding). The transcription factor binding sites are also provided in the PDB search result.