2. Outline
Introduction to bioinformatics
Biological databases
Sequence alignment and their algorithms
Structural prediction
Web-based tools
Stand-alone software
3. Introduction to bioinformatics
What is bioinformatics?
Bioinformatics is an interdisciplinary research area at the interface between
computer science and biological science.
4. Introduction to bioinformatics
What are the differences between bioinformatics and informatics?
What are the differences between bioinformatics and computational biology?
What is an algorithm?
7. Biological databases
Database
A database is a computerized archive used to store and organize data in such a
way that information can be retrieved easily via a variety of search criteria
Entry
Each record, or entry, contains a number of fields that hold the actual data items
Value
A particular piece of information held in a field
Making a query
To retrieve a particular record from the database, a user can specify a value to
be found in a particular field and expect the computer to retrieve the whole
data record
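The query described above can be sketched in a few lines. This is only an illustration: the records, field names, and the `query` helper are hypothetical and do not correspond to any real database interface.

```python
# Hypothetical entries: each record is a set of fields holding values.
records = [
    {"accession": "X0001", "organism": "Homo sapiens", "length": 1200},
    {"accession": "X0002", "organism": "Mus musculus", "length": 950},
    {"accession": "X0003", "organism": "Homo sapiens", "length": 430},
]


def query(entries, field, value):
    """Return every whole record whose given field holds the given value."""
    return [entry for entry in entries if entry.get(field) == value]


# Specify a value for one field and retrieve the complete records.
for hit in query(records, "organism", "Homo sapiens"):
    print(hit["accession"], hit["length"])
```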
8. Biological databases
Primary databases
GenBank (NCBI)
EMBL
DDBJ
www.ncbi.nlm.nih.gov
www.ebi.ac.uk/embl/index.html
www.ddbj.nig.ac.jp
Secondary databases
ExPASy
PIR
Swiss-Prot
http://web.expasy.org
http://pir.georgetown.edu/pirwww/pirhome3.shtml
www.ebi.ac.uk/swissprot/access.html
10. Biological databases
Pitfalls of biological databases
The causes of redundancy include: repeated submission of identical or
overlapping sequences by the same or different authors, revision of
annotations, and dumping of expressed sequence tag (EST) data
Redundant sequences
Non-redundant sequences (RefSeq)
19. Sequence alignment and their
algorithms
Pairwise sequence alignment
Pairwise sequence alignment is the process of aligning two sequences and is
the basis of database similarity searching and multiple sequence alignment
Sequence similarity versus sequence homology
When two sequences are descended from a common evolutionary origin, they
are said to have a homologous relationship or share homology. A related but
different term is sequence similarity, which is the percentage of aligned
residues that are similar in physicochemical properties such as size, charge,
and hydrophobicity
Sequence similarity versus sequence identity
In a protein sequence alignment, sequence identity refers to the percentage of
matches of the same amino acid residues between two aligned sequences.
Similarity refers to the percentage of aligned residues that have similar
physicochemical characteristics and can be more readily substituted for each
other
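The distinction between identity and similarity can be illustrated with a short sketch. The residue groups below are an assumption chosen for illustration (real tools usually decide similarity from a substitution matrix), and the percentages are taken over the full alignment length, which is only one of several conventions in use.

```python
# Illustrative groups of physicochemically similar residues (an assumption).
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"),
                  set("DE"), set("ST"), set("NQ")]


def percent_identity_similarity(aligned_a, aligned_b):
    """Return (identity %, similarity %) for two already-aligned sequences.

    Gaps are written as '-'. Identical residues also count as similar.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # a gap is neither identical nor similar
        if a == b:
            identical += 1
            similar += 1
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(aligned_a)
    return 100 * identical / length, 100 * similar / length


identity, similarity = percent_identity_similarity("KLVE-S", "RLIEAT")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```

In this toy alignment two of six columns are identical, while five of six are similar, so similarity is always at least as high as identity.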
20. Sequence alignment and their
algorithms
Sequence alignment strategies
Global alignment
In global alignment, two sequences to be aligned are assumed to be generally
similar over their entire length. Alignment is carried out from beginning to end
of both sequences to find the best possible alignment across the entire length
between the two sequences
Local alignment
Local alignment does not assume that the two sequences in question have
similarity over the entire length. It only finds local regions with the highest
level of similarity between the two sequences and aligns these regions without
regard for the alignment of the rest of the sequence regions
22. Sequence alignment and their
algorithms
Linear gap penalty: the cost of creating a gap and of extending it is the same
W(I) = g × I, where g is the cost per gap position and I is the gap length
Affine gap penalty: creating a gap and extending it have different costs
W(I) = g_open + g_ext × (I - 1), where g_open > g_ext
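A small worked example may make the two gap penalty schemes concrete. The penalty values are arbitrary assumptions, and the sketch follows the usual convention that opening a gap costs more than extending one.

```python
def linear_gap_penalty(length, g):
    """Linear scheme: every gap position costs the same amount g."""
    return g * length


def affine_gap_penalty(length, g_open, g_ext):
    """Affine scheme: the first gap position costs g_open, each further one g_ext."""
    if length == 0:
        return 0
    return g_open + g_ext * (length - 1)


# One gap of length 5 under each scheme (illustrative penalty values).
print(linear_gap_penalty(5, g=2))                  # 2 * 5 = 10
print(affine_gap_penalty(5, g_open=10, g_ext=1))   # 10 + 1 * 4 = 14

# Under the affine scheme, five separate one-residue gaps cost far more
# than a single five-residue gap, so long contiguous gaps are favored.
print(5 * affine_gap_penalty(1, g_open=10, g_ext=1))  # 5 * 10 = 50
```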
23. Sequence alignment and their
algorithms
Alignment algorithms and methods
The dot matrix method
The word method
The dynamic programming method
24. Sequence alignment and their
algorithms
Alignment Algorithms
The dot matrix method
The most basic sequence alignment method is the dot matrix method, also
known as the dot plot method
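As a rough illustration, a dot matrix places one sequence along each axis and marks every cell where the two residues match, so that regions of similarity show up as diagonal runs. The sketch below uses no window or threshold filtering, and the function names are hypothetical.

```python
def dot_matrix(seq_a, seq_b):
    """Return a grid where cell [i][j] is True when seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Draw the dot matrix as text: '*' for a match, '.' otherwise."""
    lines = ["  " + " ".join(seq_b)]
    for a, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATCACA"))
```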
25. Sequence alignment and their
algorithms
Alignment Algorithms
The word method
It works by finding short stretches of identical or nearly identical letters in
two sequences. These short strings of characters are called words, which
are similar to the windows used in the dot matrix method
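The word-finding step can be sketched as follows. This minimal version only looks for identical words of a fixed length `k` (nearly identical words are not handled), and the function names are hypothetical.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every word of length k in seq to the positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def shared_words(query, subject, k=3):
    """List (query_pos, subject_pos, word) for each identical word of length k."""
    index = index_words(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        for j in index.get(word, []):
            hits.append((i, j, word))
    return hits


print(shared_words("ACGTAC", "TTACGT", k=3))
```

Indexing one sequence first means each query word is found by a dictionary lookup rather than by scanning the whole subject sequence.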
27. Sequence alignment and their
algorithms
Alignment Algorithms
The dynamic programming method
Dynamic programming is a method that determines optimal alignment by
matching two sequences for all possible pairs of characters between the
two sequences
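A compact sketch of dynamic programming for global alignment is shown below, using the Needleman-Wunsch recurrence with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) are assumptions for illustration, and only one of possibly several optimal alignments is returned.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal (pair), up (gap), left (gap).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```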
29. Sequence alignment and their
algorithms
Alignment Algorithms
The dynamic programming method
Global alignment
The classical global pairwise alignment algorithm using dynamic
programming is the Needleman–Wunsch algorithm. In this algorithm, an
optimal alignment is obtained over the entire lengths of the two sequences
Local alignment
The first application of dynamic programming in local alignment is the
Smith–Waterman algorithm. In this algorithm, any cell of the scoring
matrix whose score would be negative is reset to zero, so no negative
scores appear in the matrix and an alignment can start and end anywhere
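A minimal sketch of the Smith-Waterman scoring step follows. The match, mismatch, and gap values are assumptions; the key step is that a cell is never allowed to drop below zero, which lets a local alignment start afresh at any position. Only the best local score is returned (no traceback).

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamp at zero: a negative running score is discarded.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared core "ACG" is found even though the flanking regions differ.
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```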
30. Sequence alignment and their
algorithms
substitution matrix
PAM matrices (point accepted mutation)
The PAM matrices were derived from the evolutionary divergence between
sequences of the same cluster. One PAM unit is defined as a change in 1% of
the amino acid positions. Because very closely related homologs were used,
the observed mutations were not expected to significantly change the common
function of the proteins
32. Sequence alignment and their
algorithms
substitution matrix
BLOSUM matrices
The blocks amino acid substitution matrices (BLOSUM) are a series of matrices,
all of which are derived from direct observation of every possible amino acid
substitution in multiple sequence alignments
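Substitution scores of this kind are commonly expressed as log-odds values: the observed frequency of a residue pair is compared with the frequency expected by chance. The sketch below assumes half-bit units (a scale factor of 2 with a base-2 logarithm), and the frequencies are made-up numbers, not values from any real matrix.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Rounded log-odds score for one residue pair.

    observed_freq: how often the pair is seen in aligned columns.
    expected_freq: how often it would be seen by chance alone.
    """
    return round(scale * math.log2(observed_freq / expected_freq))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.02, 0.005))    # seen 4x more often than chance: positive
print(log_odds_score(0.0025, 0.01))   # seen 4x less often than chance: negative
```

A positive score therefore marks a substitution that occurs more often than chance would predict, and a negative score one that is disfavored.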
34. Sequence alignment and their
algorithms
Which matrices should be used, and when?
Matrix      Best use                                                Similarity (%)
PAM40       Short alignments that are highly similar                70-90
PAM160      Detecting members of a protein family                   50-60
PAM250      Longer alignments of more divergent sequences           approx. 30
BLOSUM90    Short alignments that are highly similar                70-90
BLOSUM80    Detecting members of a protein family                   50-60
BLOSUM62    Most effective in finding all potential similarities    30-40
BLOSUM30    Longer alignments of more divergent sequences           <30
Similarity: the range of similarities that the matrix is best able to detect.
35. Comparison
• PAM is based on an evolutionary model using phylogenetic trees
• BLOSUM assumes no evolutionary model; it is derived from conserved “blocks” of proteins
36. Sequence alignment and their
algorithms
Heuristic database searching
The heuristic algorithms perform faster searches because they examine only a
fraction of the possible alignments examined in regular dynamic programming
BLAST (basic local alignment search tool)
BLAST uses heuristics to align a query sequence with all sequences in a
database
38. Sequence alignment and their
algorithms
6. Finishing
Figure labels: negative scores from the scoring matrix; threshold for stopping extension (X); minimum score (S); neighborhood score threshold (T)
When extension stops because the score has dropped by more than the
threshold X, the resulting alignment is called a high-scoring segment pair (HSP)
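The extension step can be sketched as follows. This is a simplified, ungapped, rightward-only extension with assumed match and mismatch scores, not the actual BLAST implementation; `x_drop` plays the role of the threshold X.

```python
def extend_right(query, subject, q_start, s_start, word_len, x_drop,
                 match=1, mismatch=-2):
    """Extend an exact seed word to the right without gaps.

    Extension stops once the running score falls more than x_drop below
    the best score seen. Returns (best_score, length_of_best_segment).
    """
    score = best = word_len * match  # the seed is assumed to match exactly
    best_len = word_len
    i = word_len
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score > x_drop:
            break  # dropped too far below the best score: stop extending
        i += 1
    return best, best_len


# Seed "ACG" at position 0 of both sequences; the match continues to position 6.
print(extend_right("ACGTACGTTTTT", "ACGTACGAGGGG", 0, 0, word_len=3, x_drop=3))
```

The segment that is kept is the one ending at the best score, not the point where extension was abandoned.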
39. Sequence alignment and their
algorithms
Suggested BLAST Cutoffs
Chance matches are more likely in nucleotide databases than in protein databases
Identity in proteins is more informative than in nucleic acids
For nucleotide-based searches: hits with E-values of 10^-6 or
less and sequence identity of 70% or more
For protein-based searches: hits with E-values of 10^-3 or less and
sequence identity of 25% or more.
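These suggested cutoffs translate directly into a filter. The sketch below assumes each hit has already been reduced to an E-value and a percent identity; the function and dictionary names are hypothetical.

```python
# Suggested cutoffs: (maximum E-value, minimum percent identity).
CUTOFFS = {
    "nucleotide": (1e-6, 70.0),
    "protein": (1e-3, 25.0),
}


def passes_cutoff(e_value, identity_pct, search_type):
    """True when a hit meets both the E-value and the identity cutoff."""
    max_e, min_identity = CUTOFFS[search_type]
    return e_value <= max_e and identity_pct >= min_identity


print(passes_cutoff(1e-10, 85.0, "nucleotide"))  # meets both cutoffs
print(passes_cutoff(1e-4, 85.0, "nucleotide"))   # E-value too high
print(passes_cutoff(1e-4, 30.0, "protein"))      # meets both cutoffs
```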
40. Sequence alignment and their
algorithms
BLAST (basic local alignment search tool)
BLASTN
uses nucleotide sequences as queries to search against a nucleotide sequence database
BLASTP
uses protein sequences as queries to search against a protein sequence
database
BLASTX
uses nucleotide sequences as queries and translates them in all six reading
frames to produce translated protein sequences, which are used to query a
protein sequence database
TBLASTN
uses protein sequences as queries to search against a nucleotide sequence
database whose sequences are translated in all six reading frames
TBLASTX
uses nucleotide sequences, which are translated in all six frames, to search
against a nucleotide sequence database that has all the sequences
translated in six frames
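The "six reading frames" used by the translated searches are three codon offsets on the given strand plus three on its reverse complement. The sketch below only splits a DNA sequence into the codons of each frame; the codon-to-amino-acid translation is left out, and the frame labels (+1 to -3) are a naming choice made here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the codons of all six reading frames, keyed '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            usable = len(seq) - offset
            usable -= usable % 3  # drop an incomplete trailing codon
            frames[f"{strand}{offset + 1}"] = [
                seq[i:i + 3] for i in range(offset, offset + usable, 3)
            ]
    return frames


for name, codons in six_frames("ATGGCCTAA").items():
    print(name, codons)
```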
41. Sequence alignment and their
algorithms
PSI-BLAST
Position-specific iterated BLAST (PSI-BLAST) builds profiles and performs
database searches in an iterative fashion. The main feature of PSI-BLAST is
that profiles are constructed automatically and are fine-tuned in each successive
cycle
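To give a feel for what a profile is, the sketch below computes per-column residue frequencies from a set of aligned sequences. It is a deliberately simplified illustration: it omits the sequence weighting, pseudocounts, and log-odds conversion that a real position-specific scoring matrix would involve, and the input sequences are made up.

```python
from collections import Counter


def build_profile(aligned_seqs):
    """Return, for each alignment column, a dict of residue -> frequency.

    Gap characters ('-') are ignored when counting.
    """
    profile = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(seq[col] for seq in aligned_seqs if seq[col] != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


# Hypothetical aligned hits from one search round.
for column in build_profile(["ACD", "ACE", "AGD"]):
    print(column)
```

In an iterative search, a profile like this would be rebuilt from the hits of each round and used as the query for the next one.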
44. Sequence alignment and their
algorithms
Multiple sequence alignment
Exhaustive algorithms
The exhaustive alignment method involves examining all possible aligned
positions simultaneously
Heuristic algorithms
Because the use of dynamic programming is not feasible for routine multiple
sequence alignment, faster heuristic algorithms have been developed.
A heuristic is a computational strategy to find a near-optimal solution by
using rules of thumb. Essentially, this strategy takes shortcuts by reducing
the search space according to certain criteria
45. Sequence alignment and their
algorithms
Multiple sequence alignment
Heuristic algorithms
Progressive alignment
Progressive alignment depends on the stepwise assembly of multiple
alignment and is heuristic in nature
Clustal
It is a progressive multiple alignment program available either as a standalone or on-line program
T-Coffee
T-Coffee performs progressive sequence alignments as in Clustal. The main
difference is that, in processing a query, T-Coffee performs both global and
local pairwise alignment for all possible pairs involved. The global pairwise
alignment is performed using the Clustal program
47. Sequence alignment and their
algorithms
Multiple sequence alignment
Heuristic algorithms
Iterative alignment
The iterative approach is based on the idea that an optimal
solution can be found by repeatedly modifying existing
suboptimal solutions
48. Sequence alignment and their
algorithms
Multiple sequence alignment
Heuristic algorithms
Block-Based Alignment
The strategy identifies a block of ungapped alignment shared by all the
sequences, hence, the block-based local alignment strategy
49. Structural prediction
Structural prediction methods
Ab-initio prediction
Computational prediction based on first principles or using the most
elementary information
Threading
Method of predicting the most likely protein structural fold based on secondary
structure similarity with database structures and assessment of energies of the
potential fold. The term has been used interchangeably with fold recognition
Homology-based modeling
Method for predicting the three-dimensional structure of a protein based on
homology by assigning the structure of an unknown protein using an existing
homologous protein structure as a template
50. Hidden Markov model
Statistical model composed of a number of interconnected Markov chains
with the capability to generate the probability value of an event by taking
into account the influence from hidden variables. Mathematically, it
calculates probability values of connected states among the Markov chains
to find an optimal path within the network of states. It requires training to
obtain the probability values of state transitions. When using a hidden
Markov model to represent a multiple sequence alignment, a sequence can
be generated through the model by incorporating probability values of
match, insertion, and deletion states
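Finding an optimal path through the states is commonly done with the Viterbi algorithm, sketched below. The two states, the alphabet, and all probability values are invented for illustration (they are not the match, insertion, and deletion states of a profile model, and they were not obtained by training).

```python
import math

# Hypothetical two-state model over the alphabet {A, C}.
STATES = ("conserved", "variable")
START = {"conserved": 0.5, "variable": 0.5}
TRANS = {"conserved": {"conserved": 0.9, "variable": 0.1},
         "variable": {"conserved": 0.1, "variable": 0.9}}
EMIT = {"conserved": {"A": 0.9, "C": 0.1},
        "variable": {"A": 0.2, "C": 0.8}}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability)."""
    first = observations[0]
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][first]) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s at this step.
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path)), best[-1][last]


path, log_prob = viterbi("AAACCC", STATES, START, TRANS, EMIT)
print(path, math.exp(log_prob))
```

Log probabilities are used so that long sequences do not underflow to zero when many small probabilities are multiplied.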
52. Neural network algorithm
Machine-learning algorithm for pattern recognition. It is composed of
input, hidden, and output layers. Units of information in each layer are
called nodes. The nodes of different layers are interconnected to form a
network analogous to a biological nervous system. Between the nodes are
mathematical weight parameters that can be trained with known patterns
so they can be used for later predictions. After training, the network is able
to recognize the correlation between an input and an output
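The layered structure can be illustrated with a tiny network of two input nodes, two hidden nodes, and one output node. The weights below are set by hand purely for illustration rather than obtained by training, and a simple step function stands in for the activation; with these values the network computes the exclusive-or of its inputs.

```python
def step(x):
    """Threshold activation: the node fires (1) when its input is positive."""
    return 1 if x > 0 else 0


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Pass the inputs through one hidden layer and return the output node value."""
    hidden = [
        step(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return step(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hand-set weights (an assumption, not trained values).
HIDDEN_WEIGHTS = [[1, 1], [1, 1]]
HIDDEN_BIASES = [-0.5, -1.5]   # first hidden node acts as OR, second as AND
OUTPUT_WEIGHTS = [1, -2]       # output fires for OR but not AND
OUTPUT_BIAS = -0.5

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, forward(pair, HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS))
```

Training would adjust these weights automatically from known input and output patterns instead of fixing them by hand.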