FBW
             23-10-2012




Wim Van Criekinge
Course Contents: Bioinformatics




                                NO CLASS
DataBase Searching

               Dynamic Programming
                 Reloaded
               Database Searching
                 Fasta
                 Blast
                 Statistics
                 Practical Guide
               Extensions
                 PSI-Blast
                 PHI-Blast
                 Local Blast
                 BLAT
Needleman-Wunsch-edu.pl

The Score Matrix
----------------
     Seq1(j)        1     2     3     4     5     6     7
Seq2      *         C     K     H     V     F     C     R
(i)  *    0        -1    -2    -3    -4    -5    -6    -7
1    C   -1         1a    0    -1    -2    -3    -4    -5
2    K   -2         0c    2b    1     0    -1    -2    -3
3    K   -3        -1     1     1     0    -1    -2    -3
4    C   -4        -2     0     0     0    -1     0    -1
5    F   -5        -3    -1    -1    -1     1     0    -1
6    C   -6        -4    -2    -2    -2     0     2     1
7    K   -7        -5    -3    -3    -3    -1     1     1
8    C   -8        -6    -4    -4    -4    -2     0     0
9    V   -9        -7    -5    -5    -3    -3    -1    -1

A: matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH
   (MATCH if substr(seq1,j-1,1) eq substr(seq2,i-1,1), otherwise MISMATCH)
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP
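The fill rule behind this matrix can be sketched in a few lines of Python. This is not the Needleman-Wunsch-edu.pl script itself; the function name `nw_matrix` is hypothetical, and the scores MATCH = +1, MISMATCH = -1, GAP = -1 are an assumption inferred from the values in the table.

```python
def nw_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill a global-alignment score matrix; seq1 runs along columns (j), seq2 along rows (i)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    m = [[0] * cols for _ in range(rows)]
    for j in range(cols):          # first row: only gaps
        m[0][j] = j * gap
    for i in range(rows):          # first column: only gaps
        m[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[j - 1] == seq2[i - 1]
            diag = m[i - 1][j - 1] + (match if same else mismatch)  # rule A
            up = m[i - 1][j] + gap                                  # rule B
            left = m[i][j - 1] + gap                                # rule C
            m[i][j] = max(diag, up, left)
    return m


matrix = nw_matrix("CKHVFCR", "CKKCFCKCV")
for row in matrix:
    print(" ".join(f"{value:3d}" for value in row))
```

The bottom-right cell holds the score of the best global alignment under these assumed scores.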
Multiple Alignment Method

                    • The most practical and widely used
                      methods in multiple sequence alignment
                      are the hierarchical extensions of
                      pairwise alignment methods.
                    • The principle is that multiple alignment
                      is achieved by successive application
                      of pairwise methods.
                            – First do all pairwise alignments (not just one
                              sequence with all others)
                            – Then combine pairwise alignments to generate
                              overall alignment
Database Searching

                     • Consider the task of searching
                       SWISS-PROT against a query
                       sequence:
                       – say our query sequence is 362
                         amino acids long
                       – SWISS-PROT release 38
                         contains 29,085,265 amino acids
                       – finding local alignments via
                         dynamic programming would
                         entail O(10^10) matrix operations
                         (362 × 29,085,265 ≈ 1.05 × 10^10)
                     • Given the size of databases, more
                       efficient methods are needed
Heuristic approaches to DP for database searching

FASTA (Pearson 1995)
• Uses heuristics to avoid calculating the full dynamic
  programming matrix
• Speeds up searches by an order of magnitude compared
  to full Smith-Waterman
• The statistical side of FASTA is still stronger than BLAST

BLAST (Altschul 1990, 1997)
• Uses rapid word lookup methods to completely skip
  most of the database entries
• Extremely fast
     – One order of magnitude faster than FASTA
     – Two orders of magnitude faster than Smith-Waterman
• Almost as sensitive as FASTA
FASTA

        « Hit and extend heuristic»
        • Problem: Too many calculations
          “wasted” by comparing regions
          that have nothing in common
        • Initial insight: Regions that are
          similar between two sequences
          are likely to share short
          stretches that are identical
        • Basic method: Look for similar
          regions only near short
          stretches that match exactly
FASTA-Stages

               1.   Find k-tups in the two sequences (k=1,2 for
                    proteins, 4-6 for DNA sequences)
               2.   Score and select top 10 scoring “local diagonals”
               3.   Rescan top 10 regions, score with PAM250
                    (proteins) or DNA scoring matrix. Trim off the
                    ends of the regions to achieve highest scores.
               4.   Try to join regions with gapped alignments. Join
                    if similarity score is one standard deviation above
                    average expected score
               5.   After finding the best initial region, FASTA
                    performs a global alignment of a 32 residue wide
                    region centered on the best initial region, and
                    uses the score as the optimized score.
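Stages 1 and 2 can be illustrated with a rough Python sketch. It only counts exact k-tup hits per diagonal, which is a simplification of how FASTA actually scores diagonals; the function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def ktup_diagonals(query, subject, k):
    """Count exact k-tup matches on each diagonal (subject position minus query position)."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):          # lookup table of query k-tups
        index[query[i:i + k]].append(i)
    counts = defaultdict(int)
    for j in range(len(subject) - k + 1):        # scan the subject once
        for i in index.get(subject[j:j + k], ()):
            counts[j - i] += 1
    return dict(counts)


def top_diagonals(query, subject, k, n=10):
    """Return the n diagonals with the most k-tup hits, best first."""
    counts = ktup_diagonals(query, subject, k)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


print(top_diagonals("ACGTACGT", "TTACGTAC", k=4))
```

Building the lookup table once and scanning the subject once is what avoids filling the full dynamic programming matrix.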
FastA

        • Sensitivity: the ability of a
          program to identify weak but
          biologically significant sequence
          similarity.
        • Selectivity: the ability of a
          program to discriminate between
          true matches and matches
          occurring by chance alone.
          – A decrease in selectivity results in
            more false positives being reported.
FastA (http://www.ebi.ac.uk/fasta33/)



• Gap opening penalty: -12 and -16 by default for fasta
  with proteins and DNA, respectively
• Gap extension penalty: -2 and -4 by default for fasta
  with proteins and DNA, respectively
• Max number of scores and alignments is 100
• Matrix: Blosum50 is the default
     – Lower PAM, higher blosum to detect close sequences
     – Higher PAM and lower blosum to detect distant sequences
• Word length: the larger the word-length, the less
  sensitive but faster the search will be
FastA Output
[Annotated screenshot of FastA search results]
• Result columns: database code (hyperlinked to the SRS
  database at EBI), accession number, description, length
• Initn, init1, opt and z-score are calculated during the run
• E score: the expectation value, i.e. how many hits are
  expected to be found by chance with such a score while
  comparing this query to this database
• E() does not represent the % similarity
FastA is a family of programs

     FastA, TFastA, FastX, FastY

                    Query:      DNA   Protein


                    Database:DNA      Protein
FASTA problems


                 FASTA can miss significant
                 similarity since
                 – For proteins, similar sequences do
                   not have to share identical residues
                    • Asp-Lys-Val is quite similar to
                      Glu-Arg-Ile, yet it is missed even with
                      ktuple size of 1 since no amino acid
                      matches
                    • Gly-Asp-Gly-Lys-Gly is quite similar
                      to Gly-Glu-Gly-Arg-Gly but there is
                      no match with ktuple size of 2
FASTA problems

                 FASTA can miss significant
                 similarity since
                 – For nucleic acids, due to codon
                   “wobble”, DNA sequences may
                   look like XXyXXyXXy where X’s
                   are conserved and y’s are not
                    • GGuUCuACgAAg and
                      GGcUCcACaAAa both code for
                      the same peptide sequence (Gly-Ser-
                      Thr-Lys) but they don’t match with
                      ktuple size of 3 or higher
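Both problem cases can be checked directly with a small Python sketch. The helper name `shared_ktuples` is hypothetical, and the peptides are written in one-letter code (DKV, ERI, GDGKG, GEGRG).

```python
def shared_ktuples(seq_a, seq_b, k):
    """Return the set of k-tuples that occur in both sequences."""
    def ktuples(seq):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return ktuples(seq_a) & ktuples(seq_b)


# Proteins: similar residues, but no identical k-tuples to seed on
print(shared_ktuples("DKV", "ERI", 1))
print(shared_ktuples("GDGKG", "GEGRG", 2))

# Nucleic acids: same peptide (Gly-Ser-Thr-Lys), differing third codon positions
print(shared_ktuples("GGUUCUACGAAG", "GGCUCCACAAAA", 3))
```

Each call returns an empty set, so a search seeded on exact k-tuples of these sizes has nothing to start from.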
DataBase Searching

               Dynamic Programming
                 Reloaded
               Database Searching
                 Fasta
                 Blast
                 Statistics
                 Practical Guide
               Extensions
                 PSI-Blast
                 PHI-Blast
                 Local Blast
                 BLAT
BLAST - Basic Local Alignment
        Search Tool
What does BLAST do?

• Search a large target set of sequences...

• …for hits to a query sequence...

• …and return the alignments and scores from those
  hits...

• Do it fast.

Show me those sequences that deserve a second look.
  Blast programs were designed for fast database
  searching, with minimal sacrifice of sensitivity to
  distantly related sequences.
The big red button




                          Do My Job


             It is dangerous to hide too much of the
             underlying complexity from the scientists.
Overview

           • Approach: find segment pairs
             by first finding word pairs that
             score above a
             threshold, i.e., find word pairs of
             fixed length w with a score of at
             least T
           • Key concept “Neighborhood”:
             Seems similar to FASTA, but
             we are searching for words
             which score above T rather than
             words that match exactly
           • Calculate neighborhood (T) for
Overview


Compile a list of words which give a score
above T when paired with the query sequence.
– Example using PAM-120 for query sequence ACDE
  (w=4, T=17):

           A    C    D    E
           A    C    D    E = +3 +9 +5 +5 = 22
   • try all possibilities:
           A    A    A    A = +3 -3     0 0 = 0     no good
           A    A    A    C = +3 -3     0 -7 = -7   no good
   • ...too slow, try directed change
Overview
      A      C D E
      A      C D E = +3 +9 +5 +5 = 22
            • change 1st pos. to all acceptable substitutions
      g      C D E = +1 +9 +5 +5 = 20 ok
      n      C D E = +0 +9 +5 +5 = 19 ok
      I      C D E = -1 +9 +5 +5 = 18 ok
      k      C D E = -2 +9 +5 +5 = 17 ok
         • change 2nd pos.: can't - all alternatives negative
           and the other three positions only add up to 13
         • change 3rd pos. in combination with first position
      gCnE = 1 9 2 5 = 17 ok
         • continue - use recursion

• For "best" values of w and T there are typically
  about 50 words in the list for every residue in the
  query sequence
Neighborhood.pl

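# Calculate neighborhood
# Assumed to be defined earlier in the script:
#   @A  alphabet of residues
#   %S  scoring matrix, accessed as $S{$a}{$b}
#   @W  the three letters of the query word
#   $T  neighborhood score threshold
my %NH;
for (my $i = 0; $i < @A; $i++) {
   my $s1 = $S{$W[0]}{$A[$i]};
   for (my $j = 0; $j < @A; $j++) {
      my $s2 = $S{$W[1]}{$A[$j]};
      for (my $k = 0; $k < @A; $k++) {
         my $s3 = $S{$W[2]}{$A[$k]};
         my $score = $s1 + $s2 + $s3;
         my $word = "$A[$i]$A[$j]$A[$k]";
         next if $word =~ /[BZX*]/;   # skip ambiguity codes and stop
         $NH{$word} = $score if $score >= $T;
      }
   }
}

# Output neighborhood, best score first, ties in alphabetical order
foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) {
   print "$word $NH{$word}\n";
}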
BLOSUM62, word RGD, T = 11     PAM200, word RGD, T = 13

RGD 17            RGD 18
KGD 14            RGE 17
QGD 13            RGN 16
RGE 13            KGD 15
EGD 12            RGQ 15
HGD 12            KGE 14
NGD 12            HGD 13
RGN 12            KGN 13
AGD 11            RAD 13
MGD 11            RGA 13
RAD 11            RGG 13
RGQ 11            RGH 13
RGS 11            RGK 13
RND 11            RGS 13
RSD 11            RGT 13
SGD 11            RSD 13
TGD 11            WGD 13
[Figure: indexed word hits (*) and a plot of Score against Length of extension; labels: "S", "Trim to max"]

*Two non-overlapping HSP’s on a diagonal within distance A
The BLAST algorithm


• Break the search sequence into words
     – W = 3 for proteins, W = 12 for DNA
     – Example: MCGPFILGTYC gives
       MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC

• Include in the search all words that score
  above a certain value (T) for any search word
     – MCG: MCT, MCN, …
     – CGP: MGP, CTP, …
     – This list can be computed in linear time
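The word-splitting step is simple enough to show as a short Python sketch (the function name `query_words` is hypothetical):

```python
def query_words(sequence, w):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


print(query_words("MCGPFILGTYC", 3))
```

A query of length L yields L - w + 1 words, which is why the word list grows only linearly with the query.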
The Blast Algorithm (2)


• Search for the words in the database
     – Word locations can be precomputed and indexed
     – Searching for a short string in a long string
• HSP (High Scoring Pair) = A match between
  a query word and the database
• Find a “hit”: Two non-overlapping HSP’s on a
  diagonal within distance A
• Extend the hit until the score falls below a
  threshold value, S
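The "two non-overlapping hits on a diagonal within distance A" rule can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: word hits are given as hypothetical (query position, database position) pairs, only neighbouring hits on a diagonal are compared, and the function name is invented for the example.

```python
from collections import defaultdict


def two_hit_seeds(hits, w, max_distance):
    """Find pairs of word hits on one diagonal that do not overlap
    and lie within max_distance (the slide's A) of each other."""
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            if w <= distance <= max_distance:   # non-overlapping and close enough
                seeds.append((diagonal, first, second))
    return seeds


hits = [(0, 10), (5, 15), (6, 16), (40, 50), (3, 20)]
print(two_hit_seeds(hits, w=3, max_distance=40))
```

Only the seeds returned here would go on to the extension step, which is where most of the remaining work is spent.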
BLAST parameters


• Lowering the neighborhood word threshold (T)
  allows more distantly related sequences to be
  found, at the expense of increased noise in the
  results set.
• Choosing a value for w
    – small w: many matches to expand
    – big w: many words to be generated
    – w=4 is a good compromise
• Lowering the segment extension cutoff (S) returns
  longer extensions for each hit.
• Changing the minimum E-value changes the
  threshold for reporting a hit.
Critical parameters: T,W and scoring matrix

                   • The proper value of T depends on both the
                     values in the scoring matrix and the balance
                     between speed and sensitivity
                   • Higher values of T progressively remove
                     more word hits and reduce the search space.
                   • Word size (W) of 1 will produce more hits
                     than a word size of 10. In general, if T is
                     scaled uniformly with W, smaller word
                     sizes increase sensitivity and decrease
                     speed.
                   • The interplay between W, T and the scoring
                     matrix is critical and choosing them wisely
                     is the most effective way of controlling the
                     speed and sensitivity of blast
DataBase Searching

               Dynamic Programming
                 Reloaded
               Database Searching
                 Fasta
                 Blast
                 Statistics
                 Practical Guide
               Extensions
                 PSI-Blast
                 PHI-Blast
                 Local Blast
                 BLAT
Database Searching


• How can we find a particular short sequence
  in a database of sequences (or one HUGE
  sequence)?
• Problem is identical to local sequence
  alignment, but on a much larger scale.
• We must also have some idea of the
  significance of a database hit.
     – Databases always return some kind of hit, how
       much attention should be paid to the result?
• How can we determine how “unusual” a
  particular alignment score is?
Significance

  Sentence 1:
  “These algorithms are trying to find the best way to match up
  two sequences”

  Sentence 2:
  “This does not mean that they will find anything profound”

  ALIGNMENT:

  THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
  :: :.. . .. ...:    :    ::::..       :: . : ...
  THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------

  12 exact matches
  14 conservative substitutions

                  Is this a good alignment?
Overview


           • A key to the utility of BLAST is
             the ability to calculate expected
             probabilities of occurrence of
             Maximum Segment Pairs
             (MSPs) given w and T
           • This allows BLAST to rank
             matching sequences in order of
             “significance” and to cut off
             listings at a user-specified
             probability
Mathematical Basis of BLAST


 • Model matches as a sequence of coin tosses
 • Let p be the probability of a “head”
     – For a “fair” coin, p = 0.5
 • (Erdös-Rényi) If there are n throws, then the
   expected length R of the longest run of heads is
                      R = \log_{1/p}(n)
 • Example: Suppose n = 20 for a “fair” coin
                      R = \log_2(20) = 4.32

 • Trick is how to model DNA (or amino acid)
   sequence alignments as coin tosses.
Mathematical Basis of BLAST


• To model random sequence alignments, replace a
  match with a “head” and mismatch with a “tail”.




                              AATCAT
                                        HTHHHT
                              ATTCAG



• For DNA, the probability of a “head” is 1/4
     – What is it for amino acid sequences?
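The coin-toss model is easy to try out in Python. The sketch below turns the alignment from this slide into heads and tails, measures the longest run of heads, and evaluates the expected run length log base 1/p of n; all function names are hypothetical.

```python
import math


def coin_tosses(seq_a, seq_b):
    """Write a match as 'H' and a mismatch as 'T' for two aligned sequences."""
    return "".join("H" if a == b else "T" for a, b in zip(seq_a, seq_b))


def longest_run(tosses, symbol="H"):
    """Length of the longest uninterrupted run of symbol."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == symbol else 0
        best = max(best, current)
    return best


def expected_longest_run(n, p):
    """Expected longest run of heads in n tosses: log base 1/p of n."""
    return math.log(n) / math.log(1.0 / p)


tosses = coin_tosses("AATCAT", "ATTCAG")
print(tosses, longest_run(tosses))
print(round(expected_longest_run(20, 0.5), 2))    # fair coin, 20 throws
print(round(expected_longest_run(20, 0.25), 2))   # DNA: p = 1/4
```

With p = 1/4 the expected run for the same n is half as long as for a fair coin, since log base 4 is half of log base 2.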
Mathematical Basis of BLAST


• So, for one particular alignment, the Erdös-Rényi
  property can be applied
• What about for all possible alignments?
     – Consider that sequences are being shifted back and
       forth, dot matrix plot
• The expected length of the longest match is

                              R = \log_{1/p}(mn)

        where m and n are the lengths of the two sequences.
Analytical derivation




                        Erdös-Rényi
                              …
                              …
                              …
                        Karlin-Altschul
Karlin-Altschul Statistics

                            E = k m n e^{-\lambda S}
     This equation states that the number of alignments
      expected by chance (E) during the sequence
      database search is a function of the size of the
      search space (m*n), the normalized score (λS)
      and a minor constant (k, mostly 0.1)

    E-Value grows linearly with the product of target and
    query sizes. Doubling target set size and doubling
    query length have the same effect on e-value
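A small Python sketch makes the linear dependence on the search space concrete. The function name is hypothetical, k = 0.1 follows the slide, and λ = 0.3 and the score of 100 are illustrative assumptions only; real values depend on the scoring system.

```python
import math


def expected_hits(m, n, score, k=0.1, lam=0.3):
    """Karlin-Altschul expectation E = k * m * n * exp(-lam * score)."""
    return k * m * n * math.exp(-lam * score)


query_length = 362            # query from the SWISS-PROT example
database_size = 29_085_265    # residues in SWISS-PROT release 38

base = expected_hits(query_length, database_size, score=100)
double_query = expected_hits(2 * query_length, database_size, score=100)
double_database = expected_hits(query_length, 2 * database_size, score=100)
print(base, double_query / base, double_database / base)
```

Both ratios come out as 2, while raising the score lowers E exponentially.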
Analytical derivation




                        Erdös-Rényi      R = \log_{1/p}(mn)

                              …
                              …
                              …
                        Karlin-Altschul   E = k m n e^{-\lambda S}
Scoring alignments




• Score: S (~R)

  – S = \sum_i M(q_i, t_i) - \text{gap penalties}


• Any alignment has a score
• Any two sequences have at least one
  optimal alignment
• For a particular scoring matrix and its
  associated gap initiation and extension costs
  one must calculate λ and k
• Unfortunately (for gapped alignments), you
  can’t do this analytically and the values must
  be estimated empirically
  – The procedure involves aligning random
    sequences (Monte Carlo approach) with a specific
    scoring scheme and observing the alignment
    properties (scores, target frequencies and
    lengths)
Significance


“Monte Carlo” Approach:

• Compares result to randomized
  result, similarly to results generated by a
  roulette wheel at Monte Carlo
• Typical procedure for alignments
   – Randomize sequence A
   – Align to sequence B
   – Repeat many times (hundreds)
   – Keep track of the optimal score
• Histogram of scores …
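The procedure can be sketched in Python as below. This is not the course's Perl script: the function names are hypothetical, and the global alignment scores (match +1, mismatch -1, gap -1) are assumed for illustration.

```python
import random


def nw_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score, keeping only one matrix row in memory."""
    previous = [j * gap for j in range(len(seq_a) + 1)]
    for i in range(1, len(seq_b) + 1):
        current = [i * gap]
        for j in range(1, len(seq_a) + 1):
            same = seq_a[j - 1] == seq_b[i - 1]
            diag = previous[j - 1] + (match if same else mismatch)
            current.append(max(diag, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def monte_carlo_scores(seq_a, seq_b, trials=200, seed=0):
    """Shuffle seq_a repeatedly and record the optimal score against seq_b."""
    rng = random.Random(seed)
    letters = list(seq_a)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)                 # randomize sequence A
        scores.append(nw_score("".join(letters), seq_b))
    return scores


def empirical_p(observed, null_scores):
    """Fraction of randomized scores at least as high as the observed one."""
    return sum(score >= observed for score in null_scores) / len(null_scores)


observed = nw_score("CKHVFCR", "CKKCFCKCV")
null_scores = monte_carlo_scores("CKHVFCR", "CKKCFCKCV")
print(observed, empirical_p(observed, null_scores))
```

Shuffling keeps the composition of sequence A but destroys its order, so the recorded scores show what alignment scores look like when there is no real relationship.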
Assessing significance requires a distribution


• I have a pumpkin of diameter 1 m. Is that unusual?

[Figure: distribution plot; x-axis: Diameter (m), y-axis: Frequency]
Significance


                Normal Distribution does NOT Fit Alignment Scores !!

               • In seeking optimal Alignments between two
                 sequences, one desires those that have the highest
                 score - i.e. one is seeking a distribution of maxima
               • In seeking optimal Matches between an Input
                 Sequence and Sequence Entries in a Database, one
                 again desires the matches that have the highest
                 score, and these are obtained via examination of the
                 distribution of such scores for the entries in the
                 database - this is again a distribution of maxima.

                “A Normal Distribution is a distribution of Sums of
                 independent variables rather than a sum of their
                 Maxima.“
Comparing distributions



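              Gaussian:
                 f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

              Extreme Value:
                 f(x) = \frac{1}{\beta} e^{-\frac{x-\mu}{\beta}} e^{-e^{-\frac{x-\mu}{\beta}}}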
Alignment scores follow extreme value distributions

Alignment of unrelated/random sequences results in scores
following an extreme value distribution

                                            P = 1 - e^{-E} \approx E (for small E)
                                            P(x \geq S) = 1 - \exp(-k m n e^{-\lambda S})
                                            m, n: sequence lengths.
                                            k, \lambda: free parameters.
                                            E = -\ln(1 - P)


This can be shown analytically for ungapped alignments and has
been found empirically to also hold for gapped alignments under
commonly used conditions.
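The two conversions between P and E can be written as a pair of small Python helpers (names hypothetical). `math.expm1` and `math.log1p` are used so that very small values do not lose precision.

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit."""
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """E = -ln(1 - P): expected number of chance hits."""
    return -math.log1p(-p_value)


for e_value in (1e-6, 0.01, 1.0, 10.0):
    print(e_value, p_from_e(e_value))
```

For small E the two numbers are nearly identical, while for large E the probability saturates just below 1.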
Alignment scores follow extreme value distributions

Alignment algorithms will always produce
alignments, regardless of whether they are meaningful or not
=> important to have way of selecting significant alignments
from large set of database hits.
Solution: fit distribution of scores from database search to
extreme value distribution; determine p-value of hit from this
fitted distribution.

                                      Example: scores fitted to
                                      extreme value distribution.
                                      99.9% of this distribution is
                                      located below score=112
                                      => hit with score = 112 has a
                                      p-value of 0.1%
Significance


                                     BLAST uses precomputed extreme
                                        value distributions to calculate E-
                                        values from alignment scores
                                     For this reason BLAST only allows
                                        certain combinations of substitution
                                        matrices and gap penalties
                                     This also means that the fit is based on
                                        a different data set than the one you
                                        are working on

  A word of caution: BLAST tends to overestimate the significance of its
  matches

  E-values from BLAST are fine for identifying sure hits
  One should be careful using BLAST’s E-values to judge if a marginal hit
  can be trusted (e.g., you may want to use E-values of 10^-4 to 10^-5).
Determining P-values


• If we can estimate the location and scale parameters of
  the extreme value distribution (μ and β), then we can
  determine, for a given match score x, the
  probability that a random match with score x
  or greater would have occurred in the
  database.
• For sequence matches, a scoring system and
  database can be parameterized by two
  parameters, k and λ, related to μ and β.
     – It would be nice if we could compare hit
       significance without regard to the scoring system
       used!
Bit Scores


• The expected number of hits with score \geq S
  is:
      E = K m n e^{-\lambda S}
     – Where m and n are the sequence lengths
• Normalize the raw score using:
      S' = \frac{\lambda S - \ln K}{\ln 2}

• Obtains a “bit score” S’, with a standard set of
  units.
• The new E-value is: E = m n 2^{-S'}
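A short Python check shows that the bit-score form gives the same E-value as the raw-score form. The function names are hypothetical, and λ = 0.3 and K = 0.1 are illustrative assumptions rather than values for any particular scoring matrix.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2.0)


def e_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_from_bits(bits, m, n):
    """E = m * n * 2^(-S'); no scoring-system parameters needed."""
    return m * n * 2.0 ** (-bits)


lam, k = 0.3, 0.1             # assumed, for illustration only
m, n = 362, 29_085_265
bits = bit_score(100, lam, k)
print(bits, e_from_raw(100, m, n, lam, k), e_from_bits(bits, m, n))
```

Because λ and K are folded into S', bit scores from searches with different scoring systems can be compared directly.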
Needleman-wunsch-Monte-Carlo.pl

                         -74
                         -73
                         -72   *
                         -71   *****
                         -70   *******
                         -69   **********
                         -68   ***************
                         -67   *************************
                         -66   *************************
                         -65   ************************************
                         -64   *****************************************
                         -63   ************************************************************
                         -61   ************************
                         -60   *****************************
                         -59   *******************
                         -58   **************
                         -57   *********
                         -56   ********
                         -55   *****
                         -54   ****
                         -53   *
                         -52   *
                         -51   *
                         -50
                         -49

(Average around -64 !)
FastA Output

               • The distribution-of-scores graph shows the
                 frequency of observed scores
               • The expected curve (asterisks) follows
                 the extreme value distribution
                      –the theoretical curve should be
                       similar to the observed results
               • Deviations indicate that the fitting
                 parameters are wrong
                      –too weak gap penalties
                      –compositional biases
FastA Output

               < 20 222   0 :*
                22 30     0 :*
                24 18     1 :*
                26 18 15 :*
                28 46 159 :*
                30 207 963 :*
                32 1016 3724 := *
                34 4596 10099 :==== *
                36 9835 20741 :=========     *
                38 23408 34278 :====================     *
                40 41534 47814 :=================================== *
                42 53471 58447 :============================================ *
                44 73080 64473 :====================================================*=======
                46 70283 65667 :=====================================================*====
                48 64918 62869 :===================================================*==
                50 65930 57368 :===============================================*=======
                52 47425 50436 :======================================= *
                54 36788 43081 :=============================== *
                56 33156 35986 :============================ *
                58 26422 29544 :====================== *
                60 21578 23932 :================== *
                62 19321 19187 :===============*
                64 15988 15259 :============*=
                66 14293 12060 :=========*==
                68 11679 9486 :=======*==
                70 10135 7434 :======*==
FastA Output


               72 8957 5809 :====*===                                            Related
                74 7728 4529 :===*===
                76 6176 3525 :==*===
                78 5363 2740 :==*==
                80 4434 2128 :=*==
                82 3823 1628 :=*==
                84 3231 1289 :=*=
                86 2474 998 :*==
                88 2197 772 :*=
                90 1716 597 :*=
                92 1430 462 :*=      :===============*========================
                94 1250 358 :*=      :============*===========================
                96 954 277 :*      :=========*=======================
                98 756 214 :*      :=======*===================
               100 678 166 :*       :=====*==================
               102 580 128 :*       :====*===============
               104 476 99 :*       :===*=============
               106 367 77 :*       :==*==========
               108 309 59 :*       :==*========
               110 287 46 :*       :=*========
               112 206 36 :*       :=*======
               114 161 28 :*       :*=====
               116 144 21 :*       :*====
               118 127 16 :*       :*====
               >120 886 13 :*       :*==============================
FastA Output

               • A summary of the statistics and of the
                 program parameters follows the histogram.
                  – An important number in this summary is the
                    Kolmogorov-Smirnov statistic, which indicates
                    how well the actual data fit the theoretical
                    statistical distribution. The lower this value, the
                    better the fit, and the more reliable the statistical
                    estimates.
                  – In general, a Kolmogorov-Smirnov statistic under
                    0.1 indicates a good fit with the theoretical model.
                    If the statistic is higher than 0.2, the statistics may
                    not be valid, and it is recommended to repeat the
                    search, using more stringent (more negative)
                    values for the gap penalty parameters.
Statistics summary

• Optimal local alignment scores for pairs of random
  amino acid sequences of the same length follow an
  extreme-value distribution. For any score S, the
  probability of observing a score >= S is given by the
  Karlin-Altschul statistic
  P(score >= S) = 1 - \exp(-k m n e^{-\lambda S})
• k and λ are parameters related to the position
  of the maximum and the width of the distribution.
• Note the long tail at the right. This means that a
  score several standard deviations above the mean
  has a higher probability of arising by chance (that is, it
  is less significant) than if the scores followed a
  normal distribution.
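The effect of the long right tail can be made concrete with a Python sketch that compares upper-tail probabilities for a normal distribution and an extreme value (Gumbel) distribution, both standardized to mean 0 and standard deviation 1. The function names are hypothetical and the standardization is an assumption made so the two curves are comparable.

```python
import math

EULER_GAMMA = 0.5772156649015329


def normal_tail(z):
    """P(X >= z) for a standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def gumbel_tail(z):
    """P(X >= z) for a Gumbel distribution rescaled to mean 0 and standard deviation 1."""
    beta = math.sqrt(6.0) / math.pi      # scale that gives unit variance
    mu = -EULER_GAMMA * beta             # location that gives zero mean
    return -math.expm1(-math.exp(-(z - mu) / beta))


for z in (2, 3, 4, 5):
    print(z, normal_tail(z), gumbel_tail(z))
```

At every z shown the extreme value tail is the larger of the two, and the gap widens as z grows, which is why Z-scores read against a normal table overstate significance.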
P-values


• Many programs report P = the probability that the
  alignment is no better than random. The relationship
  between Z and P depends on the distribution of the
  scores from the control population, which do NOT
  follow the normal distribution
     – P <= 10^-100 (exact match)
     – P in range 10^-100 to 10^-50 (sequences nearly identical, e.g.
       alleles or SNPs)
     – P in range 10^-50 to 10^-10 (closely related
       sequences, homology certain)
     – P in range 10^-5 to 10^-1 (usually distant relatives)
     – P > 10^-1 (match probably insignificant)
E

• For database searches, most programs report E-values. The
  E-value of an alignment is the expected number of sequences
  that give the same Z-score or better if the database is probed
  with a random sequence. E is found by multiplying the value
  of P by the size of the database probed. Note that E but not P
  depends on the size of the database. Values of P are
  between 0 and 1. Values of E are between 0 and the number
  of sequences in the database searched:
    – E <= 0.02: sequences probably homologous
    – E between 0.02 and 1: homology cannot be ruled out
    – E > 1: a match this good would be expected by chance alone
Blast

        BLAST is actually a family of programs:
        • BLASTN - Nucleotide query searching a
          nucleotide database.
        • BLASTP - Protein query searching a
          protein database.
        • BLASTX - Translated nucleotide query
          sequence (6 frames) searching a protein
          database.
        • TBLASTN - Protein query searching a
          translated nucleotide (6 frames) database.
        • TBLASTX - Translated nucleotide query (6
          frames) searching a translated nucleotide
          (6 frames) database.
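BLASTX, TBLASTN and TBLASTX all compare sequences at the protein level after translating nucleotide sequence in six reading frames. The sketch below shows what a six-frame translation produces. It assumes the standard genetic code and is not the code BLAST itself uses.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for name, protein in six_frame_translation("ATGAAATGA").items():
    print(name, protein)
```

Each of the six protein strings is then searched like an ordinary protein query (BLASTX) or database entry (TBLASTN), which is why these programs are slower than BLASTP.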
Tips


       • Be aware of what options you
         have selected when using
         BLAST, or FASTA
         implementations.
       • Treat BLAST searches as
         scientific experiments
       • So you should try your searches
         with the filters on and off to see
         whether it makes any difference
         to the output
Tips: Low-complexity and Gapped Blast Algorithm

                  • The common Web-based implementations often have
                    default settings that will affect the outcome
                    of your searches. By default, all NCBI BLAST
                    implementations filter out biased sequence
                    composition from your query sequence (e.g.
                    signal peptide and transmembrane
                    sequences - beware!).
                  • The SEG program has been implemented
                    as part of the blast routine in order to mask
                    low-complexity regions
                  • Low-complexity regions are denoted by
                    strings of Xs in the query sequence
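To see what masking does to a query, the sketch below replaces low-complexity windows with Xs. It is NOT the SEG algorithm: it uses a plain sliding-window Shannon entropy, and the window size and threshold are assumed values chosen for illustration.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace the residues of every low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)

# Made-up query with a serine-rich stretch in the middle.
query = "MKTAYIAKQR" + "S" * 15 + "LVDEGHWPNC"
print(mask_low_complexity(query))
```

Running a search with the masked and the unmasked query side by side is a simple way to follow the advice above and see whether the filter changes the hit list.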
Tips


       • The sequence databases contain a
         wealth of information. They also
         contain a lot of errors. Contaminants
         …
       • Annotation errors, frameshifts that
         may result in erroneous conceptual
         translations.
       • Hypothetical proteins ?

       • In the words of Fox Mulder, "Trust
         no one."
Tips


       • Once you get a match to things
         in the databases, check whether
         the match is to the entire
         protein, or to a domain. Don't
         immediately assume that a
         match means that your protein
         carries out the same function
         (see above). Compare your
         protein and the match protein(s)
         along their entire lengths before
         making this assumption.
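Whether a hit spans the whole protein or only a domain can be judged from the alignment coordinates. A minimal sketch follows. The 80% cut-off and the function names are assumptions for illustration, not an established rule.

```python
def coverage(aln_start, aln_end, seq_length):
    """Fraction of a sequence covered by an alignment (1-based, inclusive)."""
    return (aln_end - aln_start + 1) / seq_length

def describe_match(q_start, q_end, q_len, s_start, s_end, s_len, full=0.8):
    """Call a hit full-length only if it covers most of BOTH sequences."""
    q_cov = coverage(q_start, q_end, q_len)
    s_cov = coverage(s_start, s_end, s_len)
    if q_cov >= full and s_cov >= full:
        return "full-length match"
    return "partial (possibly domain-only) match"

# Hypothetical coordinates: query 600 residues, hit aligned over 101 of them.
print(describe_match(40, 140, 600, 10, 110, 120))
```

A partial match can still be biologically meaningful, but it only supports conclusions about the aligned region, not about the function of the whole protein.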
Tips


       • Domain matches can also cause problems
         by hiding other informative matches. For
         instance if your protein contains a common
         domain you'll get significant matches to
         every homologous sequence in the
         database. BLAST only reports back a
         limited number of matches, ordered by P
         value.
       • If this list consists only of matches to the
         same domain, cut this bit out of your query
         sequence and do the BLAST search again
         with the edited sequence (e.g. NHR).
Tips

       • Do controls wherever possible, in
         particular when you use a search
         program for the first time.
       • Suitable positive controls would be protein
         sequences known to have distant
         homologues in the databases to check
         how good the software is at detecting such
         matches.
       • Negative controls can be employed to
         make sure the compositional bias of the
         sequence isn't giving you false positives.
         Shuffle your query sequence and see what
         difference this makes to the matches that
         are returned. A real match should be lost
         upon shuffling of your sequence.
Tips

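       • Perform Controls
          #!/usr/bin/perl -w
          # Shuffle a FASTA sequence: keeps the composition, destroys the order.
          use strict;

          my ($def, @seq) = <>;            # definition line, then sequence lines
          print $def;
          chomp @seq;
          @seq = split(//, join("", @seq));
          my $count = 0;
          while (@seq) {
             my $index = rand(@seq);       # random position; splice truncates it to an integer
             my $base = splice(@seq, $index, 1);
             print $base;
             print "\n" if ++$count % 60 == 0;   # wrap output at 60 characters
          }
          print "\n" unless $count % 60 == 0;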
Tips


       • Read the footer first
       • View results graphically
       • Parse Blasts with Bioperl
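The slide recommends Bioperl for parsing. As a stand-in, the sketch below reads BLAST's tab-delimited output (twelve standard columns) with the Python standard library only. The hit lines are made up for illustration, and the 0.02 cut-off is the rule of thumb from the E-value slide.

```python
import csv
import io

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")

def parse_tabular(handle):
    """Yield one dict per hit from tab-delimited BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        yield hit

# Made-up hit lines for illustration only.
SAMPLE = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-110\t400\n"
    "query1\tsubjB\t31.20\t125\t80\t4\t40\t160\t5\t128\t0.35\t32.0\n"
)
hits = list(parse_tabular(io.StringIO(SAMPLE)))
significant = [h["sseqid"] for h in hits if h["evalue"] <= 0.02]
print(significant)
```

Parsing into named fields makes it easy to filter on E-value or alignment length instead of reading the report by eye.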
FastA vs. Blast

                  • BLAST's major advantage is its speed.
                     – 2-3 minutes for BLAST versus several hours
                       for a sensitive FastA search of the whole of
                       GenBank.
                  • When both programs use their default
                    settings, BLAST is usually more sensitive
                    than FastA for detecting protein sequence
                    similarity.
                     – Since it doesn't require a perfect sequence
                       match in the first stage of the search.
FastA vs. Blast

                     Weakness of BLAST:
                     – The long word size it uses in the initial stage of DNA
                       sequence similarity searches was chosen for speed, and not
                       sensitivity.
                     – For a thorough DNA similarity search, FastA is the
                       program of choice, especially when run with a lowered
                       KTup value.
                     – FastA is also better suited to the specialised task of
                       detecting genomic DNA regions using a cDNA query
                       sequence, because it allows the use of a gap extension
                       penalty of 0. The original BLAST, which only creates
                       ungapped alignments, will usually detect only the
                       longest exon, or fail altogether.
                  • In general, a BLAST search using the default
                    parameters should be the first step in a database
                    similarity search strategy. In many cases, this is all
                    that may be required to yield all the information
                    needed, in a very short time.
PSI-Blast

            1. Old (ungapped) BLAST

            2. New BLAST (allows gaps)

            3. Profile -> PSI-BLAST (Position-Specific
            Iterated BLAST)
                 Strategy:  Builds a multiple alignment of the hits
                            Calculates a position-specific score matrix (PSSM)
                            Searches with this matrix
                In many cases it is much more sensitive to weak but
                   biologically relevant sequence similarities
PSI-Blast
            • Patterns of conservation from the alignment of
              related sequences can aid the recognition of
              distant similarities.
               – These patterns have been variously called
                 motifs, profiles, position-specific score
                 matrices, and Hidden Markov Models.
                 For each position in the derived pattern, every
                 amino acid is assigned a score.
                 (1) Highly conserved residue at a position: that
                 residue is assigned a high positive score, and
                 others are assigned high negative scores.
                 (2) Weakly conserved positions: all residues receive
                 scores near zero.
                 (3) Position-specific scores can also be assigned to
                 potential insertions and deletions.
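The scoring rules above can be sketched as a log-odds matrix built from an alignment. This is a simplified illustration, not PSI-BLAST's actual weighting scheme: it assumes an ungapped alignment, a uniform background frequency and a flat pseudocount of 1.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum the position-specific scores of a segment of matching length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Toy ungapped alignment, made up for illustration.
ALIGNMENT = ["RGDS", "RGDT", "RGDA", "KGDS"]
pssm = build_pssm(ALIGNMENT)
print(round(score(pssm, "RGDS"), 2), round(score(pssm, "AAAA"), 2))
```

The fully conserved G column gives G a clearly positive score and every other residue a negative one, while the variable last column spreads smaller scores over S, T and A.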
Pattern

• a set of alternative
  sequences, using
  “regular expressions”
• Prosite
  (http://www.expasy.org/prosite/)
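A PROSITE-style pattern maps almost one-to-one onto a regular expression. The converter below is a sketch that handles only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors). The example pattern is the N-glycosylation site motif, N-{P}-[ST]-{P}.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # '<' anchors the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # '>' anchors the C-terminus
            suffix, element = "$", element[:-1]
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("{", "[^").replace("}", "]")   # {P} = anything but P
        element = element.replace("(", "{").replace(")", "}")    # (2,4) = repeat count
        regex.append(prefix + element + suffix)
    return "".join(regex)

glyco = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))
for seq in ("AANGSAA", "AANPSAA"):   # made-up test sequences
    print(seq, bool(glyco.search(seq)))
```

Scanning a sequence file with such an expression is a quick way to restrict a search to entries that contain a given motif before running a more expensive alignment.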
PSSM (Position-Specific Scoring Matrix)
PSI-Blast

            • The power of profile methods can be
              further enhanced through iteration of
              the search procedure.
              – After a profile is run against a
                database, new similar sequences can be
                detected. A new multiple alignment, which
                includes these sequences, can be
                constructed, a new profile abstracted, and
                a new database search performed.
              – The procedure can be iterated as often as
                desired or until convergence, when no new
                statistically significant sequences are
                detected.
PSI-Blast
            (1) PSI-BLAST takes as an input a single protein sequence
                and compares it to a protein database, using the gapped
                BLAST program.
            (2) The program constructs a multiple alignment, and then a
                profile, from any significant local alignments found.
               The original query sequence serves as a template for the multiple
               alignment and profile, whose lengths are identical to that of the
               query. Different numbers of sequences can be aligned in different
               template positions.
            (3) The profile is compared to the protein database, again
                seeking local alignments using the BLAST algorithm.

            (4) PSI-BLAST estimates the statistical significance of the local
                alignments found.
               Because profile substitution scores are constructed to a fixed
               scale, and gap scores remain independent of position, the
               statistical theory and parameters for gapped BLAST alignments
               remain applicable to profile alignments.
            (5) Finally, PSI-BLAST iterates, by returning to step (2), a
                specified number of times or until convergence.
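The control flow of steps (1) to (5) can be sketched as a loop that stops at convergence. Everything here is a toy: the "profile" is just a set of sequence names, and the database is a hypothetical lookup table stating which sequences each profile member can detect. Real PSI-BLAST builds a PSSM and uses E-value thresholds instead.

```python
# Hypothetical toy database: which sequences each member can detect.
NEIGHBOURS = {"q": {"a"}, "a": {"b"}, "b": {"c"}, "c": set()}

def toy_search(profile):
    """Return every sequence detectable from any member of the profile."""
    return set().union(*(NEIGHBOURS[member] for member in profile))

def psi_iterate(query, search, max_rounds=5):
    """Iterate search -> rebuild profile until no new hits appear."""
    profile = {query}                  # step 1: start from the single query
    hits = set()
    for rounds in range(1, max_rounds + 1):
        found = search(profile)        # steps 1 and 3: search the database
        new = found - hits - {query}
        if not new:                    # step 5: converged, nothing new found
            return hits, rounds
        hits |= new
        profile = {query} | hits       # step 2: rebuild the profile from all hits
    return hits, max_rounds

print(psi_iterate("q", toy_search))
```

In the toy data the query only detects "a" directly. "b" and "c" are reached through intermediate hits, which is both the strength of iteration and the reason a single false positive can drag unrelated sequences into later rounds.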
PSI-BLAST




[Figure: PSI-BLAST scheme with the PSSM labeled twice]


From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
PSI-BLAST pitfalls




     • Avoid sequences that are too close: overfitting!
     • False homologues can be included! Therefore check
       the matches carefully: include or exclude
       sequences based on biological knowledge.
     • The E-value reflects the significance of the
       match to the previous training set, not to the
       original sequence!
     • Choose your query sequence carefully.
     • Try the reverse experiment to confirm the result.
Reduce overfitting risk by Cobbler

                   • A single sequence is selected
                     from a set of blocks and enriched
                     by replacing the conserved
                     regions delineated by the blocks
                     by consensus residues derived
                     from the blocks.
                   • Embedding consensus residues
                     improves performance
                   • S. Henikoff and J.G. Henikoff;
                     Protein Science (1997) 6:698-705.
PHI-Blast
(Pattern-Hit Initiated BLAST)
From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
Installing Blast Locally


• 2 flavors: NCBI/WuBlast
• Executables:
     – ftp://ftp.ncbi.nih.gov/blast/executables/
• Database:
     – ftp://ftp.ncbi.nih.gov/blast/db/
• Formatdb
     – formatdb -i ecoli.nt -p F
     – formatdb -i ecoli.protein -p T
• For options: blastall -
     – blastall -p blastp -i query -d database -o output
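When many searches have to be run, the two commands above can be assembled in a script. The sketch below only builds the command lines, using the flags shown on the slide. Actually running them requires the legacy NCBI executables to be installed, which is assumed here, not checked.

```python
import shlex

def formatdb_command(fasta, is_protein):
    """Command to index a FASTA file (-p T for protein, -p F for nucleotide)."""
    return ["formatdb", "-i", fasta, "-p", "T" if is_protein else "F"]

def blastall_command(program, query, database, output):
    """Command to search `database` with `query` using the given BLAST program."""
    return ["blastall", "-p", program, "-i", query, "-d", database, "-o", output]

commands = [
    formatdb_command("ecoli.protein", is_protein=True),
    # "query" and "output" are placeholder file names, as on the slide.
    blastall_command("blastp", "query", "ecoli.protein", "output"),
]
for command in commands:
    print(shlex.join(command))
    # To execute: subprocess.run(command, check=True)
```

Keeping each command as a list of arguments avoids quoting problems with file names and can be passed unchanged to `subprocess.run`.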
Main database: BLAT

               • BLAT: BLAST-Like Alignment Tool
               • Aligns the input sequence to the
                 Human Genome
               • Connected to several databases, like:
                      –   mRNAs
                      –   ESTs
                      –   RepeatMasker
                      –   RefSeq
                      –   GenScan
                      –   TwinScan
                      –   UniGene
                      –   CpG Islands
BLAT Human Genome Browser
BLAT method

              • Align sequence with BLAT, get alignment
                info
              • Per BLAT hit, pick up additional info from
                connected databases:
                 –   mRNAs
                 –   ESTs
                 –   RepeatMasker
                 –   CpG Islands
                 –   RefSeq Genes
Weblems

          W5.1: Submit the amino acid sequence of papaya
           papain to a BLAST search (gapped and ungapped) and to a
           PSI-BLAST search. What are the main differences in
           the results?
          W5.2: Is there a relationship between Klebsiella
           aerogenes urease, Pseudomonas diminuta
           phosphotriesterase and mouse adenosine deaminase?
           Also use DALI, ClustalW and T-Coffee.
          W5.3: Yeast two-hybrid typically yields DNA
           sequences. How would you find the corresponding
           protein?
          W5.4: When and why would you use tblastn?
          W5.5: How would you search a database if you want to
           restrict the search space to those entries having a
           secretion signal consisting of 4 consecutive
           (N-terminal) basic residues?

Bioinformatica t5-database searching

  • 1.
  • 2. FBW 23-10-2012 Wim Van Criekinge
  • 4.
  • 5. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast BLAT
  • 6. Needleman-Wunsch-edu.pl The Score Matrix ---------------- Seq1(j)1 2 3 4 5 6 7 Seq2 * C K H V F C R (i) * 0 -1 -2 -3 -4 -5 -6 -7 1 C -1 1 a 0 -1 -2 -3 -4 -5 2 K -2 0c 2b 1 0 -1 -2 -3 3 K -3 -1 1 1 0 -1 -2 -3 4 C -4 -2 matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH A: 0 0 0 -1 0 -1 5 F -5 -3 -1(substr(seq1,j-1,1) eq substr(seq2,i-1,1) if -1 -1 1 0 -1 6 C -6 -4 up_score = matrix(i-1,j) + GAP 2 B: -2 -2 -2 0 1 7 K -7 -5 -3 -3 -3 -1 1 1 8 C -8 -6 left_score =-4 C: -4 matrix(i,j-1) +-2 -4 GAP 0 0 9 V -9 -7 -5 -5 -3 -3 -1 -1
  • 7. Multiple Alignment Method • The most practical and widely used method in multiple sequence alignment is the hierarchical extensions of pairwise alignment methods. • The principal is that multiple alignments is achieved by successive application of pairwise methods. – First do all pairwise alignments (not just one sequence with all others) – Then combine pairwise alignments to generate overall alignment
  • 8. Database Searching • Consider the task of searching SWISS-PROT against a query sequence: – say our query sequence is 362 amino- acids long – SWISS-PROT release 38 contains 29,085,265 amino acids – finding local alignments via dynamic programming would entail O(1010) matrix operations • Given size of databases, more efficient methods needed
  • 9. Heuristic approaches to DP for database searching FASTA (Pearson 1995) BLAST (Altschul 1990, 1997) Uses heuristics to avoid Uses rapid word lookup calculating the full dynamic methods to completely skip programming matrix most of the database entries Speed up searches by an order of magnitude Extremely fast compared to full Smith- One order of magnitude Waterman faster than FASTA Two orders of magnitude faster than Smith- The statistical side of FASTA is Waterman still stronger than BLAST Almost as sensitive as FASTA
  • 10. FASTA « Hit and extend heuristic» • Problem: Too many calculations “wasted” by comparing regions that have nothing in common • Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical • Basic method: Look for similar regions only near short stretches that match exactly
  • 11. FASTA-Stages 1. Find k-tups in the two sequences (k=1,2 for proteins, 4-6 for DNA sequences) 2. Score and select top 10 scoring “local diagonals” 3. Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores. 4. Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score 5. After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.
  • 12.
  • 13.
  • 14. FastA • Sensitivity: the ability of a program to identify weak but biologically significant sequence similarity. • Selectivity: the ability of a program to discriminate between true matches and matches occurring by chance alone. – A decrease in selectivity results in more false positives being reported.
  • 15. FastA (http://www.ebi.ac.uk/fasta33/) Gap opening penalty Blosum50 -12, -16 by default default. for fasta with Lower PAM proteins and DNA, higher blosum respectively to detect close sequences Gap extension Higher PAM and penalty -2, -4 by lower blosum default for fasta to detect distant with proteins and sequences DNA, respectively The larger the Max number of word-length the scores and less sensitive, but alignments is 100 faster the search will be
  • 16. FastA Output Initn, init1, opt, z- score calculated during run Database E score - code expectation hyperlinked value, how to the SRS many hits are database at expected to be EBI found by chance with such a score while comparing this query to this database. E() does not represent the Accession Description Length % similarity number
  • 17. FastA is a family of programs FastA, TFastA, FastX, FastY Query: DNA Protein Database:DNA Protein
  • 18. FASTA problems FASTA can miss significant similarity since – For proteins, similar sequences do not have to share identical residues • Asp-Lys-Val is quite similar to • Glu-Arg-Ile yet it is missed even with ktuple size of 1 since no amino acid matches • Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly but there is no match with ktuple size of 2
  • 19. FASTA problems FASTA can miss significant similarity since – For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy where X’s are conserved and y’s are not • GGuUCuACgAAg and GGcUCcACaAAA both code for the same peptide sequence (Gly-Ser- Thr-Lys) but they don’t match with ktuple size of 3 or higher
  • 20. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast Blast
  • 21. BLAST - Basic Local Alignment Search Tool
  • 22. What does BLAST do? • Search a large target set of sequences... • …for hits to a query sequence... • …and return the alignments and scores from those hits... • Do it fast. Show me those sequences that deserve a second look. Blast programs were designed for fast database searching, with minimal sacrifice of sensitivity to distant related sequences.
  • 23. The big red button Do My Job It is dangerous to hide too much of the underlying complexity from the scientists.
  • 24. Overview • Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T • Key concept “Neigborhood”: Seems similar to FASTA, but we are searching for words which score above T rather than that match exactly • Calculate neigborhood (T) for
  • 25. Overview Compile a list of words which give a score of at least T when paired with the query sequence.
      – Example using PAM-120 for query sequence ACDE (w=4, T=17):
          A C D E
          A C D E = +3 +9 +5 +5 = 22
      • try all possibilities:
          A A A A = +3 -3 0 0 = 0 no good
          A A A C = +3 -3 0 -7 = -7 no good
      • ...too slow, try directed change
  • 26. Overview
          A C D E
          A C D E = +3 +9 +5 +5 = 22
      • change 1st pos. to all acceptable substitutions
          g C D E = +1 +9 +5 +5 = 20 ok
          n C D E = +0 +9 +5 +5 = 19 ok
          i C D E = -1 +9 +5 +5 = 18 ok
          k C D E = -2 +9 +5 +5 = 17 ok
      • change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13
      • change 3rd pos. in combination with first position
          g C n E = +1 +9 +2 +5 = 17 ok
      • continue - use recursion
      • For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence
  • 27. Neighborhood.pl
      # Calculate neighborhood of a three-letter word.
      # Assumed to be defined earlier in the script: %S (scoring matrix),
      # @A (alphabet), @W (letters of the query word), $T (threshold).
      my %NH;
      for (my $i = 0; $i < @A; $i++) {
          my $s1 = $S{$W[0]}{$A[$i]};
          for (my $j = 0; $j < @A; $j++) {
              my $s2 = $S{$W[1]}{$A[$j]};
              for (my $k = 0; $k < @A; $k++) {
                  my $s3 = $S{$W[2]}{$A[$k]};
                  my $score = $s1 + $s2 + $s3;
                  my $word = "$A[$i]$A[$j]$A[$k]";
                  next if $word =~ /[BZX*]/;    # skip ambiguity codes and stops
                  $NH{$word} = $score if $score >= $T;
              }
          }
      }
      # Output neighborhood, best score first, ties in alphabetical order
      foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) {
          print "$word $NH{$word}\n";
      }
  • 28.
      BLOSUM62, word RGD, T = 11    PAM200, word RGD, T = 13
      RGD 17                        RGD 18
      KGD 14                        RGE 17
      QGD 13                        RGN 16
      RGE 13                        KGD 15
      EGD 12                        RGQ 15
      HGD 12                        KGE 14
      NGD 12                        HGD 13
      RGN 12                        KGN 13
      AGD 11                        RAD 13
      MGD 11                        RGA 13
      RAD 11                        RGG 13
      RGQ 11                        RGH 13
      RGS 11                        RGK 13
      RND 11                        RGS 13
      RSD 11                        RGT 13
      SGD 11                        RSD 13
      TGD 11                        WGD 13
  • 29.
  • 30. [Figure: stages of a BLAST search. Labels: indexed words; hit = two non-overlapping HSPs on a diagonal within distance A; length of extension; trim to max score S]
  • 31. [Figure: same labels as slide 30]
  • 32. The BLAST algorithm
      • Break the search sequence into words
          – W = 3 for proteins, W = 12 for DNA
          – MCGPFILGTYC: MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC
      • Include in the search all words that score above a certain value (T) for any search word
          – MCG: MCG, MCT, MCN, …
          – CGP: CGP, MGP, CTP, …
          – This list can be computed in linear time
  • 33. The Blast Algorithm (2) • Search for the words in the database – Word locations can be precomputed and indexed – Searching for a short string in a long string • HSP (High Scoring Pair) = A match between a query word and the database • Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A • Extend the hit until the score falls below a threshold value, S
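The indexing and two-hit steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the NCBI implementation: it looks up exact query words only (no neighborhood words), treats the database as one sequence, and the sub names and the value 40 for the distance A are placeholders chosen for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Index every word of length $w in the database sequence: word => [start positions]
sub index_words {
    my ($db, $w) = @_;
    my %index;
    for my $pos (0 .. length($db) - $w) {
        push @{ $index{ substr($db, $pos, $w) } }, $pos;
    }
    return \%index;
}

# Look up each query word in the index and group the matches by diagonal
# (database position minus query position).
sub matches_by_diagonal {
    my ($query, $index, $w) = @_;
    my %diag;
    for my $qpos (0 .. length($query) - $w) {
        my $positions = $index->{ substr($query, $qpos, $w) } or next;
        push @{ $diag{ $_ - $qpos } }, $qpos for @$positions;
    }
    return \%diag;
}

# Diagonals that carry two non-overlapping word matches at most $max_dist apart
sub two_hit_diagonals {
    my ($diag, $w, $max_dist) = @_;
    my @found;
    DIAG: for my $d (sort { $a <=> $b } keys %$diag) {
        my @q = sort { $a <=> $b } @{ $diag->{$d} };
        for my $i (0 .. $#q - 1) {
            for my $j ($i + 1 .. $#q) {
                my $dist = $q[$j] - $q[$i];
                if ($dist >= $w && $dist <= $max_dist) {
                    push @found, $d;
                    next DIAG;
                }
            }
        }
    }
    return @found;
}

my $word_size = 3;
my $db_index  = index_words('AAMCGPFILGTYCAA', $word_size);
my $by_diag   = matches_by_diagonal('MCGPFILGTYC', $db_index, $word_size);
print "two-hit diagonals: @{[ two_hit_diagonals($by_diag, $word_size, 40) ]}\n";
```

Because the index is built once per database, each query word costs a single hash lookup, which is the point of precomputing word locations.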
  • 34.
  • 35. BLAST parameters • Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set. • Choosing a value for w – small w: many matches to expand – big w: many words to be generated – w=4 is a good compromise • Lowering the segment extension cutoff (S) returns longer extensions for each hit. • Changing the minimum E-value changes the threshold for reporting a hit.
  • 36. Critical parameters: T, W and scoring matrix • The proper value of T depends on both the values in the scoring matrix and the balance between speed and sensitivity • Higher values of T progressively remove more word hits and reduce the search space. • A word size (W) of 1 will produce more hits than a word size of 10. In general, if T is scaled uniformly with W, smaller word sizes increase sensitivity and decrease speed. • The interplay between W, T and the scoring matrix is critical, and choosing them wisely is the most effective way of controlling the speed and sensitivity of BLAST
  • 37. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extensions PSI-Blast PHI-Blast Local Blast BLAT
  • 38. Database Searching • How can we find a particular short sequence in a database of sequences (or one HUGE sequence)? • Problem is identical to local sequence alignment, but on a much larger scale. • We must also have some idea of the significance of a database hit. – Databases always return some kind of hit, how much attention should be paid to the result? • How can we determine how “unusual” a particular alignment score is?
  • 39. Significance
      Sentence 1: “These algorithms are trying to find the best way to match up two sequences”
      Sentence 2: “This does not mean that they will find anything profound”
      ALIGNMENT:
      THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
      :: :.. . .. ...: : ::::.. :: . : ...
      THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------
      12 exact matches, 14 conservative substitutions
      Is this a good alignment?
  • 40. Overview • A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T • This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability
  • 41. Mathematical Basis of BLAST • Model matches as a sequence of coin tosses • Let p be the probability of a “head” – For a “fair” coin, p = 0.5 • (Erdős-Rényi) If there are n throws, then the expected length R of the longest run of heads is $R = \log_{1/p}(n)$. • Example: Suppose n = 20 for a “fair” coin: $R = \log_2(20) = 4.32$ • Trick is how to model DNA (or amino acid) sequence alignments as coin tosses.
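The coin-toss example can be reproduced numerically. The sketch below computes the Erdős-Rényi estimate and, for comparison, simulates the longest run directly; the sub names and the number of trials are arbitrary choices. The formula is an asymptotic estimate, so for n as small as 20 the simulated mean need not coincide with it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Erdos-Renyi estimate: R = log base 1/p of n = ln(n) / ln(1/p)
sub expected_longest_run {
    my ($n, $p) = @_;
    return log($n) / log(1 / $p);
}

# Longest run of heads in one simulated series of $n tosses
sub simulated_longest_run {
    my ($n, $p) = @_;
    my ($run, $best) = (0, 0);
    for (1 .. $n) {
        if (rand() < $p) {
            $run++;
            $best = $run if $run > $best;
        }
        else {
            $run = 0;
        }
    }
    return $best;
}

printf "estimate for n = 20, p = 0.5: %.2f\n", expected_longest_run(20, 0.5);

my $trials = 10_000;
my $sum    = 0;
$sum += simulated_longest_run(20, 0.5) for 1 .. $trials;
printf "simulated mean over %d trials: %.2f\n", $trials, $sum / $trials;
```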
  • 42. Mathematical Basis of BLAST • To model random sequence alignments, replace a match with a “head” and a mismatch with a “tail”.
          AATCAT
          HTHHHT
          ATTCAG
      • For DNA, the probability of a “head” is 1/4 – What is it for amino acid sequences?
  • 43. Mathematical Basis of BLAST • So, for one particular alignment, the Erdős-Rényi property can be applied • What about for all possible alignments? – Consider that sequences are being shifted back and forth, dot matrix plot • The expected length of the longest match is $R = \log_{1/p}(mn)$ where m and n are the lengths of the two sequences.
  • 44. Analytical derivation Erdős-Rényi … … … Karlin-Altschul
  • 45. Karlin-Altschul Statistics $E = k m n e^{-\lambda S}$ This equation states that the number of alignments expected by chance (E) during the sequence database search is a function of the size of the search space (m·n), the normalized score (λS) and a minor constant (k, mostly 0.1). The E-value grows linearly with the product of target and query sizes: doubling the target set size and doubling the query length have the same effect on the E-value.
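A short numeric sketch of the Karlin-Altschul equation, E = k·m·n·e^(-λS), shows the linear dependence on search space. The value k = 0.1 comes from the slide; λ = 0.3, the raw score 60 and the sequence lengths are illustrative assumptions, not parameters of any particular scoring system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# E = k * m * n * exp(-lambda * S)
sub expected_hits {
    my ($k, $lambda, $m, $n, $S) = @_;
    return $k * $m * $n * exp(-$lambda * $S);
}

my ($k, $lambda) = (0.1, 0.3);    # lambda is an assumed, illustrative value
my $base         = expected_hits($k, $lambda, 250, 1_000_000, 60);
my $long_query   = expected_hits($k, $lambda, 500, 1_000_000, 60);  # query twice as long
my $big_target   = expected_hits($k, $lambda, 250, 2_000_000, 60);  # target set twice as large

printf "E = %.3g, doubled query: %.3g, doubled target: %.3g\n",
    $base, $long_query, $big_target;
```

Both doubled cases print the same value, exactly twice the first, while raising S by a fixed amount divides E by a constant factor.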
  • 46. Analytical derivation Erdős-Rényi $R = \log_{1/p}(mn)$ … … … Karlin-Altschul $E = k m n e^{-\lambda S}$
  • 47. Scoring alignments • Score: S (~R) – $S = \sum_i M(q_i, t_i) - \text{gaps}$ • Any alignment has a score • Any two sequences have a(t least one) optimal alignment
  • 48. • For a particular scoring matrix and its associated gap initiation and extension costs one must calculate λ and k • Unfortunately (for gapped alignments), you can’t do this analytically and the values must be estimated empirically – The procedure involves aligning random sequences (Monte Carlo approach) with a specific scoring scheme and observing the alignment properties (scores, target frequencies and lengths)
  • 49. Significance “Monte Carlo” Approach: • Compares result to randomized result, similarly to results generated by a roulette wheel at Monte Carlo • Typical procedure for alignments – Randomize sequence A – Align to sequence B – Repeat many times (hundreds) – Keep track of optimal score • Histogram of scores …
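The randomize, align, repeat procedure can be sketched with a score-only Needleman-Wunsch. This is a hedged illustration rather than the course's Needleman-wunsch-Monte-Carlo.pl: the scoring values (match +1, mismatch -1, gap -1), the two example sequences and the number of repeats are all assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle max);

my ($MATCH, $MISMATCH, $GAP) = (1, -1, -1);    # assumed scoring scheme

# Global alignment score; only two rows of the matrix are kept in memory
sub nw_score {
    my ($s1, $s2) = @_;
    my @prev = map { $_ * $GAP } 0 .. length($s1);
    for my $i (1 .. length($s2)) {
        my @cur = ($i * $GAP);
        for my $j (1 .. length($s1)) {
            my $same = substr($s1, $j - 1, 1) eq substr($s2, $i - 1, 1);
            my $diag = $prev[$j - 1] + ($same ? $MATCH : $MISMATCH);
            $cur[$j] = max($diag, $prev[$j] + $GAP, $cur[$j - 1] + $GAP);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Same residues, random order
sub shuffled {
    my ($seq) = @_;
    return join '', shuffle(split //, $seq);
}

# Shuffle A, align to B, repeat; returns score => count
sub monte_carlo_scores {
    my ($seq_a, $seq_b, $repeats) = @_;
    my %hist;
    $hist{ nw_score(shuffled($seq_a), $seq_b) }++ for 1 .. $repeats;
    return \%hist;
}

my ($seq_a, $seq_b) = ('CKHVFCR', 'CKKCFCKCV');    # example sequences
my $hist = monte_carlo_scores($seq_a, $seq_b, 200);
printf "%4d %s\n", $_, '*' x $hist->{$_} for sort { $a <=> $b } keys %$hist;
printf "score of the unshuffled pair: %d\n", nw_score($seq_a, $seq_b);
```

Comparing the unshuffled score with the histogram gives a first impression of how unusual it is; the slides that follow explain why that histogram should not be summarized with a normal distribution.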
  • 50. Assessing significance requires a distribution • I have a pumpkin of diameter 1m. Is that unusual? [Figure: histogram of frequency against diameter (m)]
  • 51.
  • 52.
  • 53. Significance Normal Distribution does NOT Fit Alignment Scores !! • In seeking optimal Alignments between two sequences, one desires those that have the highest score - i.e. one is seeking a distribution of maxima • In seeking optimal Matches between an Input Sequence and Sequence Entries in a Database, one again desires the matches that have the highest score, and these are obtained via examination of the distribution of such scores for the entries in the database - this is again a distribution of maxima. “A Normal Distribution is a distribution of Sums of independent variables rather than a sum of their Maxima.“
  • 54. Comparing distributions Gaussian: $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ Extreme value: $f(x) = e^{-x} e^{-e^{-x}}$
  • 55. Alignment scores follow extreme value distributions Alignment of unrelated/random sequences results in scores following an extreme value distribution: $P(x \ge S) = 1 - \exp(-k m n e^{-\lambda S})$, where m, n are the sequence lengths and k, λ are free parameters. With $E = k m n e^{-\lambda S}$ this reads $P = 1 - e^{-E}$, or equivalently $E = -\ln(1 - P)$. This can be shown analytically for ungapped alignments and has been found empirically to also hold for gapped alignments under commonly used conditions.
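The conversion between E and P, P = 1 - e^(-E) and E = -ln(1 - P), is a one-liner each way. The sub names below are made up for this sketch; the loop simply tabulates a few E-values to show that P and E are nearly equal when E is small and that P saturates at 1 as E grows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# P = 1 - exp(-E): probability of at least one chance hit
sub p_from_e {
    my ($e_value) = @_;
    return 1 - exp(-$e_value);
}

# E = -ln(1 - P): inverse relation, defined for P < 1
sub e_from_p {
    my ($p_value) = @_;
    return -log(1 - $p_value);
}

for my $e_value (0.001, 0.01, 0.1, 1, 10) {
    printf "E = %-6g  P = %.6f\n", $e_value, p_from_e($e_value);
}
```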
  • 56. Alignment scores follow extreme value distributions Alignment algorithms will always produce alignments, regardless of whether they are meaningful or not => it is important to have a way of selecting significant alignments from a large set of database hits. Solution: fit the distribution of scores from the database search to an extreme value distribution; determine the p-value of a hit from this fitted distribution. Example: scores fitted to extreme value distribution. 99.9% of this distribution is located below score=112 => a hit with score = 112 has a p-value of 0.1%
  • 57. Significance • BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores. • For this reason BLAST only allows certain combinations of substitution matrices and gap penalties. • This also means that the fit is based on a different data set than the one you are working on. • A word of caution: BLAST tends to overestimate the significance of its matches. • E-values from BLAST are fine for identifying sure hits. • One should be careful using BLAST’s E-values to judge whether a marginal hit can be trusted (e.g., you may want to use E-values of 10^-4 to 10^-5).
  • 58. Determining P-values • If we can estimate the position of the maximum and the width of the score distribution, then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database. • For sequence matches, a scoring system and database can be parameterized by two parameters, k and λ, related to that position and width. – It would be nice if we could compare hit significance without regard to the scoring system used!
  • 59. Bit Scores • The expected number of hits with score at least S is: $E = K m n e^{-\lambda S}$ – where m and n are the sequence lengths • Normalize the raw score using: $S' = \frac{\lambda S - \ln K}{\ln 2}$ • This yields a “bit score” S’, with a standard set of units. • The new E-value is: $E = m n 2^{-S'}$
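The bit-score normalization, S' = (λS - ln K) / ln 2 with E = m·n·2^(-S'), can be checked against the raw-score formula E = K·m·n·e^(-λS). In this sketch the values of λ, K, the raw score and the lengths are illustrative assumptions; the point is only that both routes give the same E-value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# S' = (lambda * S - ln K) / ln 2
sub bit_score {
    my ($raw, $lambda, $K) = @_;
    return ($lambda * $raw - log($K)) / log(2);
}

# E = m * n * 2^(-S')
sub evalue_from_bits {
    my ($bits, $m, $n) = @_;
    return $m * $n * 2 ** (-$bits);
}

# E = K * m * n * exp(-lambda * S)
sub evalue_from_raw {
    my ($raw, $lambda, $K, $m, $n) = @_;
    return $K * $m * $n * exp(-$lambda * $raw);
}

my ($lambda, $K) = (0.3, 0.1);        # assumed, illustrative parameters
my ($m, $n)      = (250, 1_000_000);  # query and database lengths
my $bits         = bit_score(60, $lambda, $K);

printf "bit score: %.2f\n", $bits;
printf "E from bit score: %.4g\n", evalue_from_bits($bits, $m, $n);
printf "E from raw score: %.4g\n", evalue_from_raw(60, $lambda, $K, $m, $n);
```

Once a score is expressed in bits, λ and K are no longer needed to compute E, which is what makes bit scores comparable across scoring systems.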
  • 60. Needleman-wunsch-Monte-Carlo.pl (Average around -64 !)
      -74
      -73
      -72 *
      -71 *****
      -70 *******
      -69 **********
      -68 ***************
      -67 *************************
      -66 *************************
      -65 ************************************
      -64 *****************************************
      -63 ************************************************************
      -61 ************************
      -60 *****************************
      -59 *******************
      -58 **************
      -57 *********
      -56 ********
      -55 *****
      -54 ****
      -53 *
      -52 *
      -51 *
      -50
      -49
  • 61. FastA Output • The distribution of scores: a graph of the frequency of observed scores • expected curve (asterisks) according to the extreme value distribution – the theoretical curve should be similar to the observed results • deviations indicate that the fitting parameters are wrong – gap penalties too weak – compositional biases
  • 62. FastA Output
      < 20 222 0 :*
      22 30 0 :*
      24 18 1 :*
      26 18 15 :*
      28 46 159 :*
      30 207 963 :*
      32 1016 3724 := *
      34 4596 10099 :==== *
      36 9835 20741 :========= *
      38 23408 34278 :==================== *
      40 41534 47814 :=================================== *
      42 53471 58447 :============================================ *
      44 73080 64473 :====================================================*=======
      46 70283 65667 :=====================================================*====
      48 64918 62869 :===================================================*==
      50 65930 57368 :===============================================*=======
      52 47425 50436 :======================================= *
      54 36788 43081 :=============================== *
      56 33156 35986 :============================ *
      58 26422 29544 :====================== *
      60 21578 23932 :================== *
      62 19321 19187 :===============*
      64 15988 15259 :============*=
      66 14293 12060 :=========*==
      68 11679 9486 :=======*==
      70 10135 7434 :======*==
  • 63. FastA Output
      72 8957 5809 :====*=== Related
      74 7728 4529 :===*===
      76 6176 3525 :==*===
      78 5363 2740 :==*==
      80 4434 2128 :=*==
      82 3823 1628 :=*==
      84 3231 1289 :=*=
      86 2474 998 :*==
      88 2197 772 :*=
      90 1716 597 :*=
      92 1430 462 :*= :===============*========================
      94 1250 358 :*= :============*===========================
      96 954 277 :* :=========*=======================
      98 756 214 :* :=======*===================
      100 678 166 :* :=====*==================
      102 580 128 :* :====*===============
      104 476 99 :* :===*=============
      106 367 77 :* :==*==========
      108 309 59 :* :==*========
      110 287 46 :* :=*========
      112 206 36 :* :=*======
      114 161 28 :* :*=====
      116 144 21 :* :*====
      118 127 16 :* :*====
      >120 886 13 :* :*==============================
  • 64. FastA Output • A summary of the statistics and of the program parameters follows the histogram. – An important number in this summary is the Kolmogorov-Smirnov statistic, which indicates how well the actual data fit the theoretical statistical distribution. The lower this value, the better the fit, and the more reliable the statistical estimates. – In general, a Kolmogorov-Smirnov statistic under 0.1 indicates a good fit with the theoretical model. If the statistic is higher than 0.2, the statistics may not be valid, and it is recommended to repeat the search, using more stringent (more negative) values for the gap penalty parameters.
  • 65. Statistics summary • Optimal local alignment scores for pairs of random amino acid sequences of the same length follow an extreme-value distribution. For any score S, the probability of observing a score >= S is given by the Karlin-Altschul statistic: $P(\text{score} \ge S) = 1 - \exp(-k m n e^{-\lambda S})$ • k and λ are parameters related to the position of the maximum and the width of the distribution. • Note the long tail at the right. This means that a score several standard deviations above the mean has a higher probability of arising by chance (that is, it is less significant) than if the scores followed a normal distribution.
  • 66. P-values • Many programs report P = the probability that the alignment is no better than random. The relationship between Z and P depends on the distribution of the scores from the control population, which do NOT follow the normal distribution – P <= 10^-100 (exact match) – P in range 10^-100 to 10^-50 (sequences nearly identical, e.g. alleles or SNPs) – P in range 10^-50 to 10^-10 (closely related sequences, homology certain) – P in range 10^-5 to 10^-1 (usually distant relatives) – P > 10^-1 (match probably insignificant)
  • 67. E • For database searches, most programs report E-values. The E-value of an alignment is the expected number of sequences that give the same Z-score or better if the database is probed with a random sequence. E is found by multiplying the value of P by the size of the database probed. Note that E, but not P, depends on the size of the database. Values of P are between 0 and 1. Values of E are between 0 and the number of sequences in the database searched: – E <= 0.02: sequences probably homologous – E between 0.02 and 1: homology cannot be ruled out – E > 1: a match this good would be expected just by chance
  • 68. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extensions PSI-Blast PHI-Blast Local Blast Blast
  • 69. Blast BLAST is actually a family of programs: • BLASTN - Nucleotide query searching a nucleotide database. • BLASTP - Protein query searching a protein database. • BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database. • TBLASTN - Protein query searching a translated nucleotide (6 frames) database. • TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.
  • 70. Blast
  • 71. Blast
  • 72. Blast
  • 73. Blast
  • 74. Blast
  • 75. Blast
  • 76. Blast
  • 77.
  • 78.
  • 79.
  • 80.
  • 81.
  • 82.
  • 83.
  • 84.
  • 85. Tips • Be aware of what options you have selected when using BLAST, or FASTA implementations. • Treat BLAST searches as scientific experiments • So you should try your searches with the filters on and off to see whether it makes any difference to the output
  • 86. Tips: Low-complexity and Gapped Blast Algorithm • The common, Web-based ones often have default settings that will affect the outcome of your searches. By default all NCBI BLAST implementations filter out biased sequence composition from your query sequence (e.g. signal peptide and transmembrane sequences - beware!). • The SEG program has been implemented as part of the blast routine in order to mask low-complexity regions • Low-complexity regions are denoted by strings of Xs in the query sequence
  • 87. Tips • The sequence databases contain a wealth of information. They also contain a lot of errors. Contaminants … • Annotation errors, frameshifts that may result in erroneous conceptual translations. • Hypothetical proteins ? • In the words of Fox Mulder, "Trust no one."
  • 88. Tips • Once you get a match to things in the databases, check whether the match is to the entire protein, or to a domain. Don't immediately assume that a match means that your protein carries out the same function (see above). Compare your protein and the match protein(s) along their entire lengths before making this assumption.
  • 89. Tips • Domain matches can also cause problems by hiding other informative matches. For instance if your protein contains a common domain you'll get significant matches to every homologous sequence in the database. BLAST only reports back a limited number of matches, ordered by P value. • If this list consists only of matches to the same domain, cut this bit out of your query sequence and do the BLAST search again with the edited sequence (e.g. NHR).
  • 90. Tips • Do controls wherever possible. In particular when you use a particular search software for the first time. • Suitable positive controls would be protein sequences known to have distant homologues in the databases to check how good the software is at detecting such matches. • Negative controls can be employed to make sure the compositional bias of the sequence isn't giving you false positives. Shuffle your query sequence and see what difference this makes to the matches that are returned. A real match should be lost upon shuffling of your sequence.
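  • 91. Tips • Perform Controls
      #!/usr/bin/perl -w
      use strict;
      # Shuffle the residues of one FASTA record read from a file or STDIN
      my ($def, @seq) = <>;             # definition line, then sequence lines
      print $def;
      chomp @seq;
      @seq = split(//, join("", @seq));
      my $count = 0;
      while (@seq) {
          my $index = rand(@seq);       # random position (truncated to an integer by splice)
          my $base = splice(@seq, $index, 1);
          print $base;
          print "\n" if ++$count % 60 == 0;    # 60 residues per line
      }
      print "\n" unless $count % 60 == 0;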
  • 92. Tips • Read the footer first • View results graphically • Parse Blasts with Bioperl
  • 93. FastA vs. Blast • BLAST's major advantage is its speed. – 2-3 minutes for BLAST versus several hours for a sensitive FastA search of the whole of GenBank. • When both programs use their default setting, BLAST is usually more sensitive than FastA for detecting protein sequence similarity. – Since it doesn't require a perfect sequence match in the first stage of the search.
  • 94. FastA vs. Blast Weakness of BLAST: – The long word size it uses in the initial stage of DNA sequence similarity searches was chosen for speed, and not sensitivity. – For a thorough DNA similarity search, FastA is the program of choice, especially when run with a lowered KTup value. – FastA is also better suited to the specialised task of detecting genomic DNA regions using a cDNA query sequence, because it allows the use of a gap extension penalty of 0. BLAST, which only creates ungapped alignments, will usually detect only the longest exon, or fail altogether. • In general, a BLAST search using the default parameters should be the first step in a database similarity search strategy. In many cases, this is all that may be required to yield all the information needed, in a very short time.
  • 95. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extensions PSI-Blast PHI-Blast Local Blast BLAT
  • 96. PSI-Blast 1. Old (ungapped) BLAST 2. New BLAST (allows gaps) 3. Profile -> PSI-BLAST (Position-Specific Iterated). Strategy: • Multiple alignment of the hits • Calculates a position-specific score matrix (PSSM) • Searches with this matrix. In many cases it is much more sensitive to weak but biologically relevant sequence similarities.
  • 97. PSI-Blast • Patterns of conservation from the alignment of related sequences can aid the recognition of distant similarities. – These patterns have been variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models. For each position in the derived pattern, every amino acid is assigned a score. (1) Highly conserved residue at a position: that residue is assigned a high positive score, and others are assigned high negative scores. (2) Weakly conserved positions: all residues receive scores near zero. (3) Position-specific scores can also be assigned to potential insertions and deletions.
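How position-specific scores follow from conservation can be illustrated with a toy matrix. The sketch below is not PSI-BLAST's weighting scheme: it assumes an ungapped alignment, a uniform background frequency of 1/20, a flat pseudocount, and log-odds scores in bits; the four aligned sequences are made-up examples.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @AA = split //, 'ACDEFGHIKLMNPQRSTVWY';

# Build one score table per alignment column:
# score = log2(observed frequency / uniform background frequency)
sub build_pssm {
    my ($pseudo, @aln) = @_;
    my $n_aa = scalar @AA;
    my @pssm;
    for my $col (0 .. length($aln[0]) - 1) {
        my %count = map { ($_ => $pseudo) } @AA;
        for my $seq (@aln) {
            my $aa = substr($seq, $col, 1);
            $count{$aa}++ if exists $count{$aa};    # skip gaps and unknown symbols
        }
        my $total = 0;
        $total += $_ for values %count;
        my %score;
        for my $aa (@AA) {
            $score{$aa} = log(($count{$aa} / $total) * $n_aa) / log(2);
        }
        push @pssm, \%score;
    }
    return \@pssm;
}

# Sum the position-specific scores of an ungapped sequence of standard residues
sub score_with_pssm {
    my ($pssm, $seq) = @_;
    my $score = 0;
    $score += $pssm->[$_]{ substr($seq, $_, 1) } for 0 .. $#$pssm;
    return $score;
}

my $pssm = build_pssm(1, qw(RGD RGD KGD RGE));    # made-up aligned sequences
printf "column 2: G = %.2f, A = %.2f\n", $pssm->[1]{G}, $pssm->[1]{A};
printf "RGD scores %.2f, AAA scores %.2f\n",
    score_with_pssm($pssm, 'RGD'), score_with_pssm($pssm, 'AAA');
```

The fully conserved glycine column gives G a positive score and every other residue a negative one, while the less conserved third column spreads its positive scores over D and E.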
  • 98. Pattern • a set of alternative sequences, using “regular expressions” • Prosite (http://www.expasy.org/prosite/)
  • 99. PSSM (Position Specific Scoring Matrix)
  • 100. PSSM (Position Specific Scoring Matrix)
  • 101. PSSM (Position Specific Scoring Matrix)
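Since patterns are regular expressions in a different notation, a simple PROSITE pattern can be translated into a Perl regex. The converter below is a sketch covering only the basic syntax (residues, `x`, `[..]` classes, `{..}` exclusions, `(n)` or `(n,m)` repeats, and the `<` and `>` anchors); the sub name is hypothetical. The example is the N-glycosylation site pattern N-{P}-[ST]-{P}.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Translate a simple PROSITE pattern into a Perl regular expression
sub prosite_to_regex {
    my ($pattern) = @_;
    $pattern =~ s/\.$//;    # drop the optional final period
    my $re = '';
    for my $el (split /-/, $pattern) {
        my $start  = $el =~ s/^<//                  ? '^'    : '';  # N-terminal anchor
        my $end    = $el =~ s/>$//                  ? '$'    : '';  # C-terminal anchor
        my $repeat = $el =~ s/\((\d+(?:,\d+)?)\)$// ? "{$1}" : '';  # x(2) or x(2,4)
        my $body   = $el eq 'x'              ? '.'        # any residue
                   : $el =~ /^\{([A-Z]+)\}$/ ? "[^$1]"    # {P}: anything but P
                   :                           $el;       # residue or [ST] class
        $re .= $start . $body . $repeat . $end;
    }
    return qr/$re/;
}

my $glyco = prosite_to_regex('N-{P}-[ST]-{P}');
for my $seq ('AANGTLK', 'AANPTLK') {
    printf "%s %s\n", $seq, $seq =~ $glyco ? 'matches' : 'does not match';
}
```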
  • 102. PSI-Blast • The power of profile methods can be further enhanced through iteration of the search procedure. – After a profile is run against a database, new similar sequences can be detected. A new multiple alignment, which includes these sequences, can be constructed, a new profile abstracted, and a new database search performed. – The procedure can be iterated as often as desired or until convergence, when no new statistically significant sequences are detected.
  • 103. PSI-Blast (1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program. (2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions. (3) The profile is compared to the protein database, again seeking local alignments using the BLAST algorithm. (4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments. (5) Finally, PSI-BLAST iterates, by returning to step (2), a specified number of times or until convergence.
  • 104. PSI-BLAST PSSM PSSM From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
  • 109. PSI-BLAST pitfalls • Avoid sequences that are too close: overfitting! • False homologues can be included! Therefore check the matches carefully: include or exclude sequences based on biological knowledge. • The E-value reflects the significance of the match to the previous training set, not to the original sequence! • Choose your query sequence carefully. • Try the reverse experiment to confirm.
  • 110. Reduce overfitting risk by Cobbler • A single sequence is selected from a set of blocks and enriched by replacing the conserved regions delineated by the blocks by consensus residues derived from the blocks. • Embedding consensus residues improves performance • S. Henikoff and J.G. Henikoff; Protein Science (1997) 6:698-705.
  • 111. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extensions PSI-Blast PHI-Blast Local Blast BLAT
  • 117. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extensions PSI-Blast PHI-Blast Local Blast BLAT
  • 118. Installing Blast Locally • 2 flavors: NCBI/WuBlast • Executables: – ftp://ftp.ncbi.nih.gov/blast/executables/ • Database: – ftp://ftp.ncbi.nih.gov/blast/db/ • Formatdb – formatdb -i ecoli.nt -p F – formatdb -i ecoli.protein -p T • For options: blastall - – blastall -p blastp -i query -d database -o output
  • 119. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extensions PSI-Blast PHI-Blast Local Blast BLAT
  • 120. Main database: BLAT • BLAT: BLAST-Like Alignment Tool • Aligns the input sequence to the Human Genome • Connected to several databases, like: mRNAs, ESTs, RepeatMasker, RefSeq, GenScan, TwinScan, UniGene, CpG Islands
  • 121. BLAT Human Genome Browser
  • 122. BLAT method • Align sequence with BLAT, get alignment info • Per BLAT hit, pick up additional info from connected databases: – mRNAs – ESTs – RepeatMasker – CpG Islands – RefSeq Genes
  • 123.
  • 124. Weblems W5.1: Submit the amino acid sequence of papaya papain to a BLAST (gapped and ungapped) and to a PSI-BLAST search. What are the main differences in the results? W5.2: Is there a relationship between Klebsiella aerogenes urease, Pseudomonas diminuta phosphotriesterase and mouse adenosine deaminase? Also use DALI, ClustalW and T-coffee. W5.3: Yeast two-hybrid typically yields DNA sequences. How would you find the corresponding protein? W5.4: When and why would you use tblastn? W5.5: How would you search a database if you want to restrict the search space to those entries having a secretion signal consisting of 4 consecutive (N-terminal) basic residues?