BLAST and FASTA


Pairwise Alignment

          Global                        Local
• Best score from among       • Best score from among
  alignments of full-length     alignments of partial
  sequences                     sequences
• Needleman-Wunsch            • Smith-Waterman
  algorithm                     algorithm
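
The difference between the two columns above can be seen in a small dynamic-programming sketch. This is a minimal illustration, not the slides' own code: the function names and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score over alignments of the full-length sequences."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: b aligned to gaps only
    for i in range(1, len(a) + 1):
        cur = [i * gap]                           # first column: a aligned to gaps only
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over alignments of partial sequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere (the "local" part).
            cell = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

With a short sequence embedded in a longer one, the global score is dominated by gap penalties while the local score reflects only the matching region.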




Why do we need local alignments?

 •   To compare a short sequence to a large one.

 •   To compare a single sequence to an entire
     database

 •   To compare a partial sequence to the whole.



Why do we need local alignments?
 • Identify newly determined sequences
 • Compare new genes to known ones
 • Guess functions for entire genomes full of
   ORFs of unknown function




Mathematical Basis
for Local Alignment
• Model matches as a sequence of coin
  tosses
• Let p be the probability of “head”
   – For a “fair” coin, p = 0.5
• According to the Erdős-Rényi law (Paul Erdős
  and Alfréd Rényi):
  If there are n throws, then the expected
  length, R, of the longest run of “heads”
  is approximately
               R = log_{1/p}(n).
Mathematical Basis
for Local Alignment

• Example: Suppose n = 20 for a “fair” coin
             R = log_2(20) ≈ 4.32
• Problem: How does one model DNA (or
  amino acid) alignments as coin tosses?
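
The fair-coin example above can be checked numerically. The sketch below evaluates the formula and, as an optional sanity check, simulates coin tosses with a fixed seed. The function names, trial count and seed are assumptions for illustration; the law is asymptotic, so for small n the simulated mean is only expected to land in the same neighbourhood as the formula.

```python
import math
import random


def expected_longest_run(n, p):
    """Erdos-Renyi estimate R = log_{1/p}(n), via change of base."""
    return math.log(n) / math.log(1.0 / p)


def longest_run(tosses):
    """Length of the longest run of True values ("heads")."""
    best = cur = 0
    for t in tosses:
        cur = cur + 1 if t else 0
        best = max(best, cur)
    return best


def simulated_mean_longest_run(n, p, trials=2000, seed=0):
    """Average longest head run over many simulated series of n tosses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_run(rng.random() < p for _ in range(n))
    return total / trials


print(round(expected_longest_run(20, 0.5), 2))
print(simulated_mean_longest_run(20, 0.5))
```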




Modeling Sequence Alignments
• To model random sequence alignments, replace a match by
  “head” (H) and mismatch by “tail” (T).

             AATCAT
             ATTCAG
             HTHHHT

• For ungapped DNA alignments, the probability of a “head”
  is 1/4.

• For ungapped amino acid alignments, the probability of a
  “head” is 1/20.
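
The head/tail encoding described above is easy to reproduce. This is a minimal sketch for two equal-length, ungapped sequences; the helper names are hypothetical.

```python
def to_coin_tosses(seq1, seq2):
    """Encode an ungapped alignment as 'H' (match) and 'T' (mismatch)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return "".join("H" if a == b else "T" for a, b in zip(seq1, seq2))


def longest_head_run(tosses):
    """Length of the longest run of consecutive 'H' characters."""
    return max(len(run) for run in tosses.split("T"))


tosses = to_coin_tosses("AATCAT", "ATTCAG")
print(tosses, longest_head_run(tosses))
```

For the pair shown on the slide this reproduces the string HTHHHT, whose longest run of heads has length 3.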
Modeling Sequence Alignments
• Thus, for any one particular alignment, the
  Erdős-Rényi law can be applied
• What about for all possible alignments?
   – Consider that sequences can be shifted back and
     forth in the dot matrix plot
• The expected length of the longest match is
                   R = log_{1/p}(mn)
  where m and n are the lengths of the two
  sequences.
Modeling Sequence Alignments
• Suppose m = n = 10, and we deal with DNA
  sequences
            R = log_4(100) ≈ 3.32
• This analysis assumes that the base
  composition is uniform and the alignment is
  ungapped. The result is approximate, but
  not bad.
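
The arithmetic for the DNA example above, written out with a change of base (p = 1/4 for uniform base composition, so 1/p = 4):

```latex
R = \log_{1/p}(mn)
  = \frac{\ln(mn)}{\ln(1/p)}
  = \frac{\ln(10 \cdot 10)}{\ln 4}
  \approx \frac{4.605}{1.386}
  \approx 3.32
```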

Heuristic Methods: FASTA and BLAST

FASTA
• First fast sequence searching algorithm for
  comparing a query sequence against a database.

BLAST
• Basic Local Alignment Search Tool
• Improves on FASTA in search speed, ease of
  use, and statistical rigor.
FASTA and BLAST
• Basic idea: a good alignment contains
  subsequences of absolute identity (short lengths
  of exact matches):

  – First, identify very short exact matches.
  – Next, the best short hits from the first step are
    extended to longer regions of similarity.
  – Finally, the best hits are optimized.


FASTA
Derived from logic of the dot plot
  – compute best diagonals from all frames of
    alignment
The method looks for exact matches between
 words in query and test sequence
  – DNA words are usually 6 nucleotides long
  – protein words are 2 amino acids long
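
The word-matching step above can be sketched as a lookup table of query words plus a count of hits per diagonal, where a diagonal is identified by (query offset - target offset). This is a simplified illustration under stated assumptions, not the FASTA program's actual code: the function names are hypothetical, and the rescoring of the best diagonals with a scoring matrix is left out.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Map each k-letter word in seq to the list of offsets where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k):
    """Count exact k-letter word matches on each diagonal (i - j)."""
    index = word_positions(query, k)
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals


hits = diagonal_hits("ACGTTGCAAT", "GGACGTTGCAATGG", k=6)
print(hits.most_common(1))
```

Diagonals with many hits are the candidates that a FASTA-style search would go on to rescore and join.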



FASTA Algorithm




Makes Longest Diagonal
After all diagonals are found, tries to join
 diagonals by adding gaps

Computes alignments in regions of best
 diagonals


FASTA Alignments




FASTA Results - Histogram
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/*.seq Sequences:     2,050 Symbols:
913,285 Word Size: 6
 Searching with both strands of the query.
 Scoring matrix: GenRunData:fastadna.cmp
 Constant pamfactor used
 Gap creation penalty: 16 Gap extension penalty: 4

Histogram Key:
 Each histogram symbol represents 4 search set sequences
 Each inset symbol represents 1 search set sequences
 z-scores computed from opt scores
z-score obs    exp
        (=)    (*)
< 20      0      0:
  22      0      0:
  24      3      0:=
  26      2      0:=
  28      5      0:==
  30     11      3:*==
  32     19     11:==*==
  34     38     30:=======*==
  36     58     61:===============*
  38     79    100:====================    *
  40    134    140:==================================*
  42    167    171:==========================================*
  44    205    189:===============================================*====
  46    209    192:===============================================*=====
  48    177    184:=============================================*
FASTA Results - List
The best scores are:                   init1 initn      opt     z-sc E(1018780)..

SW:PPI1_HUMAN    Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph... 1854   1854   1854   2249.3   1.8e-117
SW:PPI1_RABIT    Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi... 1840   1840   1840   2232.4   1.6e-116
SW:PPI1_RAT    Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho... 1543   1543   1837   2228.7   2.5e-116
SW:PPI1_MOUSE    Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph... 1542   1542   1836   2227.5   2.9e-116
SW:PPI2_HUMAN    Begin: 1 End: 270
! P48739 homo sapiens (human). phosph... 1533   1533   1533   1861.0   7.7e-96
SPTREMBL_NEW:BAC25830    Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10, ... 1488   1488   1522   1847.6   4.2e-95
SP_TREMBL:Q8N5W1    Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila... 1477   1477   1522   1847.6   4.3e-95
SW:PPI2_RAT    Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho... 1482   1482   1516   1840.4   1.1e-94




FASTA Results - Alignment
SCORES   Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374                                         (2038 nt)
 initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
  66.2% identity in 875 nt overlap
 (83-957:151-1022)

                   60        70        80         90      100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                            || ||| | ||||| |    ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150        160      170       180

                  120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             |||||||||   || |||    |   | || ||| |         || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                  180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
               ||| | ||||| ||    |||   ||||    | || | |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                  240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             ||||||||||     ||||| |     |||||| |||| |||   || ||| || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                    310       320       330       340       350       360
FASTA on the Web

• Many websites offer
  FASTA searches
• Each server has its limits
• Be aware that you
  depend “on the kindness
  of strangers.”

Institut de Génétique Humaine, Montpellier, France, GeneStream server
         http://www2.igh.cnrs.fr/bin/fasta-guess.cgi
Oak Ridge National Laboratory GenQuest server
         http://avalon.epm.ornl.gov/
European Bioinformatics Institute, Cambridge, UK
         http://www.ebi.ac.uk/htbin/fasta.py?request
EMBL, Heidelberg, Germany
         http://www.embl-heidelberg.de/cgi/fasta-wrapper-free
Munich Information Center for Protein Sequences (MIPS)
at Max-Planck-Institut, Germany
         http://speedy.mips.biochem.mpg.de/mips/programs/fasta.html
Institute of Biology and Chemistry of Proteins Lyon, France
         http://www.ibcp.fr/serv_main.html
Institut Pasteur, France
         http://central.pasteur.fr/seqanal/interfaces/fasta.html
GenQuest at The Johns Hopkins University
         http://www.bis.med.jhmi.edu/Dan/gq/gq.form.html
National Cancer Center of Japan
         http://bioinfo.ncc.go.jp

FASTA Format
• Simple format used by almost all programs
• A header line that starts with ">" and ends with a line break
• The sequence follows (no specific requirements for line
  length, characters, etc.)
>URO1 uro1.seq   Length: 2018   November 9, 2000 11:50   Type: N   Check: 3854   ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
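
Because the format is so simple, a reader for it fits in a few lines. The following is a minimal sketch (the function name `parse_fasta` is an assumption, and no validation of sequence characters is attempted); it returns the header text without the leading ">" and the sequence with line breaks removed.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">URO1 uro1.seq\nCGCAGAAAGA\nGGAGGCGCTT\n"
for name, seq in parse_fasta(example):
    print(name, len(seq))
```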
Assessing Alignment Significance
• Generate random alignments and
calculate their scores
• Compute the mean and the standard
deviation (SD) for random scores
• Compute the deviation of the actual score X
from the mean of random scores
               Z = (X - mean)/SD
• Evaluate the significance of the alignment
• The number of alignments expected by chance to
score at least this well is called the E score
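
The procedure above can be sketched by shuffling one sequence to generate random alignments. This is an illustration under assumptions, not how any particular program computes its z-scores: the scoring function is a toy (count of identical positions in an ungapped alignment), and the function names, trial count and seed are hypothetical.

```python
import random
import statistics


def identity_score(a, b):
    """Toy ungapped score: number of positions where the letters agree."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_z_score(query, target, score=identity_score, trials=200, seed=0):
    """Z = (X - mean) / SD, with mean and SD taken from shuffled targets."""
    rng = random.Random(seed)
    actual = score(query, target)
    letters = list(target)
    random_scores = []
    for _ in range(trials):
        rng.shuffle(letters)              # same composition, random order
        random_scores.append(score(query, "".join(letters)))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("random scores do not vary; Z is undefined")
    return (actual - mean) / sd


print(shuffle_z_score("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGT"))
```

Shuffling keeps the base composition of the target fixed, so the random scores reflect what that composition alone would produce.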
E scores or E values
E scores are not equivalent to p
values, for which
             p < 0.05
is generally considered
statistically significant.
E values (rules of thumb)
E values below 10^-6 are most probably
statistically significant.
E values above 10^-6 but below 10^-3
deserve a second look.
E values above 10^-3 should not be
tossed aside lightly; they should be
thrown out with great force.
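
These rules of thumb translate directly into a small triage helper. The function name and the returned labels are hypothetical, and how to treat values exactly equal to a threshold is an assumption, since the slide does not say.

```python
def triage_e_value(e_value):
    """Sort a hit into one of the three rule-of-thumb bins by its E value."""
    if e_value < 1e-6:
        return "probably significant"
    if e_value < 1e-3:
        return "second look"
    return "discard"


for e in (1.8e-117, 1e-4, 0.5):
    print(e, triage_e_value(e))
```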
BLAST
• Basic Local Alignment Search Tool
  – Altschul et al. 1990,1994,1997
• Heuristic method for local alignment
• Designed specifically for database searches
• Based on the same assumption as FASTA
  that good alignments contain short lengths
  of exact matches
BLAST
• Both BLAST and FASTA search for local
  sequence similarity - indeed they have exactly
  the same goals, though they use somewhat
  different algorithms and statistical approaches.

• BLAST benefits
  – Speed
  – User friendly
  – Statistical rigor
  – More sensitive
Input/Output
• Input:
  – Query sequence Q
  – Database of sequences DB
  – Minimal score S

• Output:
  – Sequences from DB (Seq), such that Q and Seq
    have scores > S

BLAST Searches GenBank
[BLAST= Basic Local Alignment Search Tool]
The NCBI BLAST web server lets you compare your
  query sequence to various sections of GenBank:
        –   nr = non-redundant (main sections)
        –   month = new sequences from the past few weeks
        –   refseq_rna = RNA entries from NCBI's Reference Sequence project
        –   refseq_genomic = Genomic entries from NCBI's Reference Sequence project
        –   ESTs
        –   Taxon = e.g., human, Drosophila, yeast, E. coli
        –   proteins (by automatic translation)
        –   pdb = Sequences derived from the 3-dimensional structures
            in the Brookhaven Protein Data Bank
BLAST
• Uses word matching like FASTA
• Similarity matching of words (3 amino acids, 11
  bases)
  – does not require identical words.
• If no words are similar, then no alignment
  – Will not find matches for very short sequences

• Does not handle gaps well
• “gapped BLAST” is somewhat better
BLAST Algorithm




BLAST Word Matching
Break query into words:
MEAAVKEEISVEDEAVDKNI
MEA
 EAA
  AAV
   AVK
    VKE
     KEE
      EEI
       EIS
        ISV
        ...
Break database sequences into words.
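
Breaking a sequence into overlapping words is a sliding window of fixed width. A minimal sketch, assuming a word length of 3 for protein queries as on the earlier BLAST slide (the function name is hypothetical):

```python
def query_words(seq, k=3):
    """Return every overlapping k-letter word of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


words = query_words("MEAAVKEEISVEDEAVDKNI")
print(words[:9])
```

A sequence of length L yields L - k + 1 words, so the 20-residue query above produces 18 three-letter words.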


Find locations of matching words
       in database sequences

Words:
MEA  EAA  AAV  AVK  KLV  KEE  EEI  EIS  ISV

Database sequences:
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELT MEAT
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRV KLVAIVDPH
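
Locating words in database sequences can be sketched as a plain substring scan. This is a simplification under stated assumptions: it finds exact occurrences only, whereas BLAST also accepts similar (non-identical) words, and real implementations use precomputed lookup tables rather than repeated scans. The function name is hypothetical.

```python
def find_word_hits(words, database):
    """Return (word, sequence index, offset) for every exact occurrence."""
    hits = []
    for seq_index, seq in enumerate(database):
        for word in words:
            start = seq.find(word)
            while start != -1:
                hits.append((word, seq_index, start))
                start = seq.find(word, start + 1)   # allow overlapping hits
    return hits


database = ["SVELTMEAT", "KRRVKLVAIVDPH"]
print(find_word_hits(["MEA", "KLV"], database))
```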




Extend hits one base at a time
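
A minimal sketch of ungapped hit extension: starting from a seed word match, walk outward one position at a time in each direction and keep the best-scoring extent. The stopping rule used here (give up once the running score falls `x_drop` below the best seen) and the match/mismatch values are assumptions for illustration, as are all names.

```python
def extend_hit(query, target, q_start, t_start, k,
               match=1, mismatch=-1, x_drop=3):
    """Extend a k-letter seed without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def extend(step, qi, ti):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            score += match if query[qi] == target[ti] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length    # remember best extent so far
            elif best - score >= x_drop:
                break                             # score dropped too far
            qi += step
            ti += step
        return best, best_len

    seed_score = sum(
        match if query[q_start + i] == target[t_start + i] else mismatch
        for i in range(k)
    )
    right_score, right_len = extend(+1, q_start + k, t_start + k)
    left_score, left_len = extend(-1, q_start - 1, t_start - 1)
    total = seed_score + left_score + right_score
    return total, q_start - left_len, q_start + k + right_len


print(extend_hit("GGACGTACGTCC", "TTACGTACGTAA", 2, 2, 3))
```

In the usage line, the seed ACG at offset 2 in both sequences grows rightward to cover the shared region ACGTACGT, and the mismatching flanks are left out.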




Seq_XYZ:      HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
Query:           QSVFDYIYYGCYCGWGLG_GK__PRDA

E-value = 10^-13




  • Use two word matches as anchors to build an alignment
    between the query and a database sequence.

  • Then score the alignment.
HSPs are Aligned Regions
• The results of the word matching and
  attempts to extend the alignment are
  segments
   - called HSPs (High-Scoring Segment
     Pairs)
• BLAST often produces several short HSPs
  rather than a single aligned region

>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
          Length = 369
 Score =    272 bits (137),   Expect = 4e-71
 Identities = 258/297 (86%), Gaps = 1/297 (0%)
 Strand = Plus / Plus

Query: 17    aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
             |||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||
Sbjct: 1     aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77    agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
             |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60    agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
            |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| || ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
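
The summary lines at the top of a hit like the one above are regular enough to pull apart with a few regular expressions. This is a sketch for this classic plain-text layout only; the function name and dictionary keys are assumptions, and other BLAST output formats would need a different reader.

```python
import re


def parse_hsp_header(text):
    """Extract score, E value, identities and gaps from an HSP summary."""
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)", text)
    expect = re.search(r"Expect\s*=\s*([0-9.eE+-]+)", text)
    ident = re.search(r"Identities\s*=\s*(\d+)/(\d+)", text)
    gaps = re.search(r"Gaps\s*=\s*(\d+)/(\d+)", text)
    identities, align_length = int(ident.group(1)), int(ident.group(2))
    return {
        "bits": float(score.group(1)),
        "raw_score": int(score.group(2)),
        "e_value": float(expect.group(1)),
        "identities": identities,
        "align_length": align_length,
        "gaps": int(gaps.group(1)) if gaps else 0,
        "percent_identity": 100.0 * identities / align_length,
    }


header = """ Score =    272 bits (137),   Expect = 4e-71
 Identities = 258/297 (86%), Gaps = 1/297 (0%)"""
print(parse_hsp_header(header))
```

Note that 258/297 is about 86.9%, so the 86% printed in the report appears to be truncated rather than rounded.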




BLAST variants




Understanding BLAST output




Choosing the right parameters




Controlling the output




More on BLAST

NCBI Blast Glossary
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html

Education: Blast Information
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

Steve Altschul's Blast Course
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html





Activity 01 - Artificial Culture (1).pdf
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 

Blast fasta 4

  • 2. Pairwise Alignment • Global: best score from among alignments of full-length sequences (Needleman-Wunsch algorithm) • Local: best score from among alignments of partial sequences (Smith-Waterman algorithm)
  • 3. Why do we need local alignments? • To compare a short sequence to a large one. • To compare a single sequence to an entire database. • To compare a partial sequence to the whole.
  • 4. Why do we need local alignments? • Identify newly determined sequences • Compare new genes to known ones • Guess functions for entire genomes full of ORFs of unknown function
  • 5. Mathematical Basis for Local Alignment • Model matches as a sequence of coin tosses • Let p be the probability of “head” – For a “fair” coin, p = 0.5 • According to the Erdős-Rényi law (Paul Erdős and Alfréd Rényi): if there are n throws, then the expected length, R, of the longest run of “heads” is R = log_{1/p}(n).
  • 6. Mathematical Basis for Local Alignment • Example: Suppose n = 20 for a “fair” coin: R = log_2(20) = 4.32 • Problem: How does one model DNA (or amino acid) alignments as coin tosses?
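The coin-toss estimate on slides 5 and 6 can be checked numerically. The following is a minimal sketch rather than part of the original slides; the function names `expected_longest_run` and `longest_run` are made up for illustration.

```python
import math


def expected_longest_run(n: int, p: float) -> float:
    """Erdős-Rényi estimate R = log_{1/p}(n) for n tosses with head probability p."""
    return math.log(n) / math.log(1.0 / p)


def longest_run(tosses: str, symbol: str = "H") -> int:
    """Length of the longest uninterrupted run of `symbol` in a toss string."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == symbol else 0
        best = max(best, current)
    return best


print(round(expected_longest_run(20, 0.5), 2))  # 4.32, as on slide 6
print(longest_run("HTHHHT"))                    # 3
```

The first function gives the expected value only; any single series of tosses can have a longer or shorter run, which `longest_run` measures directly.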
  • 7. Modeling Sequence Alignments • To model random sequence alignments, replace a match by “head” (H) and a mismatch by “tail” (T). For example, aligning AATCAT with ATTCAG gives HTHHHT. • For ungapped DNA alignments, the probability of a “head” is 1/4. • For ungapped amino acid alignments, the probability of a “head” is 1/20.
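The match-to-head mapping on slide 7 takes only a few lines to write down. This is an illustrative sketch; `to_coin_tosses` is a hypothetical helper name, and it assumes the two sequences are already aligned without gaps.

```python
def to_coin_tosses(seq_a: str, seq_b: str) -> str:
    """Map an ungapped alignment to H (match) / T (mismatch), position by position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    return "".join("H" if a == b else "T" for a, b in zip(seq_a, seq_b))


print(to_coin_tosses("AATCAT", "ATTCAG"))  # HTHHHT, the example from slide 7
```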
  • 8. Modeling Sequence Alignments • Thus, for any one particular alignment, the Erdős-Rényi law can be applied • What about for all possible alignments? – Consider that sequences can be shifted back and forth in the dot matrix plot • The expected length of the longest match is R = log_{1/p}(mn), where m and n are the lengths of the two sequences.
  • 9. Modeling Sequence Alignments • Suppose m = n = 10, and we deal with DNA sequences: R = log_4(100) = 3.32 • This analysis assumes that the base composition is uniform and the alignment is ungapped. The result is approximate, but not bad.
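To make the two-sequence version concrete, the sketch below evaluates R = log_{1/p}(mn) and, for comparison, measures the longest exact match between two sequences over all offsets (all diagonals of the dot plot). Both function names are assumptions made for this example.

```python
import math


def expected_longest_match(m: int, n: int, p: float) -> float:
    """Estimate R = log_{1/p}(m * n) for random sequences of lengths m and n."""
    return math.log(m * n) / math.log(1.0 / p)


def longest_exact_match(a: str, b: str) -> int:
    """Longest run of identical letters along any diagonal of the dot plot."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended one step back on the same diagonal.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


print(round(expected_longest_match(10, 10, 1 / 4), 2))   # 3.32 for DNA, as on slide 9
print(round(expected_longest_match(10, 10, 1 / 20), 2))  # 1.54 for amino acids
print(longest_exact_match("AATCAT", "ATTCAG"))           # 3 (the shared word TCA)
```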
  • 11. Heuristic Methods: FASTA and BLAST • FASTA: the first fast sequence searching algorithm for comparing a query sequence against a database. • BLAST: Basic Local Alignment Search Tool, an improvement on FASTA in search speed, ease of use, and statistical rigor.
  • 12. FASTA and BLAST • Basic idea: a good alignment contains subsequences of absolute identity (short lengths of exact matches): – First, identify very short exact matches. – Next, the best short hits from the first step are extended to longer regions of similarity. – Finally, the best hits are optimized.
  • 13. FASTA • Derived from the logic of the dot plot – computes the best diagonals from all frames of alignment • The method looks for exact matches between words in the query and test sequence – DNA words are usually 6 nucleotides long – protein words are 2 amino acids long
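The word-and-diagonal idea on slide 13 can be sketched as follows. This is a simplified illustration, not the FASTA program itself: it only counts exact word hits per diagonal, and the name `diagonal_hits` and the toy sequences are made up here.

```python
from collections import Counter, defaultdict


def diagonal_hits(query: str, subject: str, k: int = 6) -> Counter:
    """Count exact k-letter word matches on each diagonal (query pos - subject pos)."""
    # Lookup table: every word in the query and the positions where it starts.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    hits = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], ()):
            hits[i - j] += 1  # hits on the same diagonal share the same offset
    return hits


hits = diagonal_hits("ACGTACGTGG", "TTACGTACGTGG", k=6)
print(hits.most_common(1))  # [(-2, 5)]: five word hits on the diagonal shifted by 2
```

A diagonal with many hits marks a region where the two sequences line up without gaps, which is the kind of region the later steps try to join and refine.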
  • 15. Makes Longest Diagonal • After all diagonals are found, tries to join diagonals by adding gaps • Computes alignments in regions of best diagonals
  • 17. FASTA Results - Histogram
    !!SEQUENCE_LIST 1.0
    (Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
    TO: /u/browns02/Victor/Search-set/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6
    Searching with both strands of the query.
    Scoring matrix: GenRunData:fastadna.cmp
    Constant pamfactor used
    Gap creation penalty: 16 Gap extension penalty: 4
    Histogram Key:
    Each histogram symbol represents 4 search set sequences
    Each inset symbol represents 1 search set sequences
    z-scores computed from opt scores
    z-score obs exp
    (=) (*)
    < 20 0 0:
    22 0 0:
    24 3 0:=
    26 2 0:=
    28 5 0:==
    30 11 3:*==
    32 19 11:==*==
    34 38 30:=======*==
    36 58 61:===============*
    38 79 100:==================== *
    40 134 140:==================================*
    42 167 171:==========================================*
    44 205 189:===============================================*====
    46 209 192:===============================================*=====
    48 177 184:=============================================*
  • 18. FASTA Results - List
    The best scores are: init1 initn opt z-sc E(1018780)..
    SW:PPI1_HUMAN Begin: 1 End: 269 ! Q00169 homo sapiens (human). phosph... 1854 1854 1854 2249.3 1.8e-117
    SW:PPI1_RABIT Begin: 1 End: 269 ! P48738 oryctolagus cuniculus (rabbi... 1840 1840 1840 2232.4 1.6e-116
    SW:PPI1_RAT Begin: 1 End: 270 ! P16446 rattus norvegicus (rat). pho... 1543 1543 1837 2228.7 2.5e-116
    SW:PPI1_MOUSE Begin: 1 End: 270 ! P53810 mus musculus (mouse). phosph... 1542 1542 1836 2227.5 2.9e-116
    SW:PPI2_HUMAN Begin: 1 End: 270 ! P48739 homo sapiens (human). phosph... 1533 1533 1533 1861.0 7.7e-96
    SPTREMBL_NEW:BAC25830 Begin: 1 End: 270 ! Bac25830 mus musculus (mouse). 10, ... 1488 1488 1522 1847.6 4.2e-95
    SP_TREMBL:Q8N5W1 Begin: 1 End: 268 ! Q8n5w1 homo sapiens (human). simila... 1477 1477 1522 1847.6 4.3e-95
    SW:PPI2_RAT Begin: 1 End: 269 ! P53812 rattus norvegicus (rat). pho... 1482 1482 1516 1840.4 1.1e-94
  • 19. FASTA Results - Alignment
    SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
    >>GB_IN3:DMU09374 (2038 nt)
    initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
    66.2% identity in 875 nt overlap (83-957:151-1022)
    60 70 80 90 100 110
    u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
    || ||| | ||||| | ||| |||||
    DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
    130 140 150 160 170 180
    120 130 140 150 160 170
    u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
    ||||||||| || ||| | | || ||| | || || ||||| ||
    DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
    190 200 210 220 230 240
    180 190 200 210 220 230
    u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
    ||| | ||||| || ||| |||| | || | |||||||| || ||| ||
    DMU09374 AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
    250 260 270 280 290 300
    240 250 260 270 280 290
    u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
    |||||||||| ||||| | |||||| |||| ||| || ||| || |
    DMU09374 AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
    310 320 330 340 350 360
  • 20. FASTA on the Web • Many websites offer FASTA searches • Each server has its limits • Be aware that you depend “on the kindness of strangers.”
  • 21. FASTA servers
    Institut de Génétique Humaine, Montpellier, France, GeneStream server: http://www2.igh.cnrs.fr/bin/fasta-guess.cgi
    Oak Ridge National Laboratory, GenQuest server: http://avalon.epm.ornl.gov/
    European Bioinformatics Institute, Cambridge, UK: http://www.ebi.ac.uk/htbin/fasta.py?request
    EMBL, Heidelberg, Germany: http://www.embl-heidelberg.de/cgi/fasta-wrapper-free
    Munich Information Center for Protein Sequences (MIPS) at Max-Planck-Institut, Germany: http://speedy.mips.biochem.mpg.de/mips/programs/fasta.html
    Institute of Biology and Chemistry of Proteins, Lyon, France: http://www.ibcp.fr/serv_main.html
    Institut Pasteur, France: http://central.pasteur.fr/seqanal/interfaces/fasta.html
    GenQuest at The Johns Hopkins University: http://www.bis.med.jhmi.edu/Dan/gq/gq.form.html
    National Cancer Center of Japan: http://bioinfo.ncc.go.jp
  • 22. FASTA Format • Simple format used by almost all programs • >header line with a [return] at the end • Sequence (no specific requirements for line length, characters, etc.)
    >URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
    CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
    ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
    GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
    CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
    TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
    GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
    CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
    TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
    GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
    CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
    CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
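Because the format is just a header line starting with `>` followed by sequence lines, a reader for it is short. The sketch below is a minimal hand-written parser for illustration (the name `parse_fasta` and the two toy records are made up); established libraries handle more edge cases.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span any number of lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 toy record\nACGT\nTTGA\n>seq2\nGGCC\n"
print(parse_fasta(example))  # [('seq1 toy record', 'ACGTTTGA'), ('seq2', 'GGCC')]
```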
  • 23. Assessing Alignment Significance • Generate random alignments and calculate their scores • Compute the mean and the standard deviation (SD) of the random scores • Compute the deviation of the actual score X from the mean of the random scores: Z = (X - mean)/SD • Evaluate the significance of the alignment • The chance expectation associated with a Z value is reported as the E score
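The procedure on slide 23 can be sketched as below. Several things here are assumptions made for illustration, not taken from the slides: random alignments are produced by shuffling the subject sequence, the score is a simple count of identical positions, and the population standard deviation is used. The helper names are hypothetical.

```python
import random
import statistics


def identity_score(a: str, b: str) -> int:
    """Toy ungapped score: number of positions where the letters agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def shuffled_scores(query: str, subject: str, trials: int = 200, seed: int = 0) -> list[int]:
    """Scores of the query against randomly shuffled copies of the subject."""
    rng = random.Random(seed)
    letters = list(subject)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(identity_score(query, "".join(letters)))
    return scores


def z_score(actual: float, random_scores: list[float]) -> float:
    """Z = (X - mean) / SD; undefined if all random scores are identical."""
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    return (actual - mean) / sd


print(z_score(10, [2, 4, 4, 4, 5, 5, 7, 9]))  # 2.5 (mean 5, SD 2)

background = shuffled_scores("AATCAT", "ATTCAG")
print(z_score(identity_score("AATCAT", "ATTCAG"), background))
```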
  • 24. E scores or E values • E scores are not equivalent to p values, for which p < 0.05 is generally considered statistically significant.
  • 25. E values (rules of thumb) • E values below 10^-6 are most probably statistically significant. • E values above 10^-6 but below 10^-3 deserve a second look. • E values above 10^-3 should not be tossed aside lightly; they should be thrown out with great force.
  • 26. BLAST • Basic Local Alignment Search Tool – Altschul et al. 1990, 1994, 1997 • Heuristic method for local alignment • Designed specifically for database searches • Based on the same assumption as FASTA that good alignments contain short lengths of exact matches
  • 27. BLAST • Both BLAST and FASTA search for local sequence similarity - indeed they have exactly the same goals, though they use somewhat different algorithms and statistical approaches. • BLAST benefits – Speed – User friendly – Statistical rigor – More sensitive
  • 28. Input/Output • Input: – Query sequence Q – Database of sequences DB – Minimal score S • Output: – Sequences from DB (Seq), such that the alignment of Q and Seq scores > S
  • 29. BLAST Searches GenBank [BLAST = Basic Local Alignment Search Tool] The NCBI BLAST web server lets you compare your query sequence to various sections of GenBank: – nr = non-redundant (main sections) – month = new sequences from the past few weeks – refseq_rna = RNA entries from NCBI's Reference Sequence project – refseq_genomic = Genomic entries from NCBI's Reference Sequence project – ESTs – Taxon = e.g., human, Drosophila, yeast, E. coli – proteins (by automatic translation) – pdb = Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank
  • 30. BLAST • Uses word matching like FASTA • Similarity matching of words (3 amino acids, 11 bases) – does not require identical words. • If no words are similar, then no alignment – Will not find matches for very short sequences • Does not handle gaps well • “gapped BLAST” is somewhat better
  • 32. BLAST Word Matching • Break query into words: MEAAVKEEISVEDEAVDKNI gives MEA, EAA, AAV, AVK, VKE, KEE, EEI, EIS, ISV, ... • Break database sequences into words.
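Breaking a sequence into overlapping words is a sliding window. The sketch below reproduces the word list from slide 32; the function name `words` is an assumption, and it covers exact words only, not the similar-word matching mentioned on slide 30.

```python
def words(seq: str, w: int = 3) -> list[str]:
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


query_words = words("MEAAVKEEISVEDEAVDKNI", w=3)
print(query_words[:9])   # ['MEA', 'EAA', 'AAV', 'AVK', 'VKE', 'KEE', 'EEI', 'EIS', 'ISV']
print(len(query_words))  # 18 words from a 20-residue query
```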
  • 33. Find locations of matching words in database sequences • Query words: MEA, EAA, AAV, AVK, KLV, KEE, EEI, EIS, ISV • Database sequences: ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELT MEAT; TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY; IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRV KLVAIVDPH
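Locating query words in database sequences is usually done with a lookup table. The following is a minimal sketch under that assumption; the function names and the two toy database entries are invented for this example and exact word identity is required.

```python
from collections import defaultdict


def index_database(db: dict[str, str], w: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Map every word in the database to the (sequence name, position) pairs where it starts."""
    index = defaultdict(list)
    for name, seq in db.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return index


def find_hits(query: str, index: dict[str, list[tuple[str, int]]], w: int = 3) -> list[tuple[int, str, int]]:
    """Return (query position, sequence name, subject position) for each shared word."""
    hits = []
    for q_pos in range(len(query) - w + 1):
        for name, s_pos in index.get(query[q_pos:q_pos + w], ()):
            hits.append((q_pos, name, s_pos))
    return hits


toy_db = {"s1": "GGMEATT", "s2": "KLVAIV"}  # made-up sequences
print(find_hits("MEAAVK", index_database(toy_db)))  # [(0, 's1', 2)]
```

Building the index once and reusing it for every query word is what makes the word-matching step fast compared with scanning each database sequence repeatedly.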
  • 34. Extend hits one base at a time
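Extension can be pictured as walking away from a word hit one position at a time while keeping track of the best score seen. The sketch below extends to the right only, without gaps. The scoring values (+1 match, -1 mismatch) and the stopping rule (`x_drop`) are assumptions chosen for illustration and are not the parameters of any particular BLAST version.

```python
def extend_right(query: str, subject: str, q_end: int, s_end: int,
                 match: int = 1, mismatch: int = -1, x_drop: int = 3) -> tuple[int, int]:
    """Extend an ungapped hit to the right, one residue at a time.

    q_end and s_end are the indices just past the seed word in each sequence.
    Returns (best score gained, number of residues in the best extension).
    """
    score = best = best_len = 0
    i = 0
    while q_end + i < len(query) and s_end + i < len(subject):
        score += match if query[q_end + i] == subject[s_end + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop:
            break  # score has fallen too far below the best; stop extending
    return best, best_len


# Seed "ACG" at position 0 in both toy sequences, so extension starts at index 3.
print(extend_right("ACGTACGTTTAA", "ACGTACGAGGCC", q_end=3, s_end=3))  # (4, 4)
```

A full implementation would extend to the left in the same way and add the seed's own score to the two extensions.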
  • 35. Seq_XYZ: HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA Query: QSVFDYIYYGCYCGWGLG_GK__PRDA E-val = 10^-13 • Use two word matches as anchors to build an alignment between the query and a database sequence. • Then score the alignment.
  • 36. HSPs are Aligned Regions • The results of the word matching and attempts to extend the alignment are segments called HSPs (High-Scoring Segment Pairs) • BLAST often produces several short HSPs rather than a single aligned region
  • 37. BLAST result
    >gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
    Length = 369
    Score = 272 bits (137), Expect = 4e-71
    Identities = 258/297 (86%), Gaps = 1/297 (0%)
    Strand = Plus / Plus
    Query: 17 aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
    |||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||
    Sbjct: 1 aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59
    Query: 77 agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
    |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
    Sbjct: 60 agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119
    Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
    |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
    Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179
    Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
    ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
    Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239
    Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
    || || ||||| || ||||||||||| | |||||||||||||||||| ||||||||
    Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
  • 54. Choosing the right parameters
  • 63. More on BLAST
    NCBI Blast Glossary: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
    Education: Blast Information: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
    Steve Altschul's Blast Course: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
