Fundamentals in Sequence Analysis 1 (part 1)
Review of basic biology + database searching in biology.

Hugues Sicotte
NCBI
The Flow of Biotechnology Information
Gene → Function

> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA
ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA
TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG
GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA
CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC
TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA
ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG
TAAGAAGATCGCGAACATCTAGTAGA

> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI
DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK
KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE
PDEAEQDCIEFGKKIANI
Prerequisites to Sequence Analysis
• Basic Biology so you can understand the language
  of the databases: Central Dogma (transcription,
  translation), prokaryotes, eukaryotes, CDS,
  3´UTR, 5´UTR, introns, exons, promoters, operons,
  codons, start codons, stop codons, snRNA, hnRNA,
  tRNA, secondary structure, tertiary structure.
• Before you can analyze sequences, you have to
  understand their structure and know about basic
  biological database searching.
Central Dogmas of Molecular Biology
1) The concept of a gene is historically defined on the basis of genetic
inheritance of a phenotype (Mendelian inheritance).
2) The DNA of an organism encodes its genetic information. It is a
double-stranded helix with a deoxyribose sugar-phosphate backbone carrying four bases:
Adenine (A), Cytosine (C), Guanine (G) and Thymine (T).
[Note that only 4 values (A, C, G, T) need be encoded, which can be done using 2
bits. But to allow ambiguity letters (like N, meaning any of the 4
nucleotides), one usually resorts to a 4-bit alphabet.]
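The note on 2-bit versus 4-bit alphabets can be illustrated with a small sketch. The bit assignment below (one bit per base) and the names BASE_BITS, AMBIGUITY, encode and compatible are assumptions chosen for illustration, not the encoding of any particular file format; only three ambiguity letters are included.

```python
# One bit per base: a 4-bit code can represent any subset of {A, C, G, T}.
# The bit assignment is an illustrative assumption.
BASE_BITS = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}

# A few ambiguity letters, written as the set of bases they stand for.
AMBIGUITY = {
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any of the 4 nucleotides
}


def encode(letter):
    """Return the 4-bit code for a nucleotide or ambiguity letter."""
    letter = letter.upper()
    if letter in BASE_BITS:
        return BASE_BITS[letter]
    code = 0
    for base in AMBIGUITY[letter]:
        code |= BASE_BITS[base]
    return code


def compatible(a, b):
    """Two letters can match if the sets of bases they stand for overlap."""
    return (encode(a) & encode(b)) != 0


print(encode("N"))           # 15: all four bits set
print(compatible("N", "T"))  # True
print(compatible("R", "C"))  # False: a purine code never matches C
```

With only 2 bits per base there is no room left for letters such as N, which is why the wider alphabet is used.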
Central Dogmas of Molecular Biology
3) Each base on one side of the double helix faces its complementary base:
A ↔ T, and G ↔ C.
4) Biochemical processes that copy DNA (replication and transcription) always
synthesize the new strand from its 5´ end towards its 3´ end, so sequences are read and written 5´ to 3´.
5) A gene can be located on either the ´plus´ strand or the ´minus´ strand.
Rule 4) imposes the orientation of reading, and rule 3)
(complementarity) tells us to complement each base. E.g.
If the sequence on the + strand is ACGTGATCGATGCTA, the – strand
must be read off by reading the complement of this sequence going
´backwards´
e.g. TAGCATCGATCACGT
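The minus-strand rule above (complement each base, then read backwards) can be sketched in a few lines. The function name reverse_complement is an illustrative choice, and the sketch assumes the sequence contains only the letters A, C, G and T.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse, to read the opposite strand 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTGATCGATGCTA"))  # TAGCATCGATCACGT
```

Applying the function twice returns the original sequence, which is a quick sanity check.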
Central Dogmas of Molecular Biology
 6) DNA information is copied over to mRNA that acts as a template to
 produce proteins.




We often concentrate on protein coding genes, because proteins are
the building blocks of cells and the majority of bio-active molecules.
(but let´s not forget the various RNA genes)
Prokaryotic genes

         Prokaryotes (intronless protein coding genes)
Upstream (5’)                            Gene region
           promoter                                                      Downstream (3’)
                              TAC
                                                                                  DNA
                          Transcription (gene is encoded on minus strand ..
                          And the reverse complement is read into mRNA)
                             ATG
                                                                                mRNA
                      5´ UTR CoDing Sequence (CDS)                 3´ UTR
                             ATG

                            Translation: tRNAs read off the codons, 3
                            bases at a time, starting at the start codon until
                            a STOP codon is reached.
                                                                     protein
Why does Nature bother with mRNA?
Why would the cell want to have an intermediate between DNA and
the proteins it encodes?
     •Gene information can be amplified by having many copies of an
     RNA made from one copy of DNA.
     •Regulation of gene expression can be effected by having specific
     controls at each element of the pathway between DNA and
     proteins. The more elements there are in the pathway, the more
     opportunities there are to control it in different circumstances.
     •In Eukaryotes, the DNA can then stay pristine and protected,
     away from the caustic chemistry of the cytoplasm.
Prokaryotic genes (operons)

      Prokaryotes (operon structure)
upstream promoter                                              downstream




                    Gene 1        Gene 2           Gene 3

         In prokaryotes, genes that are part of the same
         functional pathway are sometimes grouped together under a single
         promoter. They are then transcribed as one polycistronic mRNA,
         from which each of the 3 proteins is translated separately.
Bacterial Gene Structure of signals




    Bacterial genomes have simple gene structure.
    - Transcription factor binding site.
    - Promoters
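               -35 sequence: TTGACA (T82 T84 G78 A65 C54 A45, % conservation), then a 15-20 base spacer
               -10 sequence: TATAAT (T80 A95 T45 A60 A50 T96, % conservation), then 5-9 bases to the start site
    - Start of transcription (initiation site): a purine in 90% of cases (sometimes it is the
    “A” in CAT)
    - Translation (ribosome) binding site (Shine-Dalgarno, AGGAGG) about 10 bp upstream of the AUG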
    - One or more Open Reading Frame
         •start-codon (unless sequence is partial)
         •until next in-frame stop codon on that strand ..
         Separated by intercistronic sequences.
    - Termination
Genetic Code

How does an mRNA specify amino acid sequence? The answer lies in
the genetic code. It would be impossible for each amino acid to be
specified by one nucleotide, because there are only 4 nucleotides and 20
amino acids. Similarly, two nucleotide combinations could only specify
16 amino acids. The final conclusion is that each amino acid is specified
by a particular combination of three nucleotides, called a codon:

Each group of 3 nucleotides codes for one amino acid.
•The first codon is the start codon, and usually codes for the amino
acid Methionine (M, codon ‘ATG’).
•The last codon is the stop codon and does NOT code for an amino acid.
It is sometimes represented by ‘*’.

•A coding region (abbreviation CDS) starts at the START codon and
ends at the STOP codon.
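As a rough sketch of these rules, the code below translates a DNA string from its first ATG until a stop codon. It assumes the standard genetic code, written compactly with the bases ordered T, C, A, G; the names CODON_TABLE and translate_cds are illustrative. The example input is the first line of the DNA sequence shown at the start of this presentation.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate from the first ATG, 3 bases at a time, until a stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with unknown letters
        if aa == "*":  # STOP codon: does not code for an amino acid
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC"))  # MKIVYWSGTGN
```

The result matches the first residues of the protein sequence given alongside that DNA sequence. Taking the first ATG as the start codon is a simplification; real gene finders weigh several candidate starts.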
Codon table
              Note the degeneracy of the
              genetic code. Each amino acid
              might have up to six codons that
              specify it.
              • Different organisms have
              different frequencies of codon
              usage.
              •A handful of species deviate from
              the codon assignments described
              above, and use different codons for
              different amino acids.

              How do tRNAs recognize to which
              codon to bring an amino acid? The
              tRNA has an anticodon on its
              mRNA-binding end that is
              complementary to the codon on the
              mRNA. Each tRNA only binds the
              appropriate amino acid for its
              anticodon.
RNA




RNA has the same primary structure as DNA. It consists of a sugar-phosphate
  backbone, with bases attached to the 1' carbon of the sugar. The differences
  between DNA and RNA are that:
  1. RNA has a hydroxyl group on the 2' carbon of the sugar (thus the difference
      between deoxyribonucleic acid and ribonucleic acid).
  2. Instead of the base thymine, RNA uses another base called
      uracil (U).
  3. Because of the extra hydroxyl group on the sugar, RNA does not adopt the
      same double helix as DNA, and usually exists as a single-stranded molecule. However,
      regions of double helix can form where there is some base pair
      complementarity (U and A, G and C), resulting in hairpin loops. The RNA
      molecule with its hairpin loops is said to have a secondary structure.
  4. Because the RNA molecule is not restricted to a rigid double helix, it can
      form many different stable three-dimensional tertiary structures.
tRNA ( transfer RNA)
 is a small RNA that has a very specific secondary and tertiary structure such that it can
 bind an amino acid at one end, and mRNA at the other end. It acts as an adaptor to carry
 the amino acid elements of a protein to the appropriate place as coded for by the mRNA.




[Figure: secondary structure of tRNA; three-dimensional tertiary structure]
Bacterial Gene Prediction

Most of the consensus sequences are known from E. coli
studies, so for each bacterium the exact distribution of
consensus sequences will differ.
Most modern gene prediction programs need to be
“trained”, i.e. they find their own consensus and assembly
rules given a few example genes.
A few programs find their own rules from a completely
unannotated bacterial genome by trying to find conserved
patterns. This is feasible because ORF’s restrict the
search space of possible gene candidates.
E.g. selfid program(selfid@igs.cnrs-mrs.fr)
Open Reading Frames

The simplest bacterial gene prediction techniques
   simply
1) identify all open reading frames(ORFs),
2) and blastx them against known proteins.
3) The ORFs with the best homology are retained
   first.
4) This usually densely covers the bacterial
   genomes with genes. rRNA and tRNA are
   detected separately using tRNAScan or blastn.
Open Reading Frames (ORF)
On a given piece of DNA, there are 6 possible reading frames. An ORF can be
on either the + or the – strand, and in any of 3 frames on that strand.
Frame 1: the 1st base of the start codon is at base 1, 4, 7, 10, ...
Frame 2: the 1st base of the start codon is at base 2, 5, 8, 11, ...
Frame 3: the 1st base of the start codon is at base 3, 6, 9, 12, ...
(frames -1, -2, -3 are on the minus strand)
Some programs have other conventions for naming frames (0..5, 1-6, etc.)
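The six-frame convention can be sketched as follows. This is a minimal illustration, assuming an ORF runs from the first ATG in a frame to the next in-frame stop codon. The function names and the min_codons threshold are hypothetical, positions are 0-based with an exclusive end, and minus-strand positions are reported relative to the reverse-complemented sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for ATG..stop ORFs in the 3 frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield frame + 1, start, i + 3  # end includes the stop codon
                start = None


def six_frame_orfs(dna, min_codons=2):
    """Return ORFs in frames 1..3 (plus strand) and -1..-3 (minus strand)."""
    dna = dna.upper()
    minus = dna.translate(COMPLEMENT)[::-1]
    found = list(orfs_in_strand(dna, min_codons))
    found += [(-f, s, e) for f, s, e in orfs_in_strand(minus, min_codons)]
    return found


print(six_frame_orfs("CCATGAAATGGTAGCC"))  # [(3, 2, 14)]
```

In the example the ORF ATGAAATGGTAG starts at base 3 (1-based), so it is reported in frame 3. A real pipeline would then compare each candidate against known proteins, as described for blastx above.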

Gene finding in eukaryotic cDNA uses ORF finding + blastx as well.
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Try with gi=41 (or your own piece of DNA).
Eukaryotic Central Dogma
In Eukaryotes (cells where the DNA is sequestered in a separate nucleus),
the DNA does not contain a contiguous copy of the coding sequence; rather, exons must be spliced together.
(Many eukaryotic genes contain no introns, particularly in ´lower´ organisms.)
mRNA (messenger RNA) contains the assembled copy of the gene. The mRNA acts as a
messenger to carry the information stored in the DNA in the nucleus to the cytoplasm
where the ribosomes can make it into protein.
Eukaryotic Nuclear Gene Structure

Gene prediction for Pol II transcribed genes.
• Upstream Enhancer elements.
• Upstream Promoter elements.
• GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)

• TATA promoter (-30 nt) (70%, 15 nt
consensus; Bucher et al. (1990))
• 14-20 nt spacer DNA
• CAP site (8 bp)
• Transcription Initiation.
• Transcript region, interrupted by introns.
• Translation Initiation (Kozak signal, 12 bp
consensus) 6 bp prior to initiation codon.
• polyA signal (AATAAA 99%, other)
Introns
•Transcript region, interrupted by introns. Each
intron
    •starts with a donor site consensus
    (G100 T100 A62 A68 G84 T63 ..., numbers are % conservation)
    •has a branch site near the 3’ end of the intron
    (one not very conserved consensus:
    UACUAAC)
    •ends with an acceptor site consensus
    (12 Py .. N C65 A100 G100)




           UACUAAC      AG
Exons
•The exons of the transcript region are
composed of:
        •5’ UTR (mean length of 769 bp, with a
        specific base composition that
        depends on local G+C content of
        genome)
        •AUG (or other start codon)
        •Remainder of coding region
        •Stop Codon
        •3’ UTR (mean length of 457, with a
        specific base composition that
        depends on local G+C content of
        genome)
Structure of the Eukaryotic Genome

          ~6-12% of human DNA encodes
          proteins(higher fraction in
          nematode)
          ~10% of human DNA codes for
          UTR
          ~90% of human DNA is non-
          coding.
Non-Coding Eukaryotic DNA



    •Untranslated regions (UTRs)
    •introns (there can be genes within
    introns of another gene!)
    •intergenic regions:
            - repetitive elements
            - pseudogenes (dead
            genes that may or may not
            have been retroposed back into the
            genome as a single-exon “gene”)
Pseudogenes

Pseudogenes:
        DNA sequence that might code for a
gene, but that is unable to result in a protein.
This deficiency might be in transcription (lack of
promoter, for example) or in translation or both.
Processed pseudogenes:
         Gene retroposed back into the genome
after being processed by the splicing apparatus.
Thus it is fully spliced and has a polyA tail.
The insertion process flanks the mRNA sequence with
short direct repeats.
Thus it has no promoter, unless it is accidentally
retroposed downstream of a promoter
sequence.
Do not confuse with single-exon genes.
Repeats
Each repeat family has many subfamilies.
- ALU: ~300 nt long; 600,000 elements in the human
genome. Can cause false homology with mRNA.
Many have an AluI restriction site.
- Retroposons (can get copied back into the
genome)
    - Telltale sign: direct or inverted repeats flank
    the repeated element. That repeat was the
    priming site for the RNA that was inserted.
LINEs (Long INterspersed Elements)
        L1: 1-7 kb long, 50,000 copies
        Have two ORFs! Will cause problems
for gene prediction programs.
SINEs (Short INterspersed Elements)
Low-Complexity Elements

• When analyzing sequences, one often relies on the
  fact that two stretches are similar to infer that they
  are homologous (and therefore related). But
  sequences with repeated patterns will match
  without there being any phylogenetic relation!
• Sequences like ATATATACTTATATA which are
  mostly two letters are called low-complexity.
• Triplet repeats (particularly CAG) have a tendency
  to make the replication machinery stutter, so they
  are amplified.
• The low-complexity sequence can also be hidden
  at the translated protein level.
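One simple way to put a number on "low-complexity" is the Shannon entropy of the letter composition, sketched below. This is only an illustration and not the method used by any particular masking program; the function names and the 1.5-bit threshold are assumptions.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Entropy in bits per letter; 2.0 is the maximum for a 4-letter alphabet."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum(n / total * log2(n / total) for n in counts.values())


def is_low_complexity(seq, threshold=1.5):
    """Flag a sequence whose composition entropy falls below the threshold."""
    return shannon_entropy(seq) < threshold


print(is_low_complexity("ATATATACTTATATA"))   # True: mostly two letters
print(is_low_complexity("ACGTACGTACGTACGT"))  # False: all four letters equally used
```

Composition alone misses ordered repeats such as ACGTACGT, so in practice the measure would be applied over short windows or combined with repeat detection.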
Masking
•To avoid finding spurious matches in alignment programs, you
should always mask low-complexity regions and repeats in the query sequence.
•Before predicting genes it is a good idea to mask out repeats (at
least those containing ORFs).
•Before running blastn against a genomic record, you must mask
out the repeats.
•Most used Programs:
CENSOR:
Repeat Masker:
http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
More Non-Protein genes
rRNA - ribosomal RNA
   is one of the structural components of the ribosome. It has sequence
   complementarity to regions of the mRNA so that the ribosome knows where to
   bind to an mRNA it needs to make protein from.

snRNA - small nuclear RNA
   is part of the machinery that processes RNAs (e.g. splicing) in the
   nucleus before they travel to the cytoplasm.
hnRNA – heterogeneous nuclear RNA.
         the large, unprocessed primary transcripts (pre-mRNA) found in the nucleus.
Protein Processing & localization.


The protein as read off from the mRNA may not be in the final
form that will be used in the cell. Some proteins contain
• a Signal Peptide (located at the N-terminus (beginning)). This signal
peptide is used to guide the protein from the ribosome towards its
final cellular localization. The signal peptide is cleaved off at
the cleavage site once the protein has reached (or is near) its final
destination.
•Various Post-Translational modifications (e.g. phosphorylation)


The final protein is called the “mature peptide”
Convention for nucleotides in database


Because the mRNA is the complement of the DNA strand it is
transcribed from, nucleotide sequences are always quoted
in the same sense as the mRNA (5´ to 3´), not as the template strand.
In bioinformatics the sequence format does NOT make a
difference between Uracil and Thymine. There is no
symbol for Uracil; it is always represented by a ´T´.
Even genomic sequence follows that convention. A gene
on the ´plus´ strand is quoted so that it is in the same
sense as its product mRNA.
Biology Information on the Internet

• Introduction to Databases
• Searching the Internet for Biology
  Information.
  – General Search methods
  – Biology Web sites
• Introduction to Genbank file format.
• Introduction to Entrez and Pubmed

• Ref: Chapters 1,2,5,6 of “Bioinformatics”
• Databases:
  – A collection of Records.
  – Each record has many fields.
  – Each field contains specific information.
  – Each field has a data type.
      » E.g. money, currency, text field, integer,
        date, address (text field), citation (text field)
  – Each record has a primary key: a UNIQUE
    identifier that unambiguously defines this
    record.
  Spread-sheet: flat-file version of a database.
gi      Accession version date     Genbank Division taxid organism       Number of Chromosomes
6226959 NM_000014       3 06/01/00 PRI               9606 homo sapiens   22 diploid + X+Y
6226762 NM_000014       2 10/12/99 PRI               9606 homo sapiens   22 diploid + X+Y
4557224 NM_000014       1 02/04/99 PRI               9606 homo sapiens   22 diploid + X+Y
     41 X63129          1 06/06/96 MAM               9913 bos taurus     29+X+Y
gi      Accession version date       Genbank Division taxid organism       Number of Chromosomes
6226959 NM_000014       3 01/06/2000 PRI               9606 homo sapiens   22 diploid + X+Y
6226762 NM_000014       2 12/10/1999 PRI               9606 homo sapiens   22 diploid + X+Y
4557224 NM_000014       1 04/02/1999 PRI               9606 homo sapiens   22 diploid + X+Y
     41 X63129          1 06/06/1996 MAM               9913 bos taurus     29+X+Y

   Gi = Genbank Identifier: Unique Key : Primary Key

   GI Changes with each update of the sequence
   record.
   Accession Number: Secondary key: Points to same locus and
   sequence despite sequence updates.


   Accession + Version Number equivalent to Gi
gi      Accession version date       Genbank Division taxid organism       Number of Chromosomes
6226959 NM_000014       3 01/06/2000 PRI               9606 homo sapiens   22 diploid + X+Y
6226762 NM_000014       2 12/10/1999 PRI               9606 homo sapiens   22 diploid + X+Y
4557224 NM_000014       1 04/02/1999 PRI               9606 homo sapiens   22 diploid + X+Y
     41 X63129          1 06/06/1996 MAM               9913 bos taurus     29+X+Y
Relational Database: normalizing a database that has repeated sub-elements
by splitting it into smaller tables, and relating
the sub-tables to the first one using the primary key.
  gi          Accession         version   date          Genbank Division taxid
  6226959     NM_000014               3   01/06/2000    PRI               9606
  6226762     NM_000014               2   12/10/1999    PRI               9606
  4557224     NM_000014               1   04/02/1999    PRI               9606
       41     X63129                  1   06/06/1996    MAM               9913

  taxid    organism     Number of Chromosomes
      9606 homo sapiens 22 diploid + X+Y
      9913 bos taurus   29+X+Y
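The split into two tables can be tried out with Python's built-in sqlite3 module. This is a sketch only: the table and column names are invented for the example, and the dates are rewritten in ISO form on the assumption that 01/06/2000 in the table above means 1 June 2000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (
    taxid       INTEGER PRIMARY KEY,
    organism    TEXT,
    chromosomes TEXT
);
CREATE TABLE record (
    gi          INTEGER PRIMARY KEY,
    accession   TEXT,
    version     INTEGER,
    update_date TEXT,
    division    TEXT,
    taxid       INTEGER REFERENCES taxon(taxid)
);
""")

# Organism details are stored once per taxid instead of once per record.
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", [
    (9606, "homo sapiens", "22 diploid + X+Y"),
    (9913, "bos taurus", "29+X+Y"),
])
conn.executemany("INSERT INTO record VALUES (?, ?, ?, ?, ?, ?)", [
    (6226959, "NM_000014", 3, "2000-06-01", "PRI", 9606),
    (6226762, "NM_000014", 2, "1999-10-12", "PRI", 9606),
    (4557224, "NM_000014", 1, "1999-02-04", "PRI", 9606),
    (41, "X63129", 1, "1996-06-06", "MAM", 9913),
])

# Joining on taxid rebuilds the original flat view.
rows = conn.execute("""
    SELECT r.gi, r.accession, r.version, t.organism
    FROM record AS r JOIN taxon AS t ON r.taxid = t.taxid
    ORDER BY r.gi
""").fetchall()
print(rows[0])  # (41, 'X63129', 1, 'bos taurus')
```

The join shows that nothing is lost by normalizing: the flat table is recovered on demand, while each organism is described in exactly one place.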
Types of Relational databases.
• The Internet can be thought of as one
  enormous relational database.
  – The “links”/URL are the primary keys.
• SQL (Structured Query Language)
  – Sybase; Oracle ; Access; (Databases systems)
     • Sybase used at NCBI.
  – SRS(One type of database querying system of
    use in Biology)
Indexed searches.
• To allow easy searching of a database, make
  an index.
• An index is a list of primary keys
  corresponding to a key in a given field (or to
  a collection of fields)

 Genbank division
 PRI     6226959;6226762;4557224;…
 MAM     41;…

 Accession
 NM_000014
         6226959;6226762;4557224;
 X63129 41;
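An index of this kind is easy to sketch with a dictionary that maps each field value to the list of primary keys (gi numbers) having that value. The record layout and the function name build_index are illustrative assumptions; the data are the four records from the tables above.

```python
from collections import defaultdict

RECORDS = [
    {"gi": 6226959, "accession": "NM_000014", "division": "PRI"},
    {"gi": 6226762, "accession": "NM_000014", "division": "PRI"},
    {"gi": 4557224, "accession": "NM_000014", "division": "PRI"},
    {"gi": 41, "accession": "X63129", "division": "MAM"},
]


def build_index(records, field):
    """Map each value of `field` to the list of primary keys (gi) that carry it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["gi"])
    return dict(index)


division_index = build_index(RECORDS, "division")
accession_index = build_index(RECORDS, "accession")

print(division_index["PRI"])      # [6226959, 6226762, 4557224]
print(accession_index["X63129"])  # [41]
```

A lookup then costs one dictionary access instead of a scan over every record.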
Indexed searches.
• Boolean Query: Merging and Intersecting lists:
  – AND (in both lists) (e.g. human AND genome)
        – +human +genome
        – human && genome
  – OR (in either list) (e.g. human OR genome)
        – human || genome
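With an index in hand, Boolean queries reduce to set operations on the lists of matching keys. The sketch below uses a tiny made-up index (the document numbers are hypothetical) and two illustrative helper functions.

```python
# Hypothetical index: search term -> set of document identifiers containing it.
INDEX = {
    "human": {1, 2, 3, 5},
    "genome": {2, 3, 4},
}


def query_and(index, *terms):
    """Documents present in every term's list (intersection)."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def query_or(index, *terms):
    """Documents present in at least one term's list (union)."""
    return set().union(*(index.get(term, set()) for term in terms))


print(query_and(INDEX, "human", "genome"))  # {2, 3}
print(query_or(INDEX, "human", "genome"))   # {1, 2, 3, 4, 5}
```

AND can only shrink a result list and OR can only grow it.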
Search strategies
• Search engines use complex strategies that go
  beyond Boolean queries.
    – Phrases matching:
        • human genome -> “human genome”
    – togetherness: documents with human close to genome
      are scored higher.
    – Term expansion & synonyms:
        • human -> homo sapiens
    – neighbours:
            – human genome -> genome projects, chromosomes, genetics
    – Frequency of links (www.google.com)
• To avoid this term mapping, enclose your queries in quotes:
  “human” AND “genome”
Search strategies
• To require that ALL the terms in your query be present,
  precede them with a “+”. This also prevents term
  mapping.
• To force the order of the words to be important, enclose
  phrases in quotes: “biology of mammals”.
Indexed searches.
Example

• Find the advanced query page at
  http://www.altavista.com
• Type human (and hit the Search button)
• Type genome
• Type human AND genome
• Type “human genome” (finds the fewest matches)
• Type human OR genome (finds the most matches)
• Search Engines:
  – Web Spiders: Collection of All web pages, but
    since Web pages change all the time and new
    ones appear, they must constantly roam the web
    and re-index.. Or depend on people submitting
    their own pages.
     •   www.google.com (BEST!)
     •   www.infoseek.com
     •   www.lycos.com
     •   www.excite.com
     •   www.webcrawler.com
     •   www.looksmart.com (country specific)
• Search Engines:
     • www.google.com (BEST!)
     • Google ranks pages according to how many pages with those
       terms refer to the pages you are asking for. Not only must one
       document contain ALL the search terms, but other documents
       which refer to this one must also contain all the terms.
     • Great when you know what you are looking for! You can also
       use “” to require immediate proximity and order of terms.
     • E.g. type
              » Web server for the blast program.
     But Google only indexes about 40% of the web, so you may
      have to use other web spiders.

     (disclaimer.. I don’t own stock in that company.. But I’d like to)
• Search Engines:
  – Curated Collections: Not comprehensive:
    Contains list of best sites for commonly
    requested topics, but is missing important sites
    for more specialized topics (like biology)
     • www.yahoo.com (Has travel maps too!)
  – Answer-based curated collections: Easy to
    use English-like queries. First looks at a list of
    predefined answers, then refines answers based
    on user interaction. Also answers new questions.
     •   www.askjeeves.com
     •   www.magellan.com
     •   www.altavista.com(has translation TOOLS)
     •   www.hotbot.com
• Search Engines:
  – Meta-Search Engines: Polls several search
    engines, and returns the consensus of all results.
    Is likely to miss sites, but the sites it returns are
    very relevant to the query.
  – The other operating mode is to return the sum of all
    the results. It then becomes very sensitive to a
    very detailed query.
     •   www.metacrawler.com
     •   www.savvysearch.com
     •   www.1blink.com (fast)
     •   www.metafind.com
     •   www.dogpile.com
• Virtual Libraries: Curated collections of
  links for Biologists.(by Biologists)
  – Pedro’s BioMolecular Research Tools:(1996)
     • http://www.public.iastate.edu/~pedro/
  – Virtual Library: Bio Sciences
     • http://vlib.org/Biosciences.html
  – Publications and abstract search.
     • http://www.ncbi.nlm.nih.gov/
  – Expasy server
     • http://www.expasy.ch
  – EBI Biocatalog (software & databases list)
     • http://www.ebi.ac.uk/biocat/
Biological Databases
• Nucleotide databases:
   – Genbank: International Collaboration
      • NCBI(USA), EMBL(Europe), DDBJ (Japan and Asia)
      • A “bank” No curation.. Submission to these database is
        required for publication in a journal.
   – Organism-specific databases (Exercise: Find URLs
     using search engines)
      •   FlyBase
      •   ChickGBASE
      •   pigbase
      •   wormpep
      •   YPD (Yeast Protein Database)
      •   SGD(Saccharomyces Genome Database)
• Protein Databases:
   – NCBI:
   – Swiss Prot:(Free for academic use, otherwise
     commercial. Licensing restrictions on discoveries made
     using the DB. 1998 version free of any licensing)
      • http://www.expasy.ch(latest pay version)
      • NCBI has the latest free version.
      • Translated Proteins from Genbank Submissions


   – EMBL
      • TrEMBL is a computer-annotated supplement of SWISS-PROT
        that contains all the translations of EMBL nucleotide sequence
        entries not yet integrated in SWISS-PROT


   – PIR
• Structure databases:
  – PDB: Protein structure database.
      • http://www.rcsb.org/pdb/
  – MMDB: NCBI’s version of PDB with Entrez
    links.
      • http://www.ncbi.nlm.nih.gov
• Genome Mapping Information:
  – http://www.il-st-acad-sci.org/health/genebase.html
  – NCBI(Human)
  – Genome Centers:
      • Stanford, Washington University
  – Research Centers and Universities
• Literature databases:
  – NCBI: Pubmed: All biomedical literature.
     • www.ncbi.nlm.nih.gov
     • Abstracts and links to publisher sites for
        – full text retrieval/ordering
        – journal browsing.
  – Publisher web sites.
  – Biomednet: Commercial site for literature
    search.
• Pathways Database:
  – KEGG: Kyoto Encyclopedia of Genes and
    Genomes: www.genome.ad.jp/kegg/kegg/html
• Database Identifiers: Primary keys
  – GI (changes with each sequence update for
    NCBI only)
       • Annotation may change without the gi changing!
  –   Accession(stable)
  –   version(changes with each sequence update)
  –   “Version” also refers to Accession.version
  –   Secondary accession: Records may have been
      merged in the past.. So the records which were
      not chosen as the primary were made
      secondary.
Primary Databases
• A primary Database is a repository of data
  derived from experiments or from research
  knowledge.
  –   Genbank (Nucleotide repository)
  –   Protein DB, Swissprot
  –   PDB (MMDB) are primary databases.
  –   Pubmed (literature)
  –   Genome Mapping databases.
  –   Kegg Database.(pathways)
Secondary Databases
• A secondary database contains information
  derived from other sources.
  – Refseq (Curated collection of Genbank at
    NCBI)
  – Unigene (Clustering of ESTs at NCBI)
• Organism-specific databases are often a mix
  between primary and secondary.
Genbank Records
• A Bank: No attempt at reconciliation.
• Submit a sequence → get an Accession Number!
    – Cannot modify sequences without submitter’s consent.
    – No attempt at reconciliation (not a unique collection
      per LOCUS/gene).
    – Entries of various sequence quality and different
      sources ==> separated into various divisions:
       • High-quality sequences in taxon-specific divisions.
       • Low-quality sequences in usage-specific databases.
• A Collaboration between NCBI, EMBL and
  DDBJ. They contain (nearly) the same
  information, only the data format differs.
EMBL does not differentiate between the different types of RNA
records, while NCBI (and DDBJ) do. In Entrez EMBL records are
patched up to add that information.
Refseq and LocusLink
• Attempt to produce 1 mRNA, 1 protein, and
  1 genomic gene for each frequently
  occurring allele of a protein-expressing gene.
• www.ncbi.nlm.nih.gov/LocusLink
• Special non-genbank Accession numbers
  –   NM_nnnnnn mRNA refseq
  –   NP_nnnnnn protein refseq
  –   NC_nnnnnn refseq genomic contig
  –   NT_nnnnnn temporary genomic contig
  –   NX_nnnnnn predicted gene
Genbank divisions

Sequences in Genbank are split into various categories based
   on
1) the quality and type of the sequences;
2) for the high-quality nucleotide sequences, the organism
   (organism-dependent divisions).
• Genbank Entry type: (and query to restrict to that
  field)
   – mRNA (1/10000 errors)
       • biomol_mRNA [PROP]
   – cDNA (EST, 95-99% accuracy, single pass )
       • gbdiv_EST [PROP]
   – genomic ( biomol_genomic [PROP])
       • in HTGS division: >99% accuracy;
            – gbdiv_HTG [PROP]
       • GSS(low-quality genome survey sequences)
            – gbdiv_GSS [PROP]
       • rest of Genbank; 1/10000 accuracy.
            – Human gbdiv_PRI [PROP]
            – mouse gbdiv_ROD [PROP]
            – bovine gbdiv_MAM [PROP]
   – STS(EST or cDNA used in mapping)
       • gbdiv_STS [PROP]
FASTA Format
                                        MOST important
                                        data format!!!
>identifier descriptive text
nucleotide or amino-acid
sequence on multiple lines if needed.

Example:
>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….
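Because the format is this simple, a reader for it fits in a few lines. The sketch below is a minimal parser written for illustration (the names parse_fasta, _record and EXAMPLE are not from any library); it splits the defline at the first space into identifier and descriptive text and joins the sequence lines. The example text is the start of the record shown above, without the trailing ellipsis.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # The identifier is everything up to the first space of the defline.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


EXAMPLE = """>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC
"""

for identifier, description, sequence in parse_fasta(EXAMPLE):
    print(identifier, len(sequence))  # gi|41|emb|X63129.1|BTA1AT 70
```

For real work an established library parser is the safer choice; the point here is only how little structure the format has.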
Modified FASTA Format
1) A few tools follow the convention that
   lower case sequences are masked. (repeat
   masker, some versions of blast, megablast,
   blastz)
2) A few analysis tools (like CLUSTAL)
   want a simplified identifier on the defline..
   So they can have a short string for the
   alignment.
>X63129.1
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….
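Both conventions are easy to handle in code. The sketch below assumes the "gi|number|database|accession.version|locus" defline layout seen in the earlier example; the function names short_identifier and hard_mask are hypothetical, and using N as the masking letter is a choice made for nucleotide sequences.

```python
def short_identifier(defline):
    """Reduce 'gi|41|emb|X63129.1|BTA1AT ...' to 'X63129.1'; leave other deflines alone."""
    identifier = defline.lstrip(">").split()[0]
    fields = identifier.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]  # accession.version field
    return identifier


def hard_mask(seq, mask_char="N"):
    """Replace lower-case (soft-masked) letters with mask_char."""
    return "".join(mask_char if c.islower() else c for c in seq)


print(short_identifier(">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin"))  # X63129.1
print(hard_mask("ACGTatatatACGT"))  # ACGTNNNNNNACGT
```

Converting soft masking to hard masking is useful for tools that ignore letter case and would otherwise align the masked region.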
Feature table
       (NCBI;EMBL/DDBJ)
• http://www.ncbi.nlm.nih.gov/collab/FT/index.htm
Genbank Data format
    41

•   LOCUS     BTA1AT       1380 bp mRNA            MAM       30-APR-1992
•   DEFINITION B.taurus mRNA for alpha-1-antitrypsin.
•   ACCESSION X63129
•   NID    g41
•   VERSION X63129.1 GI:41
•   KEYWORDS alpha-1 antitrypsin; serine protease inhibitor; serpin.
•   SOURCE     Bos taurus.
•   ORGANISM Bos taurus
•        Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
•        Artiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos.
Genbank References
•   LOCUS     BTA1AT         1380 bp mRNA             MAM       30-APR-1992
•   ...
•   REFERENCE 1 (bases 1 to 1380)
•   AUTHORS Sinha,D.
•   TITLE Direct Submission
•   JOURNAL Submitted (22-OCT-1991) D. Sinha, Dept of Biochemistry,
    Temple University, 3400 North Broad Street, Philadelphia, PA 19140, USA
•   REFERENCE 2 (bases 1 to 1380)
•   AUTHORS Sinha,D., Bakhshi,M.R. and Kirby,E.P.
•   TITLE Complete cDNA sequence of bovine alpha 1-antitrypsin
•   JOURNAL Biochim. Biophys. Acta 1130 (2), 209-212 (1992)
•   MEDLINE 92223096
•   FEATURES            Location/Qualifiers
•
Genbank Source Qualifier
•   LOCUS        BTA1AT         1380 bp mRNA     MAM   30-APR-1992
•   ...
•   FEATURES               Location/Qualifiers
•       source        1..1380
•                 /organism="Bos taurus"
•                 /db_xref="taxon:9913"
•                 /tissue_type="liver"
•                 /cell_type="hepatocyte"
•                 /clone_lib="lambda gt11"
•                 /clone="2f-Ic"
•       mRNA            <1..>1380
•       sig_peptide 33..104
•       ...
Genbank mRNA+CDS features
•   mRNA         <1..>1380
•   sig_peptide 33..104
•     CDS        33..1283
•                /codon_start=1
•                /product="alpha-1-antitrypsin"
•                /protein_id="CAA44840.1"
•                /db_xref="PID:g42"
•                /db_xref="GI:42"
•                /db_xref="SWISS-PROT:P34955"
•                /translation="MALSITRGLLLLAALCCLAPISLAGVLQGHAVQETDDTSHQEAACH
    KIAPNLANFAFSIYHHLAHQSNTSNIFFSPVSIASAFAMLSLGAKGNTHTEILKGL
    GFNLTELAEAEIHKGFQHLLHTLNQPNHQLQLTTGNGLFINESAKLVDTFLEDV
    KNLYHSEAFSINFRDAEEAKKKINDYVEKGSHGKIVELVKVLDPNTVFALVNYI
    SFKGKWEKPFEMKHTTERDFHVDEQTTVKVPMMNRLGMFDLHYCDKLASWV
    LLLDYVGNVTACFILPDLGKLQQLEDKLNNELLAKFLEKKYASSANLHLPKLSI
    SETYDLKSVLGDVGITEVFSDRADLSGITKEQPLKVSKALHKAALTIDEKGTEA
    VGSTFLEAIPMSLPPDVEFNRPFLCILYDRNTKSPLFVGKVVNPTQA"
•      mat_peptide 105..1280
•                /product="alpha-1-antitrypsin"
•      polyA_signal 1343..1348
•   ...
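The feature locations above can be checked against each other with simple arithmetic. The sketch below is only an illustration: `parse_location` is a hypothetical helper that handles the plain `start..end` form used on this slide (with the optional `<` and `>` markers that flag a feature extending beyond the sequenced region), not joins or complements.

```python
import re

# start..end with optional partial markers, e.g. "33..1283" or "<1..>1380".
LOCATION = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_location(text):
    """Return start, end (1-based, inclusive), length and partial flags."""
    match = LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    start, end = int(match.group(2)), int(match.group(4))
    return {
        "start": start,
        "end": end,
        "length": end - start + 1,
        "partial_5": bool(match.group(1)),
        "partial_3": bool(match.group(3)),
    }


cds = parse_location("33..1283")
sig = parse_location("33..104")
mat = parse_location("105..1280")
mrna = parse_location("<1..>1380")

# The CDS includes the stop codon, so the protein is one codon shorter.
protein_length = cds["length"] // 3 - 1
print(cds["length"], protein_length)            # CDS length in bases, residues
print(sig["length"] // 3, mat["length"] // 3)   # signal and mature peptide residues
```

For this record the numbers are consistent: the signal peptide and the mature peptide together account for every residue of the translation, and the three bases between the end of mat_peptide (1280) and the end of the CDS (1283) are the stop codon.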
GenBank Sequence format
•   BASE COUNT       357 a    413 c    322 g    288 t
•   ORIGIN
•        1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc
•       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac
•      121 acgctgtcca agagacagat gatacatccc accaggaagc agcgtgccac aagattgccc
•      181 ccaacctggc caactttgcc ttcagcatat accaccattt ggctcatcag tccaacacca
•      241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt tgcgatgctc tccctgggag
•      301 ccaagggcaa cactcacact gagatcctga agggcctggg tttcaacctc actgagctcg
•      361 cagaggctga gatccacaaa ggctttcagc atcttctcca caccctgaac cagccaaacc
•   ...
•     1321 gtccccccac tccctccatg gcattaaagg atgactgacc tagccccgaa aaaaaaaaaa
•   //
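To connect the sequence block with the CDS feature (33..1283), the ORIGIN lines can be joined into one string and translated from base 33. This is a rough sketch, assuming the standard genetic code; `parse_origin` and `translate` are hypothetical helper names, and only the first two ORIGIN lines of the record are used.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def parse_origin(lines):
    """Join ORIGIN sequence lines, dropping position numbers and spaces."""
    return "".join(ch for line in lines for ch in line if ch.isalpha()).upper()


def translate(seq, start):
    """Translate from a 1-based start until a stop codon or the last whole codon."""
    protein = []
    for i in range(start - 1, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First two sequence lines of BTA1AT (bases 1..120).
origin = [
    "        1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc",
    "       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac",
]
seq = parse_origin(origin)
print(seq[32:35])          # bases 33..35, the start codon of the CDS
print(translate(seq, 33))  # beginning of the /translation qualifier
```

The translated fragment should agree with the start of the `/translation` qualifier on the CDS feature, which is a quick way to confirm that feature coordinates are 1-based and inclusive.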
EMBL DATA FORMAT
• EMBL: http://www.ebi.ac.uk/Databases/
• http://www.ebi.ac.uk/cgi-bin/emblfetch
• Use Accession X63129
DDBJ DATA FORMAT
• DDBJ: http://www.ddbj.nig.ac.jp/
• http://ftp2.ddbj.nig.ac.jp:8000/getstart-e.html
• Use Accession X63129
• The flat file format is the same as the
  NCBI/GenBank format.
Entrez
• Index-based search system. Each field in
  the database is searchable individually or in
  aggregate.
  – (e.g. CDS [FKEY])
  – default is aggregate [ALL FIELDS] *
• All primary databases are interlinked as one
  big relational database.
  – (e.g. PubMed links in GenBank records)
• Phrase matching.
  – Human genome -> “human genome”
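Field-restricted searches like the ones above can also be composed programmatically. The sketch below only builds a query string and a URL; it does not contact NCBI. The E-utilities `esearch.fcgi` endpoint, the `nucleotide` database name and the `[ORGN]` tag are not mentioned on the slide and are included here as assumptions for illustration, as are the helper names.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint: NCBI E-utilities ESearch (not part of the original slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value, field=None):
    """Quote multi-word values (phrase matching) and append an optional field tag."""
    if " " in value and not value.startswith('"'):
        value = f'"{value}"'
    return f"{value}[{field}]" if field else value


def build_query(terms, operator="AND"):
    """Combine individual terms with a Boolean operator."""
    return f" {operator} ".join(terms)


def esearch_url(db, term, retmax=20):
    """Return a URL-encoded search URL; nothing is fetched here."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = build_query([field_term("CDS", "FKEY"), field_term("Bos taurus", "ORGN")])
url = esearch_url("nucleotide", term)
print(term)
print(url)
```

Leaving out the field tag corresponds to the default aggregate search over all fields; quoting a multi-word value forces it to be matched as a phrase.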
Entrez
• Neighbours are available (related documents or
  related sequences).
• In PubMed searches: term mapping to
  neighbouring documents and neighbouring terms.
• Term mapping to chemical names.
  – In PubMed: term [All Fields] is mapped to
    chemical names + MeSH terms + text fields,
  – ... unless “term” is within double quotes.
Entrez
• http://www.ncbi.nlm.nih.gov/Entrez/

• Tutorials:
• http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html
• http://www.ncbi.nlm.nih.gov/Literature/pubmed_search.
• http://www.ncbi.nlm.nih.gov/Database.tut1.html
SWISSPROT
            http://www.expasy.ch/sprot/sprot_details.html


1. Core data: protein sequence data; the citation information and the
   taxonomic data
2. Annotation
   • Function(s) of the protein
   • Domains and sites. For example calcium binding regions, ATP-
       binding sites, zinc fingers, homeobox, kringle, etc.
   • Post-translational modification(s). For example carbohydrates,
       phosphorylation, acetylation, GPI-anchor, etc.
   • Secondary structure
   • Quaternary structure. For example homodimer, heterotrimer, etc
   • Similarities to other proteins
   • Disease(s) associated with deficiency(ies) in the protein
   • Sequence conflicts, variants, etc.
SWISSPROT

http://www.expasy.ch/cgi-bin/get-random-entry.pl?S
REBASE (Restriction enzymes dataBASE)
Restriction enzymes recognize a specific sequence pattern; the
    actual cutting site lies within that pattern or a few bases
    away from it.
http://rebase.neb.com/rebase/rebase.html
I prefer the Bairoch format (SWISSPROT format)
http://rebase.neb.com/rebase/rebase.f19.html
ID enzyme name
ET enzyme type
OS microorganism name
PT prototype
RS recognition sequence, cut site
MS methylation site (type)
CR commercial sources for the restriction enzyme
CM commercial sources for the methylase
RN [count]
RA authors
RL jour, vol, pages, year, etc.
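The relation between a recognition sequence and its cut site can be illustrated with a small digest. This is a sketch under stated assumptions: the cut position is written with a caret inside the site (for example `G^AATTC` for EcoRI, an enzyme chosen here as an example and not taken from the slide), only the given strand is scanned, and sites with ambiguity codes or cuts outside the pattern are not handled. All function names are hypothetical.

```python
def parse_site(site):
    """Split a caret-marked site such as 'G^AATTC' into (pattern, cut offset)."""
    offset = site.index("^")
    return site.replace("^", ""), offset


def cut_positions(seq, site):
    """Return 0-based positions in seq where the enzyme cuts the given strand."""
    pattern, offset = parse_site(site)
    seq = seq.upper()
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(pattern, start + 1)  # allow overlapping sites
    return positions


def digest(seq, site):
    """Cut seq at every site and return the resulting fragments in order."""
    fragments, previous = [], 0
    for cut in cut_positions(seq, site):
        fragments.append(seq[previous:cut])
        previous = cut
    fragments.append(seq[previous:])
    return fragments


example = "TTGAATTCAAGGAATTCTT"
print(cut_positions(example, "G^AATTC"))
print(digest(example, "G^AATTC"))
```

Because the example site is palindromic, scanning one strand finds every site; for a non-palindromic site the reverse complement would have to be searched as well.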
Exercises
•You can work in teams for this.
•1a) Use the first 6000 bases of your genomic piece [or find a
bacterial genomic or mRNA sequence in Entrez with a length in the range
2000:10000].
•b) Use the ORF finder to find the gene(s). Compare the answer you
get with the annotation you can infer from running blastn against GenBank
and blastx against a protein database.
•Do the Entrez exercises (separate Word document).
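For exercise 1b it can help to see what an ORF finder does before using the NCBI tool. The following is a minimal sketch, not the NCBI ORF finder itself: it assumes the standard stop codons, accepts only ATG as a start codon, and reports minus-strand coordinates relative to the reverse-complemented sequence. The function names and the `min_codons` parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base and read the result backwards."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons=2):
    """Yield (frame, start, end) for ATG..stop ORFs; 0-based start, end exclusive."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield frame + 1, start, i + 3
                start = None


def find_orfs(seq, min_codons=2):
    """Scan all six frames: three on the plus strand, three on the minus strand."""
    plus = [("+", f, s, e) for f, s, e in orfs_in_strand(seq, min_codons)]
    minus = [("-", f, s, e)
             for f, s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return plus + minus


print(reverse_complement("ACGTGATCGATGCTA"))
print(find_orfs("CCATGAAATAGGG"))
```

A real run on 2000 to 10000 bases would use a much larger `min_codons` threshold, since short ORFs occur by chance in any sequence; the candidates that survive are the ones worth comparing with blastn and blastx hits.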

Central dogma

  • 1. Fundamentals in Sequence Analysis 1.(part 1) Review of Basic biology + database searching in Biology. Hugues Sicotte NCBI
  • 2. The Flow of Biotechnology Information Gene Function > DNA sequence AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA > Protein sequence TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA PDEAEQDCIEFGKKIANI ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG TAAGAAGATCGCGAACATCTAGTAGA
  • 3. Prequisites to Sequence Analysis • Basic Biology so you can understand the language of the databases: Central Dogma (transcription; Translation, Prokaryotes, Eukaryotes,CDS, 3 ´UTR, 5´UTR, introns, exons, promoters, operons, codons, start codons, stop codons,snRNA,hnRNA,tRNA, secondary structure, tertiary structure). • Before you can analyze sequences.. You have to understand their structure.. And know about Basic Biological Database Searching
  • 4. Central Dogmas of Molecular Biology 1) The concept of genes is historically defined on the basic of genetic inheritance of a phenotype. (Mendellian Inheritance) 2) The DNA an organism encodes the genetic information. It is made up of a double stranded helix composed of ribose sugars. Adenine(A), Citosine (C), Guanine (G) and Thymine (T). [note that only 4 values nees be encode ACGT.. Which can be done using 2 bits.. But to allow redundant letter combinations (like N means any 4 nucleotides), one usually resorts to a 4 bit alphabet.]
  • 5. Central Dogmas of Molecular Biology 3) Each side of the double helix faces it´s complementary base. A T, and G  C. 4) Biochemical process that read off the DNA always read it from the 5 ´´side towards the 3´ side. (replication and transcription). 5) A gene can be located on either the ´plus strand´ or the minus strand. But rule 4) imposes the orientation of reading .. And rule 3 (complementarity) tells us to complement each base E.g. If the sequence on the + strand is ACGTGATCGATGCTA, the – strand must be read off by reading the complement of this sequence going ´backwards´ e.g. TAGCATCGATCACGT
  • 6. Central Dogmas of Molecular Biology 6) DNA information is copied over to mRNA that acts as a template to produce proteins. We often concentrate on protein coding genes, because proteins are the building blocks of cells and the majority of bio-active molecules. (but let´s not forget the various RNA genes)
  • 7. Prokaryotic genes Prokaryotes (intronless protein coding genes) Upstream (5’) Gene region promoter Downstream (3’) TAC DNA Transcription (gene is encoded on minus strand .. And the reverse complement is read into mRNA) ATG mRNA 5´ UTR CoDing Sequence (CDS) 3´ UTR ATG Translation: tRNA read off each codons, 3 bases at a time, starting at start codon until it reaches a STOP codon. protein
  • 8. Why does Nature bothers with the mRNA? Why would the cell want to have an intermediate between DNA and the proteins it encodes? •Gene information can be amplified by having many copies of an RNA made from one copy of DNA. •Regulation of gene expression can be effected by having specific controls at each element of the pathway between DNA and proteins. The more elements there are in the pathway, the more opportunities there are to control it in different circumstances. •In Eukaryotes, the DNA can then stay pristine and protected, away from the caustic chemistry of the cytoplasm.
  • 9. Prokaryotic genes (operons) Prokaryotes (operon structure) upstream promoter downstream Gene 1 Gene 2 Gene 3 In prokaryotes, sometimes genes that are part of the same operational pathway are grouped together under a single promoter. They then produce a pre-mRNA which eventually produces 3 separates mRNA´s.
  • 10. Bacterial Gene Structure of signals Bacterial genomes have simple gene structure. - Transcription factor binding site. - Promoters -35 sequence (T82T84G78A65C54A45) 15-20 bases -10 sequence (T80A95T45A60A50T96) 5-9 bases - Start of transcription : initiation start: Purine90 (sometimes it’s the “A” in CAT) - translation binding site (shine-dalgarno) 10 bp upstream of AUG (AGGAGG) - One or more Open Reading Frame •start-codon (unless sequence is partial) •until next in-frame stop codon on that strand .. Separated by intercistronic sequences. - Termination
  • 11. Genetic Code How does an mRNA specify amino acid sequence? The answer lies in the genetic code. It would be impossible for each amino acid to be specified by one nucleotide, because there are only 4 nucleotides and 20 amino acids. Similarly, two nucleotide combinations could only specify 16 amino acids. The final conclusion is that each amino acid is specified by a particular combination of three nucleotides, called a codon: Each 3 nucleotide code for one amino acid. •The first codon is the start codon, and usually coincides with the Amino Acid Methionine. (M which has codon code ‘ATG’) •The last codon is the stop codon and does NOT code for an amino acid. It is sometimes represented by ‘*’ to indicate the ‘STOP’ codon. •A coding region (abbreviation CDS) starts at the START codon and ends at the STOP codon.
  • 12. Codon table Note the degeneracy of the genetic code. Each amino acid might have up to six codons that specify it. • Different organisms have different frequencies of codon usage. •A handful of species vary from the codon association described above, and use different codons fo different amino acids. How do tRNAs recognize to which codon to bring an amino acid? The tRNA has an anticodon on its mRNA-binding end that is complementary to the codon on the mRNA. Each tRNA only binds the appropriate amino acid for its anticodon.
  • 13. RNA RNA has the same primary structure as DNA. It consists of a sugar-phosphate backbone, with nucleotides attached to the 1' carbon of the sugar. The differences between DNA and RNA are that: 1. RNA has a hydroxyl group on the 2' carbon of the sugar (thus, the difference between deoxyribonucleic acid and ribonucleic acid. 2. Instead of using the nucleotide thymine, RNA uses another nucleotide called uracil: 3. Because of the extra hydroxyl group on the sugar, RNA is too bulky to form a stable double helix. RNA exists as a single-stranded molecule. However, regions of double helix can form where there is some base pair complementation (U and A , G and C), resulting in hairpin loops. The RNA molecule with its hairpin loops is said to have a secondary structure. 4. Because the RNA molecule is not restricted to a rigid double helix, it can form many different stable three-dimensional tertiary structures.
  • 14. tRNA ( transfer RNA) is a small RNA that has a very specific secondary and tertiary structure such that it can bind an amino acid at one end, and mRNA at the other end. It acts as an adaptor to carry the amino acid elements of a protein to the appropriate place as coded for by the mRNA. T Three- dimensional Tertiary Secondary structure of tRNA structure
  • 15. Bacterial Gene Prediction Most of the consensus sequences are known from ecoli studies. So for each bacteria the exact distribution of consensus will change. Most modern gene prediction programs need to be “trained”. E.g. they find their own consensus and assembly rules given a few examples genes. A few programs find their own rules from a completely unannotated bacterial genome by trying to find conserved patterns. This is feasible because ORF’s restrict the search space of possible gene candidates. E.g. selfid program(selfid@igs.cnrs-mrs.fr)
  • 16. Open Reading Frames The simplest bacterial gene prediction techniques simply 1) identify all open reading frames(ORFs), 2) and blastx them against known proteins. 3) The ORFs with the best homology are retained first. 4) This usually densely covers the bacterial genomes with genes. rRNA and tRNA are detected separately using tRNAScan or blastn.
  • 17. Open Reading Frames (ORF) On a given piece of DNA, there can be 6 possible frames. The ORF can be either on the + or minus strand and on any of 3 possible frames Frame 1: 1st base of start codon can either start at base 1,4,7,10,... Frame 2: 1st base of start codon can either start at base 2,5,8,11,... Frame 3: 1st base of start codon can either start at base 3,6,9,12,... (frame –1,-2,-3 are on minus strand) Some programs have other conventions for naming frames.. (0..5, 1-6, etc) Gene finding in eukaryotic cDNA uses ORF finding +blastx as well. http://www.ncbi.nlm.nih .gov/gorf/gorf.html try with gi=41 ( or your own piece of DNA)
  • 18. Eukaryotic Central Dogma In Eukaryotes ( cells where the DNA is sequestered in a separate nucleus) The DNA does not contain a duplicate of the coding gene, rather exons must be spliced. ( many eukaryotes genes contain no introns! .. Particularly true in ´lower´ organisms) mRNA – (messenger RNA) Contains the assembled copy of the gene. The mRNA acts as a messenger to carry the information stored in the DNA in the nucleus to the cytoplasm where the ribosomes can make it into protein.
  • 19. Eukaryotic Nuclear Gene Structure Gene prediction for Pol II transcribed genes. • Upstream Enhancer elements. • Upstream Promoter elements. • GC box(-90nt) (20bp), CAAT box(-75 nt)(22bp) • TATA promoter (-30 nt) (70%, 15 nt consensus (Bucher et al (1990)) • 14-20 nt spacer DNA • CAP site (8 bp) • Transcription Initiation. • Transcript region, interrupted by introns. Translation Initiation (Kozak signal 12 bp consensus) 6 bp prior to initiation codon. • polyA signal (AATAAA 99%,other)
  • 20. introns •Transcript region, interrupted by introns. Each introns •starts with a donor site consensus (G100T100A62A68G84T63..) •Has a branch site near 3’ end of intron (one not very conserved consensus UACUAAC) •ends with an acceptor site consensus. (12Py..NC65A100G100) UACUAAC AG
  • 21. Exons •The exons of the transcript region are composed of: •5’UTR (mean length of 769 bp) with a specific base composition, that depends on local G+C content of genome) •AUG (or other start codon) •Remainder of coding region •Stop Codon •3’ UTR (mean length of 457, with a specific base composition that depends on local G+C content of genome)
  • 22. Structure of the Eukaryotic Genome ~6-12% of human DNA encodes proteins (higher fraction in nematode). ~10% of human DNA codes for UTR. ~90% of human DNA is non-coding.
  • 23. Non-Coding Eukaryotic DNA • Untranslated regions (UTRs) • introns (there can be genes within the introns of another gene!) • intergenic regions: - repetitive elements - pseudogenes (dead genes that may, or may not, have been retroposed back into the genome as a single-exon “gene”)
  • 24. Pseudogenes Pseudogenes: DNA sequence that looks like it might code for a gene, but that is unable to result in a protein. This deficiency might be in transcription (lack of a promoter, for example), in translation, or both. Processed pseudogenes: a gene retroposed back into the genome after being processed by the splicing apparatus. Thus it is fully spliced and has a polyA tail. The insertion process flanks the mRNA sequence with short direct repeats. Thus there is no promoter, unless it is accidentally retroposed downstream of a promoter sequence. Do not confuse these with single-exon genes.
  • 25. Repeats Each repeat family has many subfamilies. - ALU: ~300 nt long; 600,000 elements in the human genome. Can cause false homology with mRNA. Many have an AluI restriction site. - Retroposons (can get copied back into the genome). - Telltale sign: direct or inverted repeats flank the repeated element. That repeat was the priming site for the RNA that was inserted. - LINEs (Long INterspersed Elements): L1, 1-7 kb long, 50,000 copies. Have two ORFs! Will cause problems for gene prediction programs. - SINEs (Short INterspersed Elements)
  • 26. Low-Complexity Elements • When analyzing sequences, one often relies on the fact that two stretches are similar to infer that they are homologous (and therefore related). But sequences with repeated patterns will match without there being any phylogenetic relation! • Sequences like ATATATACTTATATA, which are mostly two letters, are called low-complexity. • Triplet repeats (particularly CAG) have a tendency to make the replication machinery stutter, so they are amplified. • Low-complexity sequence can also be hidden at the translated protein level.
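One simple way to put a number on "mostly two letters" is the Shannon entropy of the base composition. The sketch below is an assumption for illustration; it is not the measure used by any particular filtering program, and the function name is hypothetical. A sequence using all four bases equally scores 2 bits, and a single-letter run scores 0.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy (in bits per base) of the letter composition of seq."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # The slide's low-complexity example versus a balanced sequence.
    print(shannon_entropy("ATATATACTTATATA"))
    print(shannon_entropy("ACGTACGTACGTACG"))
```

In practice such a score would be computed over a sliding window, so that a short low-complexity stretch inside a longer sequence is not averaged away.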
  • 27. Masking • To avoid finding spurious matches in alignment programs, you should always mask out repeats and low-complexity regions in the query sequence. • Before predicting genes it is a good idea to mask out repeats (at least those containing ORFs). • Before running blastn against a genomic record, you must mask out the repeats. • Most used programs: CENSOR; RepeatMasker: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
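What "masking" does to a sequence can be shown with a small sketch. It assumes the repeat intervals are already known (for example from a repeat-finding program) and are given as 1-based inclusive (start, end) pairs; the function name and the interval format are hypothetical. Soft masking lower-cases the bases, which is the convention some tools follow; hard masking replaces them with N.

```python
def mask(seq, intervals, soft=True):
    """Mask the given 1-based inclusive intervals of seq.

    soft=True lower-cases the masked bases; soft=False replaces them with N.
    """
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


if __name__ == "__main__":
    print(mask("ACGTACGTAC", [(3, 5)]))              # soft masking
    print(mask("ACGTACGTAC", [(3, 5)], soft=False))  # hard masking
```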
  • 28. More Non-Protein genes rRNA - ribosomal RNA is one of the structural components of the ribosome. It has sequence complementarity to regions of the mRNA so that the ribosome knows where to bind to an mRNA it needs to make protein from. snRNA - small nuclear RNA is part of the machinery that processes RNAs in the nucleus (splicing in particular) before they travel to the cytoplasm. hnRNA - heterogeneous nuclear RNA: the unprocessed primary transcripts (pre-mRNA) found in the nucleus.
  • 29. Protein Processing & localization. The protein as read off from the mRNA may not be in the final form that will be used in the cell. Some proteins contain: • A signal peptide (located at the N-terminus, i.e. the beginning). This signal peptide is used to guide the protein towards its final cellular localization. It is cleaved off at the cleavage site once the protein has reached (or is near) its final destination. • Various post-translational modifications (e.g. phosphorylation). The final protein is called the “mature peptide”.
  • 30. Convention for nucleotides in database The mRNA is actually read off the minus (template) strand of the DNA, so its sequence matches the opposite, plus strand; nucleotide sequences are always quoted in the same sense as the mRNA. In bioinformatics the sequence format does NOT make a difference between Uracil and Thymine. There is no symbol for Uracil: it is always represented by a ´T´. Even genomic sequence follows that convention. A gene on the ´plus´ strand is quoted so that it is on the same strand as its product mRNA.
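As a small illustration of this convention, the sketch below converts an RNA string to the database alphabet (U written as T) and reverse-complements a sequence, which is how a gene lying on the other strand is brought into the same sense as its mRNA. Both function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_database_alphabet(rna):
    """Write an RNA sequence the way databases quote it: U becomes T."""
    return rna.upper().replace("U", "T")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(to_database_alphabet("AUGGCACUCUCC"))
    print(reverse_complement("GGAGAGTGCCAT"))
```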
  • 31. Biology Information on the Internet
  • 32. Biology Information on the Internet • Introduction to Databases • Searching the Internet for Biology Information. – General Search methods – Biology Web sites • Introduction to Genbank file format. • Introduction to Entrez and Pubmed • Ref: Chapters 1,2,5,6 of “Bioinformatics”
  • 33. • Databases (spread-sheet / flat-file version of a database): – A collection of records. – Each record has many fields. – Each field contains specific information. – Each field has a data type. » E.g. money, currency, text field, integer, date, address (text field), citation (text field). – Each record has a primary key: a UNIQUE identifier that unambiguously defines this record.
    gi       Accession  version  date      Genbank Division  taxid  organism      Number of Chromosomes
    6226959  NM_000014  3        06/01/00  PRI               9606   homo sapiens  22 diploid + X+Y
    6226762  NM_000014  2        10/12/99  PRI               9606   homo sapiens  22 diploid + X+Y
    4557224  NM_000014  1        02/04/99  PRI               9606   homo sapiens  22 diploid + X+Y
    41       X63129     1        06/06/96  MAM               9913   bos taurus    29+X+Y
  • 34.
    gi       Accession  version  date        Genbank Division  taxid  organism      Number of Chromosomes
    6226959  NM_000014  3        01/06/2000  PRI               9606   homo sapiens  22 diploid + X+Y
    6226762  NM_000014  2        12/10/1999  PRI               9606   homo sapiens  22 diploid + X+Y
    4557224  NM_000014  1        04/02/1999  PRI               9606   homo sapiens  22 diploid + X+Y
    41       X63129     1        06/06/1996  MAM               9913   bos taurus    29+X+Y
    GI = Genbank Identifier: unique key, primary key. The GI changes with each update of the sequence record. Accession Number: secondary key. Points to the same locus and sequence despite sequence updates. Accession + Version Number is equivalent to the GI.
  • 35.
    gi       Accession  version  date        Genbank Division  taxid  organism      Number of Chromosomes
    6226959  NM_000014  3        01/06/2000  PRI               9606   homo sapiens  22 diploid + X+Y
    6226762  NM_000014  2        12/10/1999  PRI               9606   homo sapiens  22 diploid + X+Y
    4557224  NM_000014  1        04/02/1999  PRI               9606   homo sapiens  22 diploid + X+Y
    41       X63129     1        06/06/1996  MAM               9913   bos taurus    29+X+Y
    Relational Database (normalizing a database for repeated sub-elements: splitting it into smaller databases, and relating the sub-databases to the first one using the primary key).
    gi       Accession  version  date        Genbank Division  taxid
    6226959  NM_000014  3        01/06/2000  PRI               9606
    6226762  NM_000014  2        12/10/1999  PRI               9606
    4557224  NM_000014  1        04/02/1999  PRI               9606
    41       X63129     1        06/06/1996  MAM               9913

    taxid  organism      Number of Chromosomes
    9606   homo sapiens  22 diploid + X+Y
    9913   bos taurus    29+X+Y
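The normalization on this slide can be reproduced with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for whatever database system is actually used, and the table and column names (sequence, taxon, chromosomes, ...) are chosen here for illustration. The taxid column relates each sequence row to its single organism row, and a join rebuilds the original wide table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (
    taxid       INTEGER PRIMARY KEY,
    organism    TEXT,
    chromosomes TEXT
);
CREATE TABLE sequence (
    gi        INTEGER PRIMARY KEY,
    accession TEXT,
    version   INTEGER,
    date      TEXT,
    division  TEXT,
    taxid     INTEGER REFERENCES taxon(taxid)
);
""")

# Organism data is stored once per taxid instead of once per sequence.
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", [
    (9606, "homo sapiens", "22 diploid + X+Y"),
    (9913, "bos taurus", "29+X+Y"),
])
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?, ?, ?)", [
    (6226959, "NM_000014", 3, "01/06/2000", "PRI", 9606),
    (6226762, "NM_000014", 2, "12/10/1999", "PRI", 9606),
    (4557224, "NM_000014", 1, "04/02/1999", "PRI", 9606),
    (41, "X63129", 1, "06/06/1996", "MAM", 9913),
])

# Joining on taxid recovers the organism for every sequence record.
rows = conn.execute("""
    SELECT s.gi, s.accession, s.version, t.organism
    FROM sequence AS s JOIN taxon AS t ON s.taxid = t.taxid
    ORDER BY s.gi
""").fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```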
  • 36. Types of Relational databases. • The Internet can be thought of as one enormous relational database. – The “links”/URLs are the primary keys. • SQL (Structured Query Language) – Sybase; Oracle; Access (database systems) • Sybase is used at NCBI. – SRS (one type of database querying system in use in biology)
  • 37. Indexed searches. • To allow easy searching of a database, make an index. • An index is a list of primary keys corresponding to a key in a given field (or to a collection of fields).
    Genbank division index:
      PRI        6226959; 6226762; 4557224; …
      MAM        41; …
    Accession index:
      NM_000014  6226959; 6226762; 4557224;
      X63129     41;
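Building such an index is a one-pass operation over the records. The sketch below uses the four example records from the earlier slides; the dictionary layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict

RECORDS = [
    {"gi": 6226959, "accession": "NM_000014", "division": "PRI"},
    {"gi": 6226762, "accession": "NM_000014", "division": "PRI"},
    {"gi": 4557224, "accession": "NM_000014", "division": "PRI"},
    {"gi": 41, "accession": "X63129", "division": "MAM"},
]


def build_index(records, field):
    """Map each value of `field` to the list of primary keys (gi) having it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["gi"])
    return dict(index)


if __name__ == "__main__":
    print(build_index(RECORDS, "division"))
    print(build_index(RECORDS, "accession"))
```

A search for division PRI then becomes a single dictionary lookup instead of a scan over every record.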
  • 38. Indexed searches. • Boolean Query: Merging and Intersecting lists: – AND (in both lists) (e.g. human AND genome) – +human +genome – human && genome – OR (in either list) (e.g. human OR genome) – human || genome
  • 39. Search strategies • Search engines use complex strategies that go beyond Boolean queries. – Phrase matching: • human genome -> “human genome” – Togetherness: documents with human close to genome are scored higher. – Term expansion & synonyms: • human -> homo sapiens – Neighbours: • human genome -> genome projects, chromosomes, genetics – Frequency of links (www.google.com) • To avoid this term mapping, enclose your queries in quotes: “human” AND “genome”
  • 40. Search strategies • Search engines use complex strategies that go beyond Boolean queries. • To avoid this term mapping, enclose your queries in quotes: “human” AND “genome” • To require that ALL the terms in your query be matched, precede them with a “+”. This also prevents term mapping. • To force the order of the words to be important, group them into a quoted phrase: “biology of mammals”.
  • 41. Indexed searches. Example • find the advanced query page at http://www.altavista.com • type human (and hit the Search button) • type genome • type human AND genome • type “human genome” (finds the fewest matches) • type human OR genome (finds the most matches)
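With an index in hand, AND and OR reduce to set intersection and set union of the lists of primary keys. The sketch below uses a made-up index of document numbers; the data and the function names are purely illustrative.

```python
# Hypothetical index: term -> set of document identifiers containing it.
INDEX = {
    "human": {1, 2, 3},
    "genome": {2, 3, 4},
}


def query_and(index, *terms):
    """Documents that appear in the list of every term."""
    return set.intersection(*(index.get(term, set()) for term in terms))


def query_or(index, *terms):
    """Documents that appear in the list of at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))


if __name__ == "__main__":
    print(query_and(INDEX, "human", "genome"))
    print(query_or(INDEX, "human", "genome"))
```

AND can only shrink the result and OR can only grow it, which is why human AND genome finds fewer matches than human OR genome.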
  • 42. • Search Engines: – Web Spiders: Collection of All web pages, but since Web pages change all the time and new ones appear, they must constantly roam the web and re-index.. Or depend on people submitting their own pages. • www.google.com (BEST!) • www.infoseek.com • www.lycos.com • www.exite.com • www.webcrawler.com • www.looksmart.com (country specific)
  • 43. • Search Engines: • www.google.com (BEST!) • Google ranks pages according to how many pages with those terms refer to the pages you are asking for. Not only must one document contain ALL the search terms, but other documents which refer to this one must also contain all the terms. • Great when you know what you are looking for! You can also use “” to require immediate proximity and order of terms. • E.g. type » Web server for the blast program. But google only indexes about 40% of the web.. So you may have to use other web spiders. (disclaimer.. I don’t own stock in that company.. But I’d like to)
  • 44. • Search Engines: – Curated Collections: Not comprehensive: Contains list of best sites for commonly requested topics, but is missing important sites for more specialized topics (like biology) • www.yahoo.com (Has travel maps too!) – Answer-based curated collections: Easy to use English-like queries. First looks at a list of predefined answers, then refines answers based on user interaction. Also answers new questions. • www.askjeeves.com • www.magellan.com • www.altavista.com (has translation TOOLS) • www.hotbot.com
  • 45. • Search Engines: – Meta-Search Engines: Poll several search engines, and return the consensus of all results. Likely to miss sites, but the sites returned are very relevant to the query. – The other operating mode is to return the sum of all the results. It then becomes very sensitive to a very detailed query. • www.metacrawler.com • www.savvysearch.com • www.1blink.com (fast) • www.metafind.com • www.dogpile.com
  • 46. • Virtual Libraries: Curated collections of links for Biologists.(by Biologists) – Pedro’s BioMolecular Research Tools:(1996) • http://www.public.iastate.edu/~pedro/ – Virtual Library: Bio Sciences • http://vlib.org/Biosciences.html – Publications and abstract search. • http://www.ncbi.nlm.nih.gov/ – Expasy server • http://www.expasy.ch – EBI Biocatalog (software & databases list) • http://www.ebi.ac.uk/biocat/
  • 47. Biological Databases • Nucleotide databases: – Genbank: International Collaboration • NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia) • A “bank”: no curation. Submission to these databases is required for publication in a journal. – Organism specific databases (Exercise: find URLs using search engines) • FlyBase • ChickGBASE • pigbase • wormpep • YPD (Yeast Protein Database) • SGD (Saccharomyces Genome Database)
  • 48. • Protein Databases: – NCBI: – Swiss Prot:(Free for academic use, otherwise commercial. Licensing restrictions on discoveries made using the DB. 1998 version free of any licensing) • http://www.expasy.ch(latest pay version) • NCBI has the latest free version. • Translated Proteins from Genbank Submissions – EMBL • TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT – PIR
  • 49. • Structure databases: – PDB: Protein structure database. • Http://www.rscb.org/pdb/ – MMDB: NCBI’s version of PDB with entrez links. • Http://www.ncbi.nlm.nih.gov • Genome Mapping Information: – http://www.il-st-acad-sci.org/health/genebase.html – NCBI(Human) – Genome Centers: • Stanford, Washington University, Stanford – Research Centers and Universities
  • 50. • Literature databases: – NCBI: Pubmed: All biomedical literature. • Www.ncbi.nlm.nih.gov • Abstracts and links to publisher sites for – full text retrieval/ordering – journal browsing. – Publisher web sites. – Biomednet: Commercial site for literature search. • Pathways Database: – KEGG: Kyoto Encyclopedia of Genes and Genomes: www.genome.ad.jp/kegg/kegg/html
  • 51. • Database Identifiers: Primary keys – GI (changes with each sequence update for NCBI only) • Annotation may change without the gi changing! – Accession(stable) – version(changes with each sequence update) – “Version” also refers to Accession.version – Secondary accession: Records may have been merged in the past.. So the records which were not chosen as the primary were made secondary.
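The relation between these identifiers can be made concrete with a short sketch. It splits an Accession.version string into its two parts and looks up the matching GI in a small table built from the NM_000014 and X63129 rows shown earlier; the function name and the lookup table are assumptions for illustration.

```python
# (accession, version) -> gi, taken from the example records in these slides.
GI_BY_VERSION = {
    ("NM_000014", 3): 6226959,
    ("NM_000014", 2): 6226762,
    ("NM_000014", 1): 4557224,
    ("X63129", 1): 41,
}


def split_accession(identifier):
    """Split 'NM_000014.3' into ('NM_000014', 3); version is None if absent."""
    accession, _, version = identifier.partition(".")
    return accession, (int(version) if version else None)


if __name__ == "__main__":
    key = split_accession("NM_000014.2")
    print(key, GI_BY_VERSION[key])
```

The accession alone stays stable across updates, while each accession.version pair maps to exactly one GI.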
  • 52. Primary Databases • A primary database is a repository of data derived from experiments or from research knowledge. – Genbank (nucleotide repository) – Protein DB, Swissprot – PDB (MMDB) – Pubmed (literature) – Genome Mapping databases. – Kegg Database (pathways)
  • 53. Secondary Databases • A secondary database contains information derived from other sources. – Refseq (curated collection of Genbank at NCBI) – Unigene (clustering of ESTs at NCBI) • Organism-specific databases are often a mix between primary and secondary.
  • 54. Genbank Records • A Bank: No attempt at reconciliation. • Submit a sequence ==> Get an Accession Number! – Cannot modify sequences without submitter’s consent. – No attempt at reconciliation (not a unique collection per LOCUS/gene). – Entries of various sequence quality and different sources ==> Separated into various divisions: • High quality sequences in taxon specific divisions. • Low quality sequences in usage specific databases. • A collaboration between NCBI, EMBL and DDBJ. They contain (nearly) the same information; only the data format differs. EMBL does not differentiate between the different types of RNA records, while NCBI (and DDBJ) do. In Entrez, EMBL records are patched up to add that information.
  • 55. Refseq and LocusLink • Attempt to produce 1 mRNA, 1 protein, and 1 genomic gene for each frequently occurring allele of a protein expressing gene. • www.ncbi.nlm.nih.gov/LocusLink • Special non-genbank Accession numbers – NM_nnnnnn mRNA refseq – NP_nnnnnn protein refseq – NC_nnnnnn refseq genomic contig – NT_nnnnnn temporary genomic contig – NX_nnnnnn predicted gene
  • 56. Genbank divisions Sequences in genbank are split into various categories based on 1) The quality and type of sequences 2) The high quality nucleotide sequences are divided into organism-dependent divisions.
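The accession prefixes listed on this slide are easy to recognize programmatically. The sketch below only knows the five prefixes given here and treats everything else (including ordinary Genbank accessions) as unknown; the function name and the regular expression are assumptions.

```python
import re

# Prefix -> meaning, exactly as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA refseq",
    "NP": "protein refseq",
    "NC": "refseq genomic contig",
    "NT": "temporary genomic contig",
    "NX": "predicted gene",
}

# Two capital letters, an underscore, digits, and an optional .version suffix.
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_\d+(\.\d+)?$")


def refseq_type(accession):
    """Return the slide's description for a refseq accession, else None."""
    match = _REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    print(refseq_type("NM_000014"))
    print(refseq_type("X63129"))
```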
  • 57. • Genbank Entry type: (and query to restrict to that field) – mRNA (1/10000 errors) • biomol_mRNA [PROP] – cDNA (EST, 95-99% accuracy, single pass ) • gbdiv_EST [PROP] – genomic ( biomol_genomic [PROP]) • in HTGS division: >99% accuracy; – gbdiv_HTG [PROP] • GSS(low-quality genome survey sequences) – gbdiv_GSS [PROP] • rest of Genbank; 1/10000 accuracy. – Human gbdiv_PRI [PROP] – mouse gbdiv_ROD [PROP] – bovine gbdiv_MAM [PROP] – STS(EST or cDNA used in mapping) • gbdiv_STS [PROP]
  • 58. FASTA Format MOST important data format!!! >identifier descriptive text nucleotide or amino-acid sequence on multiple lines if needed. Example: >gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC CATCACGCGGGGCCTTCTGCTGCTGGC ….
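Because the format is so simple, a reader for it fits in a few lines. The sketch below is a minimal illustration rather than a robust parser: the function name is hypothetical, the identifier is taken to be everything up to the first whitespace on the ">" line, and the sequence lines are simply concatenated.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    identifier = description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new defline closes the previous record, if any.
            if identifier is not None:
                records.append((identifier, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, description, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = (
        ">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin\n"
        "GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC\n"
        "CATCACGCGGGGCCTTCTGCTGCTGGC\n"
    )
    for ident, desc, seq in parse_fasta(example):
        print(ident, "|", desc, "|", len(seq), "bases")
```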
  • 59. Modified FASTA Format 1) A few tools follow the convention that lower case sequences are masked. (repeat masker, some versions of blast, megablast, blastz) 2) A few analysis tools (like CLUSTAL) want a simplified identifier on the defline.. So they can have a short string for the alignment. >X63129.1 GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC CATCACGCGGGGCCTTCTGCTGCTGGC ….
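Shortening a defline for tools that want a simple identifier can be sketched as follows. The code assumes the pipe-separated layout of the example above (gi, number, database, accession.version, locus) and falls back to the whole identifier for any other layout; the function name is hypothetical.

```python
def simplify_defline(defline):
    """Reduce '>gi|41|emb|X63129.1|BTA1AT description' to '>X63129.1'."""
    parts = defline.lstrip(">").split(None, 1)
    if not parts:
        return ">"
    identifier = parts[0]
    fields = identifier.split("|")
    # Layout assumed from the slide's example: gi|number|db|accession.version|locus
    if len(fields) >= 4 and fields[0] == "gi":
        return ">" + fields[3]
    return ">" + identifier


if __name__ == "__main__":
    print(simplify_defline(
        ">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin"
    ))
```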
  • 60. • WIM now will talk about GCG …
  • 61. Feature table (NCBI;EMBL/DDBJ) • http://www.ncbi.nlm.nih.gov/collab/FT/index.htm
  • 62. Genbank Data format 41 • LOCUS BTA1AT 1380 bp mRNA MAM 30-APR-1992 • DEFINITION B.taurus mRNA for alpha-1-antitrypsin. • ACCESSION X63129 • NID g41 • VERSION X63129.1 GI:41 • KEYWORDS alpha-1 antitrypsin; serine protease inhibitor; serpin. • SOURCE Bos taurus. • ORGANISM Bos taurus • Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; • Artiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos.
  • 63. Genbank References • LOCUS BTA1AT 1380 bp mRNA MAM 30-APR-1992 • ... • REFERENCE 1 (bases 1 to 1380) • AUTHORS Sinha,D. • TITLE Direct Submission • JOURNAL Submitted (22-OCT-1991) D. Sinha, Dept of Biochemistry, Temple University, 3400 North Broad Street, Philadelphia, PA 19140, USA • REFERENCE 2 (bases 1 to 1380) • AUTHORS Sinha,D., Bakhshi,M.R. and Kirby,E.P. • TITLE Complete cDNA sequence of bovine alpha 1-antitrypsin • JOURNAL Biochim. Biophys. Acta 1130 (2), 209-212 (1992) • MEDLINE 92223096 • FEATURES Location/Qualifiers •
  • 64. Genbank Source Qualifier • LOCUS BTA1AT 1380 bp mRNA MAM 30-APR-1992 • ... • FEATURES Location/Qualifiers • source 1..1380 • /organism="Bos taurus" • /db_xref="taxon:9913" • /tissue_type="liver" • /cell_type="hepatocyte" • /clone_lib="lambda gt11" • /clone="2f-Ic" • mRNA <1..>1380 • sig_peptide 33..104 • ...
  • 65. Genbank mRNA+CDS features • mRNA <1..>1380 • sig_peptide 33..104 • CDS 33..1283 • /codon_start=1 • /product="alpha-1-antitrypsin" • /protein_id="CAA44840.1" • /db_xref="PID:g42" • /db_xref="GI:42" • /db_xref="SWISS-PROT:P34955" • / translation="MALSITRGLLLLAALCCLAPISLAGVLQGHAVQETDDTSHQEAACH KIAPNLANFAFSIYHHLAHQSNTSNIFFSPVSIASAFAMLSLGAKGNTHTEILKGL GFNLTELAEAEIHKGFQHLLHTLNQPNHQLQLTTGNGLFINESAKLVDTFLEDV KNLYHSEAFSINFRDAEEAKKKINDYVEKGSHGKIVELVKVLDPNTVFALVNYI SFKGKWEKPFEMKHTTERDFHVDEQTTVKVPMMNRLGMFDLHYCDKLASWV LLLDYVGNVTACFILPDLGKLQQLEDKLNNELLAKFLEKKYASSANLHLPKLSI SETYDLKSVLGDVGITEVFSDRADLSGITKEQPLKVSKALHKAALTIDEKGTEA VGSTFLEAIPMSLPPDVEFNRPFLCILYDRNTKSPLFVGKVVNPTQA" • mat_peptide 105..1280 • /product="alpha-1-antitrypsin" • polyA_signal 1343..1348
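The feature locations in this record can be checked against each other with a little arithmetic. The sketch below handles only simple start..end locations (optionally marked partial with < or >), not joins or complements, and the function name is hypothetical. The CDS spans 1251 bases, i.e. 417 codons including the stop, so the protein has 416 residues; the signal peptide (24 residues) and the mature peptide (392 residues) add up to the same number.

```python
import re

# Simple locations only: "33..1283" or partial "<1..>1380".
_LOCATION = re.compile(r"^<?(\d+)\.\.>?(\d+)$")


def feature_length(location):
    """Length in bases of a simple Genbank location string."""
    match = _LOCATION.match(location)
    if match is None:
        raise ValueError("unsupported location: " + location)
    start, end = int(match.group(1)), int(match.group(2))
    return end - start + 1


if __name__ == "__main__":
    cds = feature_length("33..1283")
    print("CDS bases:", cds)
    print("protein residues (stop codon excluded):", cds // 3 - 1)
    print("signal peptide residues:", feature_length("33..104") // 3)
    print("mature peptide residues:", feature_length("105..1280") // 3)
```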
  • 66. ... Genbank Sequence format • BASE COUNT 357 a 413 c 322 g 288 t • ORIGIN • 1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc • 61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac • 121 acgctgtcca agagacagat gatacatccc accaggaagc agcgtgccac aagattgccc • 181 ccaacctggc caactttgcc ttcagcatat accaccattt ggctcatcag tccaacacca • 241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt tgcgatgctc tccctgggag • 301 ccaagggcaa cactcacact gagatcctga agggcctggg tttcaacctc actgagctcg • 361 cagaggctga gatccacaaa ggctttcagc atcttctcca caccctgaac cagccaaacc • ... • 1321 gtccccccac tccctccatg gcattaaagg atgactgacc tagccccgaa aaaaaaaaaa • //
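Turning the ORIGIN block back into a plain sequence only requires dropping the position numbers and the spaces. The sketch below (hypothetical function names) also recounts the bases; on the full record the counts should match the BASE COUNT line, whose four numbers sum to the 1380 bp given on the LOCUS line.

```python
from collections import Counter


def parse_origin(lines):
    """Join the sequence chunks of ORIGIN lines, skipping anything else."""
    chunks = []
    for line in lines:
        parts = line.split()
        # Sequence lines start with the position of their first base.
        if not parts or not parts[0].isdigit():
            continue
        chunks.extend(parts[1:])
    return "".join(chunks)


def base_count(seq):
    """Count each letter of the sequence."""
    return dict(Counter(seq))


if __name__ == "__main__":
    first_lines = [
        "ORIGIN",
        "        1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc",
        "       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac",
    ]
    seq = parse_origin(first_lines)
    print(len(seq), base_count(seq))
```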
  • 67. EMBL DATA FORMAT • Embl: http://www.ebi.ac.uk/Databases/ • http://www.ebi.ac.uk/cgi-bin/emblfetch • Use Accession X63129
  • 68. DDBJ DATA FORMAT • DDBJ: http://www.ddbj.nig.ac.jp/ • http://ftp2.ddbj.nig.ac.jp:8000/getstart-e.html • Use Accession X63129 • Flat file format same as NCBI/Genbank format.
  • 69. Entrez • Index based search system. Each field in the database is searchable individually or as an aggregate. – (e.g. CDS [FKEY]) – default is aggregate [ALL FIELDS] * • All primary databases are interlinked as one big relational database. – (e.g. Pubmed links in Genbank records) • Phrase matching. – Human genome -> “human genome”
  • 70. Entrez • Available neighbours (related documents or related sequences) • In Pubmed searches: Term mapping to neighbouring documents and neighbouring terms. • Term mapping to chemical names. – In pubmed: term [All Fields] is term mapped to chemical names + MeSH terms + Text Fields. – .. Unless “term” is within double quotes.
  • 71. Entrez • http://www.ncbi.nlm.nih.gov/Entrez/ • Tutorials: • http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html • http://www.ncbi.nlm.nih.gov/Literature/pubmed_search. • http://www.ncbi.nlm.nih.gov/Database.tut1.html
  • 72. SWISSPROT http://www.expasy.ch/sprot/sprot_details.html 1. Core data: protein sequence data; the citation information and the taxonomic data 2. Annotation • Function(s) of the protein • Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc. • Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc. • Secondary structure • Quaternary structure. For example homodimer, heterotrimer, etc • Similarities to other proteins • Disease(s) associated with deficiencie(s) in the protein • Sequence conflicts, variants, etc.
  • 74. REBASE (Restriction enzymes dataBASE) Restriction enzymes have a pattern recognition sequence, and then within or a few bases away from that pattern is the actual cutting site http://rebase.neb.com/rebase/rebase.html I prefer the bairoch format (SWISSPROT format) http://rebase.neb.com/rebase/rebase.f19.html ID enzyme name ET enzyme type OS microorganism name PT prototype RS recognition sequence, cut site MS methylation site (type) CR commercial sources for the restriction enzyme CM commercial sources for the methylase RN [count] RA authors RL jour, vol, pages, year, etc.
  • 75. Exercises • You can work in teams for this. • 1a) Use the first 6000 bases of your genomic piece [or find a bacterial genomic or mRNA sequence in Entrez with length between 2000:10000]. • b) Use the ORF finder to find the gene(s). Compare the answer you get to the annotation you can infer from using blastn against genbank and from using blastx against a protein database. • Do the Entrez exercises (separate Word document).