GenBank (Genetic Sequence Databank)
Definition: GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of known
genetic sequences.
 It has a flat file structure that is an ASCII text file, readable & downloadable by both humans and
computers.
 It is maintained by the National Center for Biotechnology Information (NCBI).
 Entry data contains information on:
1.The sequence;
2.Accession numbers;
3.The scientific and gene names;
4.Taxonomy/phylogenetic classification of the source organism;
5.A feature that identifies coding regions;
6.References to published literature;
7.Transcription units; and
8.Mutation sites.
The traditional GenBank divisions held approximately 286,730,369,256 bases as of 2011.
GenBank flat file Format
1. The LOCUS field: It consists of five different subfields, namely:
 1a Locus Name (e.g. HSHFE) - It is a tag for grouping similar sequences.
 The first two or three letters usually designate the organism.
 In this case HS stands for Homo sapiens. The last several characters are associated with another
group designation, such as gene product. In this example, the last three characters represent the gene
symbol, HFE.
 1b Sequence Length (12146 bp) – It is the total number of nucleotide base pairs (or amino acid
residues) in the sequence record.
 1c Molecule Type (e.g. DNA)- Type of molecule that was sequenced. All sequence data in an entry
must be of the same type.
 1d GenBank Division (PRI) - GenBank has different divisions.
 In this example, PRI stands for primate sequences.
 Other divisions include ROD (rodent sequences), MAM (other mammal sequences), PLN (plant,
fungal, and algal sequences), & BCT (bacterial sequences).
 1e Modification Date (23-July-1999) - Date of most recent modification made to the record. The
date of first public release is not available in the sequence record. This information can be obtained
only by contacting NCBI at info@ncbi.nlm.nih.gov.
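The five LOCUS subfields can be pulled apart by splitting the line on whitespace. The sketch below is a minimal illustration, not an official parser: the example line is assembled from the values quoted above, and real LOCUS lines may carry extra tokens (for example topology or strandedness) that this sketch does not handle.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length: int
    unit: str
    molecule_type: str
    division: str
    modification_date: str


def parse_locus(line: str) -> Locus:
    """Split a whitespace-separated LOCUS line into its subfields."""
    fields = line.split()
    # Assumption: exactly seven tokens (keyword, name, length, unit,
    # molecule type, division, date). Anything else is rejected.
    if len(fields) != 7 or fields[0] != "LOCUS":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    _, name, length, unit, molecule_type, division, date = fields
    return Locus(name, int(length), unit, molecule_type, division, date)


# Hypothetical line built from the HSHFE example values.
example = "LOCUS       HSHFE      12146 bp    DNA    PRI    23-JUL-1999"
locus = parse_locus(example)
print(locus.name, locus.length, locus.molecule_type, locus.division)
```

A named tuple keeps the subfields readable (locus.division rather than fields[5]) without adding any dependency.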
2. DEFINITION – A brief description of the sequence.
 The description may include source organism name, gene or protein name, or designation as
untranscribed or untranslated sequences (e.g., a promoter region).
 For sequences containing a coding region (CDS), the definition field may also contain a
“completeness” qualifier such as "complete CDS" or "exon 1."
3. ACCESSION (Z92910) – A unique identifier assigned to a complete sequence record.
 This number never changes, even if the record is modified.
 An “accession number” is a combination of letters and numbers that are usually in the format of
one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g.,
AC123456).
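The two accession layouts described above (one letter plus five digits, or two letters plus six digits) can be checked with a regular expression. This is a sketch of those two layouts only; the function name is hypothetical, and other accession formats that exist in practice are deliberately not covered.

```python
import re

# One uppercase letter + five digits, or two uppercase letters + six digits.
ACCESSION_PATTERN = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def looks_like_accession(text: str) -> bool:
    """Return True if text matches one of the two layouts described above."""
    return ACCESSION_PATTERN.fullmatch(text) is not None


for candidate in ("M12345", "AC123456", "Z92910", "M1234"):
    print(candidate, looks_like_accession(candidate))
```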
4. VERSION (Z92910.1) – It is an identification number assigned to a single, specific sequence in
the database.
 This number is in the format “accession.version.”
 If any changes are made to the sequence data, the version part of the number will increase by one.
 E.g. U12345.1 becomes U12345.2.
 A version number of Z92910.1 for this HFE sequence indicates that the sequence data have not been
altered since the original submission.
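The "accession.version" layout can be split at the last dot, which makes it easy to tell whether one identifier is a later revision of another. A minimal sketch, with hypothetical function names:

```python
from typing import Tuple


def split_version(identifier: str) -> Tuple[str, int]:
    """Split 'accession.version' into (accession, version number)."""
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def is_revision_of(newer: str, older: str) -> bool:
    """True if both share an accession and 'newer' has a higher version."""
    new_accession, new_version = split_version(newer)
    old_accession, old_version = split_version(older)
    return new_accession == old_accession and new_version > old_version


print(split_version("Z92910.1"))
print(is_revision_of("U12345.2", "U12345.1"))
```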
5. GenInfo Identifier (GI) (1890179) - Also a sequence identification number.
 Whenever a sequence is changed, the version number is increased and a new GI is assigned.
 If a nucleotide sequence record contains a protein translation of the sequence, the translation will
have its own GI number.
6. KEYWORDS (haemochromatosis; HFE gene) – A “keyword” can be “any word or phrase used
to describe the sequence”.
7. SOURCE (human) - Usually contains an abbreviated or common name of the source organism.
8. ORGANISM (Homo sapiens) - The scientific name (usually genus & species) & phylogenetic
lineage. Refer to the NCBI Taxonomy Homepage for more information about the classification
scheme used to construct taxonomic lineages.
9. REFERENCE – It is a citation of publications by sequence authors that supports information
presented in the sequence record.
 Several references may be included in one record.
 References are automatically sorted from the oldest to the newest.
 Cited publications are searchable by author, article or publication title, journal title, or MEDLINE
unique identifier (UID).
 The UID links the sequence record to the MEDLINE record.
 When the REFERENCE TITLE contains the words "Direct Submission", contact information for
the submitter(s) is provided.
10. The FEATURES Table
11. BASE COUNT - Gives the total number of adenine (A), cytosine (C), guanine (G), and thymine
(T) bases in the sequence.
12. ORIGIN - Contains the sequence data, which begins on the line immediately below the field
title.
//
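The base count can be recomputed from the sequence lines themselves. The sketch below assumes each sequence line mixes a position number, spaces, and lowercase bases, so it keeps only the letters before counting. The helper names and the short test sequence are illustrative, not taken from a real record.

```python
from collections import Counter


def clean_sequence_line(line: str) -> str:
    """Drop position numbers and spaces, keeping only the base letters."""
    return "".join(ch for ch in line if ch.isalpha())


def base_count(sequence: str) -> dict:
    """Count a, c, g and t in a sequence (case-insensitive)."""
    counts = Counter(sequence.lower())
    return {base: counts.get(base, 0) for base in "acgt"}


sequence = clean_sequence_line("        1 gatt aca")
print(sequence)
print(base_count(sequence))
```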
EMBL
 The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the
European Bioinformatics Institute (EBI).
 It incorporates and distributes nucleotide sequences from public sources.
 The database is a part of an international collaboration with DDBJ (Japan) and GenBank (USA).
 Data are exchanged between the collaborating databases on a daily basis.
 The web-based tool, Webin, is the preferred system for individual submission of nucleotide
sequences, including Third Party Annotation (TPA) and alignment data.
 Automatic submission procedures are used for submission of data from large-scale genome
sequencing.
 The latest data collection can be accessed via FTP, email and WWW interfaces.
 The EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein
databases as well as many other specialist molecular biology databases.
 For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are available that
allow external users to compare their own sequences against the data in the EMBL Nucleotide
Sequence Database, the complete genomic component subsection of the database, the WGS data
sets and other databases.
 All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk.
EMBL format
ID LISOD standard; DNA; PRO; 756 BP.
XX
AC X64011; S78972;
XX
SV X64011.1
XX
DT 28-APR-1992 (Rel. 31, Created)
DT 30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE L.ivanovii sod gene for superoxide dismutase
XX
KW sod gene; superoxide dismutase.
XX
OS Listeria ivanovii
OC Bacteria; Firmicutes; Bacillus/Clostridium group;
OC Bacillus/Staphylococcus group; Listeria.
XX
RN [1]
RX MEDLINE; 92140371.
RA Haas A., Goebel W.;
RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT functional complementation in Escherichia coli and characterization of the
RT gene product.";
RL Mol. Gen. Genet. 231:313-322(1992).
XX
RN [2]
RP 1-756
RA Kreft J.;
RT ;
RL Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL Hubland, 8700 Wuerzburg, FRG
XX
DR SWISS-PROT; P28763; SODM_LISIV.
XX
FH Key Location/Qualifiers
FH
FT source 1..756
FT /db_xref="taxon:1638"
FT /organism="Listeria ivanovii"
FT /strain="ATCC 19119"
FT RBS 95..100
FT /gene="sod"
FT terminator 723..746
FT /gene="sod"
FT CDS 109..717
FT /db_xref="SWISS-PROT:P28763"
FT /transl_table=11
FT /gene="sod"
FT /EC_number="1.15.1.1"
FT /product="superoxide dismutase"
FT /protein_id="CAA45406.1"
FT /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat 60
gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa 120
ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg 180
gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca 240
ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt 300
cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta 360
ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca 420
atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg 480
gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt 540
tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat 600
gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca 660
ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta 720
tcgaaaggct cacttaggtg ggtcttttta tttcta 756
//
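The SQ line of the entry above states the sequence length and the count of each base, so the two can be checked against each other. The sketch below parses that one line with regular expressions. The function name is hypothetical and the patterns are written only for the layout shown here.

```python
import re

# The SQ line copied from the example entry above.
SQ_LINE = "SQ Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;"


def parse_sq_header(line: str):
    """Return (declared length, {label: count}) from an SQ header line."""
    length = int(re.search(r"(\d+)\s+BP;", line).group(1))
    counts = {
        label: int(number)
        for number, label in re.findall(r"(\d+)\s+(A|C|G|T|other);", line)
    }
    return length, counts


length, counts = parse_sq_header(SQ_LINE)
print(length, counts)
# The per-base counts should add up to the declared length.
print(sum(counts.values()) == length)
```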
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
- (blanks) sequence data.
// - Termination line.
Some entries do not contain all of the line types, and some line types occur many times in a single entry.
Each entry must begin with an identification line (ID) and end with a terminator line (//). In addition the
following line types are always present in an entry: AC (once), DT (3 times), DE (1 or more), OS (1 or
more), OC (1 or more), RN (1 or more), RP (1 or more), RA (1 or more), RL (1 or more), SQ (once), and
at least one sequence data line. The other line types (GN, OG, RC, RM, CC, DR, KW and FT) are optional.
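Because every line starts with a two-character code, an entry can be grouped by line type and checked against the list of line types described above as always present. This is a minimal sketch: the excerpt is a shortened copy of the example entry (so several required line types are missing on purpose), the function names are hypothetical, and sequence data lines (which start with blanks) are not handled.

```python
from collections import defaultdict

# Shortened excerpt of the example entry above.
ENTRY = """\
ID LISOD standard; DNA; PRO; 756 BP.
XX
AC X64011; S78972;
XX
DE L.ivanovii sod gene for superoxide dismutase
XX
OS Listeria ivanovii
OC Bacteria; Firmicutes; Bacillus/Clostridium group;
OC Bacillus/Staphylococcus group; Listeria.
XX
SQ Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
//
"""

# Line types the text above lists as always present.
REQUIRED = ("ID", "AC", "DT", "DE", "OS", "OC", "RN", "RP", "RA", "RL", "SQ", "//")


def group_by_line_type(entry: str) -> dict:
    """Map each line-type code to the list of its line contents."""
    grouped = defaultdict(list)
    for line in entry.splitlines():
        code, _, content = line.partition(" ")
        grouped[code].append(content.strip())
    return dict(grouped)


def missing_line_types(grouped: dict) -> list:
    """List the required codes that do not occur in the entry."""
    return [code for code in REQUIRED if code not in grouped]


grouped = group_by_line_type(ENTRY)
print(grouped["AC"])
print(missing_line_types(grouped))
```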
GenBank:
 Locus name helps to group entries with similar sequences. The first three characters denote the organism, the
fourth and fifth characters give other group designations, such as gene product, and the last character is one of a
series of sequential integers.
 Sequence Length contains the number of nucleotide base pairs (or amino acid residues) in the sequence
record.
 Molecule Type shows the type of sequenced molecule.
 GenBank Division shows the GenBank division to which a record belongs and is indicated by a three-letter
abbreviation.
1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTG sequences (high-throughput genomic seq)
17. HTC - unfinished high-throughput cDNA sequencing
18. ENV - environmental sampling sequences
 Modification Date shows the last date of modification.
 Definition is a brief description of sequence that includes information such as source organism, gene
name/protein name, or some description of the sequence's function.
 Accession number indicates the unique identifier for a sequence record.
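 Records from the RefSeq database use a different accession format: two letters, an underscore, and six or
more digits, for example:
NT_123456 - constructed genomic contigs
NM_123456 - mRNAs
NP_123456 - proteins
NC_123456 - chromosomes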
 Version shows a nucleotide sequence identification number that represents a single, specific sequence in
the GenBank database.
 GI "GenInfo Identifier" is a sequence identification number for the nucleotide sequence.
 Keywords are words or phrases that describe the sequence.
 Source indicates free-format information including an abbreviated form of the organism name, sometimes
followed by a molecule type.
 Organism describes the formal scientific name for the source organism and its lineage.
 Reference includes publications by the authors of the sequence that discuss the data reported in the record.
 Authors lists the authors in the order in which they appear in the cited article.
Entrez Search Field: Author [AUTH]
 Title represents the title of the published work or the tentative title of an unpublished work.
Entrez Search Field: Text Word [WORD]
 Journal: MEDLINE abbreviation of the journal name.
Entrez Search Field: Journal Name [JOUR]
 Pubmed: PubMed Identifier (PMID)
 Features shows information about genes and gene products, as well as regions of biological significance
reported in the sequence.
 Source is a mandatory feature in each record that summarizes the length of the sequence, the scientific name
of the source organism, and the Taxon ID number. It can also include other information such as map location,
strain, clone, tissue type, etc., if provided by the submitter.
 Taxon is a stable unique identification number for the taxon of the source organism.
 CDS (Coding sequence) represents the region of nucleotides that corresponds to the sequence of amino
acids in a protein.
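The division abbreviations and the RefSeq accession prefixes listed above are fixed code tables, so they fit naturally into lookup dictionaries. The sketch below copies both tables from the lists above. The function name is hypothetical.

```python
# GenBank division codes, as listed above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTG sequences (high-throughput genomic sequences)",
    "HTC": "unfinished high-throughput cDNA sequencing",
    "ENV": "environmental sampling sequences",
}

# RefSeq accession prefixes, as listed above.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contigs",
    "NM": "mRNAs",
    "NP": "proteins",
    "NC": "chromosomes",
}


def describe_refseq(accession: str):
    """Return the record type for a RefSeq-style accession, else None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


print(DIVISIONS["PRI"])
print(describe_refseq("NM_123456"))
print(describe_refseq("Z92910"))
```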
Protein sequence databases
Introduction:
The Protein database is a collection of sequences from several sources, including translations from
annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF,
and PDB. Protein sequences are the fundamental determinants of biological structure and function.
SWISS-PROT
– Manually curated
– high-quality annotations, less data
GenPept/TREMBL
– Translated coding sequences from GenBank/EMBL
– Few annotations, more up to date
PIR
– Phylogenetic-based annotations
All three have now combined their efforts to form UniProt (http://www.uniprot.org)
PDB (Protein Databank)
 Stores 3-dimensional atomic coordinates for biological molecules including protein and nucleic
acids
 Data are obtained by X-ray crystallography, NMR, or computer modelling (http://www.rcsb.org/pdb/)
MMDB (Molecular Modelling database)
Over 28,000 3D macromolecular structures, including proteins and polynucleotides
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
SCOP (Structural Classification of Proteins)
Classification of proteins according to structural and evolutionary relationships
SWISS-PROT
Introduction:
SWISS-PROT is an annotated protein sequence database, which was created at the Department of
Medical Biochemistry of the University of Geneva and has been a collaborative effort of the Department
and the European Molecular Biology Laboratory (EMBL), since 1987. SWISS-PROT is now an equal
partnership between the EMBL and the Swiss Institute of Bioinformatics (SIB). The EMBL activities are
carried out by its Hinxton Outstation, the European Bioinformatics Institute (EBI). The SWISS-PROT
protein sequence database consists of sequence entries. Sequence entries are composed of different line
types, each with their own format.
The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct
criteria:
(i) annotations
(ii) minimal redundancy and
(iii) integration with other databases.
Annotations
CORE DATA
• The sequence data
• The citation information (bibliographical references)
• The taxonomic data (description of the biological source of the protein)
Annotation- Additional Data
• Descriptions include:
• Function(s) of the protein
• Posttranslational modification(s) such as carbohydrates, phosphorylation, acetylation and GPI-
anchor
• Domains and sites, for example, calcium-binding regions, ATP-binding sites, zinc fingers,
homeoboxes, and SH2 and SH3 domains
• Secondary structure, e.g. alpha helix, beta sheet
• Quaternary structure, e.g. homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with any number of deficiencies in the protein
• Sequence conflicts, variants, etc.
Minimal Redundancy
• Much of the data comes from more than one literature report
• Data are condensed and merged to be more concise and coherent
• Conflicts in data are listed for each entry
Integration with other databases
• 50+ databases for cross-reference
• Nucleic acid sequences, protein tertiary structure, protein 3-D models, etc.
• Allows SWISS-PROT to play a major role as the focal point for biomolecular interconnectivity
Documentation
• All files documented and indexed
• Documentation kept up-to-date
Applications for the Knowledgebase
• Provides highly organized data and information on a wide variety of proteins
• Can be used as a starting point for protein research
• Allows searches to be conducted starting with various search strings
• Biochemical encyclopedia
SWISS-PROT Flat File format
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
- (blanks) sequence data.
// - Termination line.
Data retrieval tools
Data retrieval tools give molecular biologists access to the information held in these databases.
The most widely used are:
1. Entrez
2. DBGET
3. SRS
Each of these allows:
- Text-based searching of a number of linked databases (DBs).
- Sequence searching.
They differ in:
- The DBs they cover
- How the retrieved information is accessed and presented.
Entrez
- WWW-based data retrieval system.
- Developed by NCBI (National Center for Biotechnology Information).
- Integrates information held in different DBs.
Databases covered by Entrez are:
 Nucleic acid - GenBank, RefSeq, PDB.
 Protein seqs - SWISS-PROT, PIR.
 3D structures – MMDB
 Genomes – Many sources
 PopSet – From GenBank
 OMIM – OMIM
 Taxonomy – NCBI taxonomy database
 Books- Bookshelf
 ProbeSet – GEO (Gene Expression Omnibus)
 Literature - PubMed
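Entrez can also be queried programmatically through NCBI's E-utilities. The sketch below only builds an ESearch request URL and does not contact the server. The database name "nucleotide", the retmax value, and the search term are assumptions chosen for illustration, and the [AUTH] tag is the Entrez search field mentioned earlier for authors.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch query URL."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


# Illustrative query: records with "Haas A" as an author.
url = build_esearch_url("nucleotide", "Haas A[AUTH]")
print(url)
```

Building the URL separately from sending it keeps the example runnable offline. The request itself could then be made with any HTTP client.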
SRS
SRS is the Sequence Retrieval System
- Data retrieval tool developed by EBI
- Integrates 80 molecular biology DBs
- An Open source software (Can be installed locally)
SRS has an associated scripting language called Icarus
Central resource for molecular biology data
- More than 250 databanks have been indexed, and more than 35 SRS servers are available worldwide over the WWW
Data analysis applications server
- 11 protein applications
- 6 nucleic acid applications
- Uniform query interface on the web
History of SRS
1990
– Development started at EMBL, Heidelberg; main author Dr. Thure Etzold
1997
– Moved to EBI in Cambridge. Development work was supported by various grants, among
others from EMBnet.
1998
– Etzold and his group joined LION Bioscience
Information retrieval
– Easy way to retrieve information from sequence and sequence-related databases
– Possibility to search for multiple words/other criteria
Linkage between different databases
– E.g. Find all primary structures with known three-dimensional structure.
Different types of database in SRS
Sequence & structure
– DNA, protein, three-dimensional structures
Sequence-related
Gene-related
– Genome, mapping, mutations, transcription factors
– SNP
Bibliographic
– Medline, enzyme
User-defined
SRS main toolbar tabs:
Top Page: displays databases in different database groups
Query: displays either the standard or extended query form
Results or “the query manager”: maintains a history of all the results obtained during a session
Projects or “the project manager”: maintains a history of all queries and views used during a
session
Views: allows a user to define a user specific view for one or more databases
Databanks: contains a list and some facts about the databases available in the system
Search terms in SRS
SRS indexed fields can be searched using any of the following:
– Single word search
– Multiple word phrases
– Numbers and dates
– Regular expressions
– Wildcards
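The difference between a wildcard search and a regular-expression search can be illustrated with Python's own fnmatch and re modules. This is a sketch of the matching idea only: it does not use SRS query syntax, and the list of indexed terms is made up from words that appear elsewhere in these notes.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical indexed terms, for illustration only.
TERMS = ["superoxide", "dismutase", "succinate", "dehydrogenase", "SODM_LISIV"]


def wildcard_search(pattern: str, terms: list) -> list:
    """Case-sensitive shell-style wildcards: * matches any run of characters."""
    return [term for term in terms if fnmatchcase(term, pattern)]


def regex_search(pattern: str, terms: list) -> list:
    """Regular-expression search anywhere inside each term."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.search(term)]


print(wildcard_search("s*", TERMS))
print(regex_search(r"ase$", TERMS))
```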
LocusLink
Introduction:
LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) is a National Center for Biotechnology
Information (NCBI) online resource. It is principally intended for use by graduate students and
professional researchers in the biomedical sciences. It is designed to bring together related information on
genetic loci and gene products from several sources. LocusLink provides a central point of access for basic
biomedical information and molecular data for genes, transcripts, and proteins from model organisms,
currently including human, rat, mouse, fruit fly, and zebrafish.
How LocusLink relates to PubMed, RefSeq, and other NCBI databases
NCBI has a large and growing number of search tools for biologists to obtain information. A few
of these include:
PubMed: a searchable biomedical literature citation index. For a given genetic locus, LocusLink
leads directly to a short list of PubMed citations for that gene. (This list usually includes reports pertaining
to central genetic or molecular biological discoveries, and to reports on disease-causing alleles, for the gene
in question.)
RefSeq: Another new NCBI database, RefSeq (Reference Sequence) entries are intended to serve
as "authority files" for genetic sequence information. For a given genetic open reading frame, RefSeq
provides a curated file on the gene sequence and its transcriptional and translational processing (where
available). A professional review process helps to ensure the biological accuracy of these authority files.
RefSeq files are accessible directly from the LocusLink entry for the genetic locus in question.
OMIM (Online Mendelian Inheritance in Man): a database of human genes and genetic diseases,
including knowledge of their molecular and physiological roles and causes. The writeups for genetic loci
and their roles in physiology are often extensive and are frequently updated. OMIM files are accessible
directly from the LocusLink entry for the genetic locus in question.
GenBank, Protein Database, Homologene, UniGene, genetic variations database (single
nucleotide polymorphisms): links to gene-specific information from each of these databases are directly
available from the LocusLink entry for the genetic locus in question.
Steps involved in the usage of LocusLink
 Go to the LocusLink home page: http://www.ncbi.nlm.nih.gov/LocusLink.
 Although an alphabetical list of entries is available, LocusLink can be most easily searched using
the query box at the top of the page.
 Users can enter a wide variety of terms, for example: gene name or gene symbol (e.g., SDHA),
protein name (succinate dehydrogenase flavoprotein), protein symbol (SDH), EC (Enzyme
Commission) number (1.3.5.1), and disease states (Leigh syndrome).
 Type your search query into the "Query:" box, then press "go".
 If multiple terms are entered (e.g., succinate dehydrogenase), the search engine automatically
searches for files containing both words (succinate and dehydrogenase). Searches can
also be constructed using the terms AND (to find files containing both search terms), OR (to find
files containing either term), and NOT (to find files containing the first but not the second term).
 On the results page, first note that the number of entries returned is given. If you get no results,
refer to the "help" section, linked in the left-hand bar on the page.
 "Description" is a brief explanation of the function of the locus.
 The "Position" column gives the chromosomal map location of the genetic locus. Clicking the blue
entry links to a visual chromosomal map with the gene marked on it.
 The rainbow-colored "Links" column gives links to several other NCBI databases:
“P” PubMed
“O” Online Mendelian Inheritance in Man (OMIM)
“R” RefSeq database
“G” GenBank database
“P” Protein database
“H” Homologene database
“U” Unigene database
“V” Variation data: single nucleotide polymorphism (SNP) database
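The AND, OR and NOT searches described in the steps above correspond to set intersection, union and difference. The sketch below shows that correspondence on three made-up records. It is not how LocusLink itself is implemented, and the record names and texts are hypothetical.

```python
# Hypothetical records: identifier -> descriptive text.
RECORDS = {
    "record-1": "succinate dehydrogenase flavoprotein",
    "record-2": "succinate receptor",
    "record-3": "lactate dehydrogenase",
}


def matching(word: str) -> set:
    """Identifiers of the records whose text contains the given word."""
    return {
        identifier
        for identifier, text in RECORDS.items()
        if word.lower() in text.lower().split()
    }


both = matching("succinate") & matching("dehydrogenase")    # AND
either = matching("succinate") | matching("dehydrogenase")  # OR
first_only = matching("succinate") - matching("dehydrogenase")  # NOT

print(sorted(both))
print(sorted(either))
print(sorted(first_only))
```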

 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 

Gen bank

  • 1. GenBank (Genetic Sequence Databank)
    Definition: GenBank (Genetic Sequence Databank) is one of the fastest-growing repositories of known genetic sequences.
    - It has a flat-file structure: each entry is an ASCII text file, readable and downloadable by both humans and computers.
    - It is maintained by the National Center for Biotechnology Information (NCBI).
    - Entry data contains information on:
      1. The sequence;
      2. Accession numbers;
      3. The scientific and gene names;
      4. Taxonomy/phylogenetic classification of the source organism;
      5. Features that identify coding regions;
      6. References to published literature;
      7. Transcription units; and
      8. Mutation sites.
    - There are approximately 286,730,369,256 sequence records in the traditional GenBank divisions as of 2011.
  • 2. GenBank flat file Format
    1. The LOCUS field: It consists of five subfields, namely:
    - 1a Locus Name (e.g. HSHFE) - A tag for grouping similar sequences.
      - The first two or three letters usually designate the organism. In this case, HS stands for Homo sapiens.
      - The last several characters are associated with another group designation, such as the gene product. In this example, the last three characters represent the gene symbol, HFE.
    - 1b Sequence Length (12146 bp) - The total number of nucleotide base pairs (or amino acid residues) in the sequence record.
    - 1c Molecule Type (e.g. DNA) - The type of molecule that was sequenced. All sequence data in an entry must be of the same type.
    - 1d GenBank Division (PRI) - GenBank has different divisions. In this example, PRI stands for primate sequences.
  • 3.
    - Other divisions include ROD (rodent sequences), MAM (other mammal sequences), PLN (plant, fungal, and algal sequences), and BCT (bacterial sequences).
    - 1e Modification Date (23-July-1999) - Date of the most recent modification made to the record. The date of first public release is not available in the sequence record; this information can be obtained only by contacting NCBI at info@ncbi.nlm.nih.gov.
    2. DEFINITION - A brief description of the sequence.
    - The description may include the source organism name, the gene or protein name, or a designation as untranscribed or untranslated sequence (e.g., a promoter region).
    - For sequences containing a coding region (CDS), the definition field may also contain a "completeness" qualifier such as "complete CDS" or "exon 1."
    3. ACCESSION (Z92910) - A unique identifier assigned to a complete sequence record.
    - This number never changes, even if the record is modified.
    - An accession number is a combination of letters and numbers, usually one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456).
    4. VERSION (Z92910.1) - An identification number assigned to a single, specific sequence in the database.
    - This number is in the format "accession.version".
    - If any changes are made to the sequence data, the version part of the number increases by one, e.g. U12345.1 becomes U12345.2.
    - A version number of Z92910.1 for this HFE sequence indicates that the sequence data has not been altered since the original submission.
    5. GenInfo Identifier (GI) (1890179) - Also a sequence identification number.
    - Whenever a sequence is changed, the version number is increased and a new GI is assigned.
    - If a nucleotide sequence record contains a protein translation of the sequence, the translation has its own GI number.
    6. KEYWORDS (haemochromatosis; HFE gene) - A keyword can be any word or phrase used to describe the sequence.
  • 4.
    7. SOURCE (human) - Usually contains an abbreviated or common name of the source organism.
    8. ORGANISM (Homo sapiens) - The scientific name (usually genus and species) and phylogenetic lineage. Refer to the NCBI Taxonomy Homepage for more information about the classification scheme used to construct taxonomic lineages.
    9. REFERENCE - A citation of publications by the sequence authors that support the information presented in the sequence record.
    - Several references may be included in one record.
    - References are automatically sorted from the oldest to the newest.
    - Cited publications are searchable by author, article or publication title, journal title, or MEDLINE unique identifier (UID).
    - The UID links the sequence record to the MEDLINE record.
    - When the REFERENCE TITLE contains the words "Direct Submission", contact information for the submitter(s) is provided.
    10. The FEATURES Table:
  • 5.
    11. BASE COUNT - Gives the total number of adenine (A), cytosine (C), guanine (G), and thymine (T) bases in the sequence.
    12. ORIGIN - Contains the sequence data, which begins on the line immediately below the field title.
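The LOCUS subfields and the "accession.version" format described above can be illustrated with a short Python sketch. The sample LOCUS line is reconstructed from the values quoted in the slides (HSHFE, 12146 bp, DNA, PRI, 23-JUL-1999). The function names and the simple whitespace split are assumptions made for illustration: real GenBank files use fixed column positions and may carry extra tokens, so this is not a full parser.

```python
import re

# Reconstructed from the slide values; spacing is illustrative only.
SAMPLE_LOCUS = "LOCUS       HSHFE       12146 bp    DNA     PRI       23-JUL-1999"

# The two accession shapes named in the slides: 1 letter + 5 digits, or 2 letters + 6 digits.
ACCESSION = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def parse_locus(line):
    """Split a simple LOCUS line into the five subfields (hypothetical helper)."""
    tokens = line.split()
    if len(tokens) != 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS layout: %r" % line)
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "molecule": tokens[4],
        "division": tokens[5],
        "modified": tokens[6],
    }


def split_version(version):
    """Split 'accession.version' into (accession, version number)."""
    accession, _, number = version.partition(".")
    if not ACCESSION.fullmatch(accession) or not number.isdigit():
        raise ValueError("not an accession.version string: %r" % version)
    return accession, int(number)


if __name__ == "__main__":
    print(parse_locus(SAMPLE_LOCUS))
    print(split_version("Z92910.1"))
```

Because the accession part never changes, two VERSION strings with the same accession but different version numbers refer to successive revisions of the same record.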
  • 6.
    //
    EMBL
    - The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), maintained at the European Bioinformatics Institute (EBI), incorporates and distributes nucleotide sequences from public sources.
    - The database is part of an international collaboration with DDBJ (Japan) and GenBank (USA).
    - Data are exchanged between the collaborating databases on a daily basis.
    - The web-based tool, Webin, is the preferred system for individual submission of nucleotide sequences, including Third Party Annotation (TPA) and alignment data.
    - Automatic submission procedures are used for submission of data from large-scale genome sequencing.
    - The latest data collection can be accessed via FTP, email and WWW interfaces.
    - The EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein databases as well as many other specialist molecular biology databases.
    - For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) allow external users to compare their own sequences against the data in the EMBL Nucleotide Sequence Database, the complete genomic component subsection of the database, the WGS data sets and other databases.
    - All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk.
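EMBL entries are distributed as flat files in which each line starts with a two-letter code (ID, AC, SQ and so on) and a `//` line ends the entry. The Python sketch below groups the lines of one entry by their code and reads the base counts from the SQ header. The sample is abbreviated from the L. ivanovii sod entry used in this deck. The function names are hypothetical, and the assumption that data starts at the sixth character of each line is a simplification for illustration.

```python
import re
from collections import defaultdict

# Abbreviated from the deck's example entry; most line types are omitted.
SAMPLE_ENTRY = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
OS   Listeria ivanovii
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
//
"""

SQ_HEADER = re.compile(
    r"Sequence (\d+) BP; (\d+) A; (\d+) C; (\d+) G; (\d+) T; (\d+) other;"
)


def group_by_line_code(entry_text):
    """Collect the data part of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        if code == "//":          # termination line: end of entry
            break
        if code in ("XX", "  ") or not line.strip():
            continue              # skip spacer lines and raw sequence lines
        fields[code].append(line[5:].strip())
    return dict(fields)


def parse_sq_header(header):
    """Read the length and per-base counts from an SQ header line."""
    match = SQ_HEADER.match(header)
    if match is None:
        raise ValueError("unrecognised SQ header: %r" % header)
    length, a, c, g, t, other = map(int, match.groups())
    return {"length": length, "A": a, "C": c, "G": g, "T": t, "other": other}


if __name__ == "__main__":
    fields = group_by_line_code(SAMPLE_ENTRY)
    sq = parse_sq_header(fields["SQ"][0])
    counted = sq["A"] + sq["C"] + sq["G"] + sq["T"] + sq["other"]
    print(fields["AC"], sq["length"], counted)
```

Comparing the declared length with the sum of the base counts is a cheap consistency check on a downloaded entry.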
  • 7. EMBL format
    ID   LISOD      standard; DNA; PRO; 756 BP.
    XX
    AC   X64011; S78972;
    XX
    SV   X64011.1
    XX
    DT   28-APR-1992 (Rel. 31, Created)
    DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
    XX
    DE   L.ivanovii sod gene for superoxide dismutase
    XX
    KW   sod gene; superoxide dismutase.
    XX
    OS   Listeria ivanovii
    OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
    OC   Bacillus/Staphylococcus group; Listeria.
    XX
    RN   [1]
    RX   MEDLINE; 92140371.
    RA   Haas A., Goebel W.;
    RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
    RT   functional complementation in Escherichia coli and characterization of the
    RT   gene product.";
    RL   Mol. Gen. Genet. 231:313-322(1992).
    XX
    RN   [2]
    RP   1-756
    RA   Kreft J.;
    RT   ;
    RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
    RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
    RL   Hubland, 8700 Wuerzburg, FRG
    XX
    DR   SWISS-PROT; P28763; SODM_LISIV.
    XX
    FH   Key             Location/Qualifiers
    FH
    FT   source          1..756
    FT                   /db_xref="taxon:1638"
    FT                   /organism="Listeria ivanovii"
    FT                   /strain="ATCC 19119"
    FT   RBS             95..100
    FT                   /gene="sod"
    FT   terminator      723..746
    FT                   /gene="sod"
    FT   CDS             109..717
    FT                   /db_xref="SWISS-PROT:P28763"
    FT                   /transl_table=11
    FT                   /gene="sod"
    FT                   /EC_number="1.15.1.1"
    FT                   /product="superoxide dismutase"
    FT                   /protein_id="CAA45406.1"
  • 8.
    FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
    FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
    FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
    FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
    XX
    SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
         cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
         gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
         ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
         gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
         ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
         cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta       360
         ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca       420
         atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg       480
         gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt       540
         tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat       600
         gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca       660
         ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta       720
         tcgaaaggct cacttaggtg ggtcttttta tttcta                                 756
    //

    Line codes:
    ID - Identification.
    AC - Accession number(s).
    DT - Date.
    DE - Description.
    GN - Gene name(s).
    OS - Organism species.
    OG - Organelle.
    OC - Organism classification.
    RN - Reference number.
    RP - Reference position.
    RC - Reference comments.
    RX - Reference cross-references.
    RA - Reference authors.
    RL - Reference location.
    CC - Comments or notes.
    DR - Database cross-references.
    KW - Keywords.
    FT - Feature table data.
    SQ - Sequence header.
    (blanks) - Sequence data.
  • 9.
    // - Termination line.

    Some entries do not contain all of the line types, and some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). In addition, the following line types are always present in an entry: AC (once), DT (3 times), DE (1 or more), OS (1 or more), OC (1 or more), RN (1 or more), RP (1 or more), RA (1 or more), RL (1 or more), SQ (once), and at least one sequence data line. The other line types (GN, OG, RC, RM, CC, DR, KW and FT) are optional.

    GenBank:
    - Locus name helps to group entries with similar sequences. The first three characters denote the organism, the fourth and fifth characters give other group designations, such as gene product, and the last characters are a series of sequential integers.
    - Sequence Length contains the number of nucleotide base pairs (or amino acid residues) in the sequence record.
    - Molecule Type shows the type of sequenced molecule.
    - GenBank Division shows the GenBank division to which a record belongs, indicated by a three-letter abbreviation:
      1. PRI - primate sequences
      2. ROD - rodent sequences
      3. MAM - other mammalian sequences
      4. VRT - other vertebrate sequences
      5. INV - invertebrate sequences
      6. PLN - plant, fungal, and algal sequences
      7. BCT - bacterial sequences
      8. VRL - viral sequences
      9. PHG - bacteriophage sequences
      10. SYN - synthetic sequences
      11. UNA - unannotated sequences
      12. EST - EST sequences (expressed sequence tags)
      13. PAT - patent sequences
      14. STS - STS sequences (sequence tagged sites)
      15. GSS - GSS sequences (genome survey sequences)
  • 10.
      16. HTG - HTG sequences (high-throughput genomic sequences)
      17. HTC - unfinished high-throughput cDNA sequencing
      18. ENV - environmental sampling sequences
    - Modification Date shows the date of the last modification.
    - Definition is a brief description of the sequence that includes information such as the source organism, the gene or protein name, or some description of the sequence's function.
    - Accession number is the unique identifier for a sequence record. Records from RefSeq use prefixed accession numbers:
      NT_123456 - constructed genomic contigs
      NM_123456 - mRNAs
      NP_123456 - proteins
      NC_123456 - chromosomes
    - Version shows a nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.
    - GI ("GenInfo Identifier") is a sequence identification number for the nucleotide sequence.
    - Keywords are words or phrases that describe the sequence.
    - Source contains free-format information including an abbreviated form of the organism name, sometimes followed by a molecule type.
    - Organism gives the formal scientific name of the source organism and its lineage.
    - Reference includes publications by the authors of the sequence that discuss the data reported in the record.
      - Authors: list of authors in the order in which they appear in the cited article. Entrez Search Field: Author [AUTH]
      - Title: the title of the published work, or the tentative title of an unpublished work. Entrez Search Field: Text Word [WORD]
      - Journal: MEDLINE abbreviation of the journal name. Entrez Search Field: Journal Name [JOUR]
      - Pubmed: PubMed Identifier (PMID)
    - Features shows information about genes and gene products, as well as regions of biological significance reported in the sequence.
  • 11.
    - Source is a mandatory feature in each record that summarizes the length of the sequence, the scientific name of the source organism, and the Taxon ID number. It can also include other information such as map location, strain, clone, tissue type, etc., if provided by the submitter.
    - Taxon is a stable unique identification number for the taxon of the source organism.
    - CDS (coding sequence) represents the region of nucleotides that corresponds to the sequence of amino acids in a protein.

    Protein sequence databases
    Introduction: The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function.
    - SWISS-PROT: manually curated; high-quality annotations, less data
    - GenPept/TREMBL: translated coding sequences from GenBank/EMBL; few annotations, more up to date
    - PIR: phylogenetic-based annotations
    All three are now combining efforts to form UniProt (http://www.uniprot.org).

    PDB (Protein Databank)
    - Stores 3-dimensional atomic coordinates for biological molecules, including proteins and nucleic acids
    - Data obtained by X-ray crystallography, NMR, or computer modelling
    - http://www.rcsb.org/pdb/

    MMDB (Molecular Modelling database)
    - Over 28,000 3D macromolecular structures, including proteins and polynucleotides (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)

    SCOP (Structural Classification of Proteins)
    - Classification of proteins according to structural and evolutionary relationships

    SWISS-PROT
    Introduction: SWISS-PROT is an annotated protein sequence database, which was created at the Department of Medical Biochemistry of the University of Geneva and has been a collaborative effort of the Department
  • 12. and the European Molecular Biology Laboratory (EMBL), since 1987. SWISS-PROT is now an equal partnership between the EMBL and the Swiss Institute of Bioinformatics (SIB). The EMBL activities are carried out by its Hinxton Outstation, the European Bioinformatics Institute (EBI).
    The SWISS-PROT protein sequence database consists of sequence entries. Sequence entries are composed of different line types, each with its own format. The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct criteria: (i) annotations, (ii) minimal redundancy and (iii) integration with other databases.

    Annotations
    Core data
    • The sequence data
    • The citation information (bibliographical references)
    • The taxonomic data (description of the biological source of the protein)
    Annotation - additional data
    • Descriptions include:
    • Function(s) of the protein
    • Posttranslational modification(s) such as carbohydrates, phosphorylation, acetylation and GPI-anchor
    • Domains and sites, for example, calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes, and SH2 and SH3 domains
    • Secondary structure, e.g. alpha helix, beta sheet
    • Quaternary structure, e.g. homodimer, heterotrimer, etc.
    • Similarities to other proteins
    • Disease(s) associated with any number of deficiencies in the protein
    • Sequence conflicts, variants, etc.

    Minimal Redundancy
    • Much of the data comes from more than one literature report
    • Data are condensed and merged to appear more concise and coherent
    • Conflicts in data are listed for each entry

    Integration with other databases
    • 50+ databases for cross-reference
    • Nucleic acid sequences, protein tertiary structure, protein 3-D models, etc.
  • 13.
    • Allows SWISS-PROT to play a major role as the focal point for biomolecular interconnectivity

    Documentation
    • All files documented and indexed
    • Documentation kept up to date

    Applications for the Knowledgebase
    • Provides highly organized data and information on a wide variety of proteins
    • Can be used as a starting point for protein research
    • Allows searches to be conducted starting with various search strings
    • Biochemical encyclopedia
  • 15.
    ID - Identification.
    AC - Accession number(s).
    DT - Date.
    DE - Description.
    GN - Gene name(s).
    OS - Organism species.
    OG - Organelle.
    OC - Organism classification.
    RN - Reference number.
    RP - Reference position.
    RC - Reference comments.
    RX - Reference cross-references.
    RA - Reference authors.
    RL - Reference location.
    CC - Comments or notes.
    DR - Database cross-references.
    KW - Keywords.
    FT - Feature table data.
    SQ - Sequence header.
    (blanks) - Sequence data.
    // - Termination line.

    Data retrieval tools
    These tools are dedicated to giving molecular biologists access to information. The most widely used are:
    1. Entrez
    2. DBGET
    3. SRS
    Each of these allows:
    - Text-based searching of a number of linked databases (DBs).
    - Sequence searching.
  • 16.
    They differ in:
    - The DBs they cover
    - How the retrieved information is accessed and presented.

    Entrez
    - WWW-based data retrieval system.
    - Developed by NCBI (National Center for Biotechnology Information).
    - Integrates information held in different DBs.
    Databases covered by Entrez are:
    - Nucleic acid - GenBank, RefSeq, PDB.
    - Protein sequences - SWISS-PROT, PIR.
    - 3D structures - MMDB
    - Genomes - Many sources
    - PopSet - From GenBank
    - OMIM - OMIM
    - Taxonomy - NCBI taxonomy database
    - Books - Bookshelf
    - ProbeSet - GEO (Gene Expression Omnibus)
    - Literature - PubMed
  • 17. SRS
    SRS is the Sequence Retrieval System.
  • 18.
    - Data retrieval tool developed by EBI
    - Integrates 80 molecular biology DBs
    - Open source software (can be installed locally)
    SRS has an associated scripting language called Icarus.

    Central resource for molecular biology data
    - More than 250 databanks have been indexed.
    - More than 35 SRS servers on the WWW (worldwide)
    Data analysis applications server
    - 11 protein applications
    - 6 nucleic acid applications
    - Uniform query interface on the web

    History of SRS
    - 1990 - Main author Dr. Thure Etzold; development started at EMBL, Heidelberg
    - 1997 - Moved to EBI in Cambridge. Development work was supported by various grants, amongst others from EMBnet.
    - 1998 - Etzold and his group join LION Bioscience

    Information retrieval
    - Easy way to retrieve information from sequence and sequence-related databases
    - Possibility to search for multiple words or other criteria
    Linkage between different databases
    - E.g. find all primary structures with a known three-dimensional structure.

    Different types of database in SRS
    - Sequence & structure - DNA, protein, three-dimensional structures
    - Sequence-related
    - Gene-related - genome, mapping, mutations, transcription factors, SNP
    - Bibliographic - Medline, enzyme
    - User-defined

    SRS main toolbar tabs:
  • 19.
    - Top Page: displays databases in different database groups
    - Query: displays either the standard or the extended query form
    - Results or "the query manager": maintains a history of all the results obtained during a session
    - Projects or "the project manager": maintains a history of all queries and views used during a session
    - Views: allows a user to define a user-specific view for one or more databases
    - Databanks: contains a list of, and some facts about, the databases available in the system

    Search terms in SRS
    SRS indexed fields can be searched using any of the following:
    - Single word search
    - Multiple word phrases
    - Numbers and dates
    - Regular expressions
    - Wildcards
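The difference between a wildcard search and a regular-expression search over indexed terms can be illustrated with a small Python sketch. This is not SRS query syntax: it only uses Python's standard `fnmatch` and `re` modules to show the two matching styles, and the term list and function names are hypothetical.

```python
import fnmatch
import re

# Hypothetical indexed terms, taken from words that appear in this deck.
INDEXED_TERMS = [
    "superoxide",
    "dismutase",
    "dehydrogenase",
    "succinate",
    "haemochromatosis",
]


def wildcard_search(pattern, terms):
    """Match terms against a shell-style wildcard pattern ('*' = any run of characters)."""
    return [term for term in terms if fnmatch.fnmatchcase(term, pattern)]


def regex_search(pattern, terms):
    """Match whole terms against a regular expression."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


if __name__ == "__main__":
    print(wildcard_search("*ase", INDEXED_TERMS))
    print(regex_search(r"s\w+e", INDEXED_TERMS))
```

Wildcards are enough for prefix or suffix searches, while regular expressions allow constraints on both ends of a term or on its internal structure.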
  • 20. LocusLink
    Introduction: LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) is a National Center for Biotechnology Information (NCBI) online resource. It is principally intended for use by graduate students and professional researchers in the biomedical sciences. It is designed to bring together related information on genetic loci and gene products from several sources. LocusLink provides a central point of access for basic biomedical information and molecular data for genes, transcripts, and proteins from model organisms, currently including human, rat, mouse, fruit fly, and zebrafish.

    How LocusLink relates to PubMed, RefSeq, and other NCBI databases
    NCBI has a large and growing number of search tools for biologists to obtain information. A few of these include:
    - PubMed: a searchable biomedical literature citation index. For a given genetic locus, LocusLink leads directly to a short list of PubMed citations for that gene. (This list usually includes reports on central genetic or molecular biological discoveries, and reports on disease-causing alleles, for the gene in question.)
    - RefSeq: another new NCBI database. RefSeq (Reference Sequence) entries are intended to serve as "authority files" for genetic sequence information. For a given genetic open reading frame, RefSeq provides a curated file on the gene sequence and its transcriptional and translational processing (where available). A professional review process helps to ensure the biological accuracy of these authority files. RefSeq files are accessible directly from the LocusLink entry for the genetic locus in question.
    - OMIM (Online Mendelian Inheritance in Man): a database of human genes and genetic diseases, including knowledge of their molecular and physiological roles and causes. The write-ups for genetic loci and their roles in physiology are often extensive and are frequently updated. OMIM files are accessible directly from the LocusLink entry for the genetic locus in question.
    - GenBank, Protein Database, Homologene, UniGene, genetic variations database (single nucleotide polymorphisms): links to gene-specific information from each of these databases are directly available from the LocusLink entry for the genetic locus in question.
  • 21. Steps involved in the usage of LocusLink
    - Go to the LocusLink home page: http://www.ncbi.nlm.nih.gov/LocusLink.
    - Although an alphabetical list of entries is available, LocusLink is most easily searched using the query box at the top of the page.
    - Users can enter a wide variety of terms, for example: gene name or gene symbol (e.g., SDHA), protein name (succinate dehydrogenase flavoprotein), protein symbol (SDH), EC (Enzyme Commission) number (1.3.5.1), and disease states (Leigh syndrome).
    - Type your search query into the "Query:" box, then press "go".
    - If multiple terms are entered (e.g., succinate dehydrogenase), the search engine automatically searches for files containing both words (succinate and dehydrogenase). Searches can also be constructed using the terms AND, OR (to find files containing both or either of the search terms), and NOT (to find files containing the first but not the second term).
    - On the results page, first note that the number of entries returned is given. If you get no results, refer to the "help" section, linked in the left-hand bar on the page.
    - "Description" is a brief explanation of the function of the locus.
    - The "Position" column gives the chromosomal map location of the genetic locus. Clicking the blue entry links to a visual chromosomal map with the gene marked on it.
    - The rainbow-colored "Links" column gives links to several other NCBI databases:
      "P" PubMed
      "O" Online Mendelian Inheritance in Man (OMIM)
      "R" RefSeq database
      "G" GenBank database
      "P" Protein database
      "H" Homologene database
      "U" Unigene database
      "V" Variation data: single nucleotide polymorphism (SNP) database
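The AND, OR and NOT behaviour described in the search steps maps directly onto set operations. The Python sketch below shows this on three made-up records. It illustrates the Boolean semantics only and says nothing about how LocusLink itself is implemented; the record names, texts and function names are hypothetical.

```python
# Hypothetical records: name -> searchable text.
FILES = {
    "file1": "succinate dehydrogenase flavoprotein",
    "file2": "lactate dehydrogenase",
    "file3": "succinate receptor",
}


def matching(term, files):
    """Return the names of the files whose text contains the term as a whole word."""
    wanted = term.lower()
    return {name for name, text in files.items() if wanted in text.lower().split()}


def search_and(first, second, files):
    """Files containing both terms (also the default when two words are typed)."""
    return matching(first, files) & matching(second, files)


def search_or(first, second, files):
    """Files containing either term."""
    return matching(first, files) | matching(second, files)


def search_not(first, second, files):
    """Files containing the first term but not the second."""
    return matching(first, files) - matching(second, files)


if __name__ == "__main__":
    print(sorted(search_and("succinate", "dehydrogenase", FILES)))
    print(sorted(search_or("succinate", "dehydrogenase", FILES)))
    print(sorted(search_not("succinate", "dehydrogenase", FILES)))
```

AND narrows a result set, OR widens it, and NOT removes unwanted hits, which is why combining them is useful when a plain two-word query returns too many or too few entries.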