GenBank (Genetic Sequence Databank)
Definition: GenBank (Genetic Sequence Databank) is one of the fastest-growing repositories of known
genetic sequences.
It uses a flat-file structure: each entry is an ASCII text file that can be read and downloaded by
both humans and computers.
It is maintained by the National Center for Biotechnology Information (NCBI).
Entry data contains information on:
1. The sequence;
2. Accession numbers;
3. The scientific and gene names;
4. Taxonomy/phylogenetic classification of the source organism;
5. A feature table that identifies coding regions;
6. References to published literature;
7. Transcription units; and
8. Mutation sites.
There are approximately 286,730,369,256 sequence records in the traditional GenBank divisions as of
2011.
GenBank flat file format
1. The LOCUS field: It consists of five subfields, namely:
1a Locus Name (e.g. HSHFE) - A tag for grouping similar sequences.
The first two or three letters usually designate the organism; in this case, HS stands for Homo
sapiens. The last several characters are associated with another group designation, such as the
gene product. In this example, the last three letters represent the gene symbol, HFE.
1b Sequence Length (12146 bp) - The total number of nucleotide base pairs (or amino acid
residues) in the sequence record.
1c Molecule Type (e.g. DNA) - The type of molecule that was sequenced. All sequence data in an
entry must be of the same type.
1d GenBank Division (PRI) - GenBank has different divisions.
In this example, PRI stands for primate sequences.
Other divisions include ROD (rodent sequences), MAM (other mammalian sequences), PLN (plant,
fungal, and algal sequences), and BCT (bacterial sequences).
1e Modification Date (23-July-1999) - The date of the most recent modification made to the record.
The date of first public release is not available in the sequence record; it can be obtained
only by contacting NCBI at info@ncbi.nlm.nih.gov.
2. DEFINITION - A brief description of the sequence.
The description may include the source organism name, gene or protein name, or a designation as
untranscribed or untranslated sequence (e.g., a promoter region).
For sequences containing a coding region (CDS), the definition field may also contain a
"completeness" qualifier such as "complete CDS" or "exon 1."
3. ACCESSION (Z92910) - A unique identifier assigned to a complete sequence record.
This number never changes, even if the record is modified.
An accession number is a combination of letters and numbers, usually in the format of
one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g.,
AC123456).
4. VERSION (Z92910.1) - An identification number assigned to a single, specific sequence in
the database.
This number is in the format "accession.version".
If any changes are made to the sequence data, the version part of the number increases by one,
e.g. U12345.1 becomes U12345.2.
A version number of Z92910.1 for this HFE sequence indicates that the sequence data has not been
altered since the original submission.
5. GenInfo Identifier (GI) (1890179) - Also a sequence identification number.
Whenever a sequence is changed, the version number is increased and a new GI is assigned.
If a nucleotide sequence record contains a protein translation of the sequence, the translation will
have its own GI number.
6. KEYWORDS (haemochromatosis; HFE gene) - A keyword can be any word or phrase used
to describe the sequence.
7. SOURCE (human) - Usually contains an abbreviated or common name of the source organism.
8. ORGANISM (Homo sapiens) - The scientific name (usually genus and species) and phylogenetic
lineage. Refer to the NCBI Taxonomy Homepage for more information about the classification
scheme used to construct taxonomic lineages.
9. REFERENCE - A citation of publications by the sequence authors that support the information
presented in the sequence record.
Several references may be included in one record.
References are automatically sorted from the oldest to the newest.
Cited publications are searchable by author, article or publication title, journal title, or MEDLINE
unique identifier (UID).
The UID links the sequence record to the MEDLINE record.
When the REFERENCE TITLE contains the words "Direct Submission", contact information for
the submitter(s) is provided.
10. The FEATURES table.
11. BASE COUNT - Gives the total number of adenine (A), cytosine (C), guanine (G), and thymine
(T) bases in the sequence.
12. ORIGIN - Contains the sequence data, which begins on the line immediately below the field
title.
// - The line that terminates the record.
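The relationship between ORIGIN and BASE COUNT can be sketched in a few lines of Python. This is only an illustration: the sample lines are made up (they are not taken from the HFE record), and the function names read_origin and base_count are hypothetical.

```python
from collections import Counter

# Hypothetical ORIGIN-style lines: a position number followed by
# blocks of bases. This is not real data from any GenBank record.
ORIGIN_LINES = [
    "        1 ggatccttta accgaggaga",
    "       21 ttattatagc cggagctc",
    "//",
]

def read_origin(lines):
    """Join ORIGIN lines into one sequence, dropping numbers and spaces."""
    parts = []
    for line in lines:
        if line.startswith("//"):  # the terminator ends the record
            break
        # Keep only alphabetic characters, i.e. the bases themselves.
        parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts).lower()

def base_count(sequence):
    """Return the number of a, c, g and t bases in the sequence."""
    counts = Counter(sequence)
    return {base: counts.get(base, 0) for base in "acgt"}

sequence = read_origin(ORIGIN_LINES)
print(len(sequence), base_count(sequence))
```

The four counts add up to the sequence length only when the sequence contains no ambiguity codes; a real parser would need to decide how to report other characters.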
EMBL
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the
European Bioinformatics Institute (EBI).
It incorporates and distributes nucleotide sequences from public sources.
The database is part of an international collaboration with DDBJ (Japan) and GenBank (USA).
Data are exchanged between the collaborating databases on a daily basis.
The web-based tool, Webin, is the preferred system for individual submission of nucleotide
sequences, including Third Party Annotation (TPA) and alignment data.
Automatic submission procedures are used for data from large-scale genome sequencing.
The latest data collection can be accessed via FTP, email and WWW interfaces.
The EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein
databases as well as many other specialist molecular biology databases.
For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are available that
allow external users to compare their own sequences against the data in the EMBL Nucleotide
Sequence Database, the complete genomic component subsection of the database, the WGS data
sets and other databases.
All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk.
EMBL format
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
// - Termination line.
Some entries do not contain all of the line types, and some line types occur many times in a single entry.
Each entry must begin with an identification line (ID) and end with a terminator line (//). In addition, the
following line types are always present in an entry: AC (once), DT (3 times), DE (1 or more), OS (1 or
more), OC (1 or more), RN (1 or more), RP (1 or more), RA (1 or more), RL (1 or more), SQ (once), and
at least one sequence data line. The other line types (GN, OG, RC, RM, CC, DR, KW and FT) are optional.
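A check of this kind can be sketched in Python by counting the two-character line codes in an entry. The required list below simply restates the paragraph above; the entry text is a shortened, hypothetical fragment in the style of the example, and the function names are illustrative only.

```python
from collections import Counter

# Line types the text above lists as always present, plus ID and the
# terminator. Sequence data lines (blank code) are not checked here.
REQUIRED = ["ID", "AC", "DT", "DE", "OS", "OC",
            "RN", "RP", "RA", "RL", "SQ", "//"]

# Shortened, hypothetical entry in the style of the example above.
ENTRY = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
//
"""

def count_line_types(entry_text):
    """Count how often each two-character line code occurs."""
    counts = Counter()
    for line in entry_text.splitlines():
        code = line[:2]
        # Skip sequence data (blank code) and XX spacer lines.
        if code.strip() and code != "XX":
            counts[code] += 1
    return counts

def missing_line_types(counts, required=REQUIRED):
    """Return the required line codes that never occur."""
    return [code for code in required if counts[code] == 0]

counts = count_line_types(ENTRY)
print(dict(counts))
print("missing:", missing_line_types(counts))
```

Because the fragment stops after the OC lines, the reference and SQ line types are reported as missing. Checking the exact multiplicities (for example, how many DT lines are expected) would be a natural extension.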
GenBank:
Locus name helps to group entries with similar sequences. The first three characters denote the organism,
the fourth and fifth characters give other group designations, such as gene product, and the last character
is one of a series of sequential integers.
Sequence Length contains the number of nucleotide base pairs (or amino acid residues) in the sequence
record.
Molecule Type shows the type of molecule that was sequenced.
GenBank Division shows the GenBank division to which a record belongs, indicated by a three-letter
abbreviation.
1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTG sequences (high-throughput genomic sequences)
17. HTC - unfinished high-throughput cDNA sequencing
18. ENV - environmental sampling sequences
Modification Date shows the last date of modification.
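As a rough sketch of how the LOCUS subfields fit together, the Python below splits a LOCUS line on whitespace and looks up the division code. The line is assembled from the HSHFE values quoted earlier in these notes; its exact column layout is an assumption, and parse_locus is a hypothetical helper rather than part of any library.

```python
# Division codes copied from the list above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTG sequences (high-throughput genomic sequences)",
    "HTC": "unfinished high-throughput cDNA sequencing",
    "ENV": "environmental sampling sequences",
}

def parse_locus(line):
    """Split a LOCUS line into the five subfields.

    Assumes whitespace-separated tokens in the order:
    LOCUS, name, length, 'bp', molecule type, division, date.
    """
    tokens = line.split()
    if len(tokens) != 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, _, molecule, division, date = tokens
    return {
        "name": name,
        "length": int(length),
        "molecule": molecule,
        "division": division,
        "division_name": DIVISIONS.get(division, "unknown division"),
        "date": date,
    }

# Assumed layout; only the field values come from these notes.
LOCUS_LINE = "LOCUS       HSHFE      12146 bp    DNA       PRI       23-JUL-1999"
record = parse_locus(LOCUS_LINE)
print(record)
```

Records whose LOCUS line carries extra tokens would be rejected by this sketch, which is deliberate: it fails loudly rather than guessing which token is which.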
Definition is a brief description of the sequence that includes information such as source organism, gene
name/protein name, or some description of the sequence's function.
Accession number is the unique identifier for a sequence record.
Records from the RefSeq database use accession numbers with a two-letter prefix and an underscore:
NT_123456 constructed genomic contigs
NM_123456 mRNAs
NP_123456 proteins
NC_123456 chromosomes
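The prefix convention lends itself to a small lookup. The sketch below recognises only the four prefixes listed above and treats anything else as "not a listed RefSeq prefix"; the pattern and the function name classify_accession are assumptions made for illustration.

```python
import re

# RefSeq prefixes listed above, mapped to the kind of record they label.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}

# Two uppercase letters, an underscore, then digits (e.g. NM_123456).
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)$")

def classify_accession(accession):
    """Return the record type for a listed RefSeq prefix, else None."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))

for acc in ["NM_123456", "NP_123456", "NC_123456", "Z92910"]:
    print(acc, "->", classify_accession(acc))
```

An accession such as Z92910 has no underscore-separated prefix, so the function returns None for it instead of guessing a type.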
Version shows a nucleotide sequence identification number that represents a single, specific sequence in
the GenBank database.
GI "GenInfo Identifier" is a sequence identification number for the nucleotide sequence.
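The "accession.version" form described for the VERSION field can be taken apart with plain string handling. The following is a minimal sketch; split_version and next_version are hypothetical names, and the identifiers are the examples used earlier in these notes.

```python
def split_version(identifier):
    """Split 'accession.version' into (accession, version number)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version form: {identifier!r}")
    return accession, int(version)

def next_version(identifier):
    """Return the identifier a sequence would carry after one revision."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"

print(split_version("Z92910.1"))
print(next_version("U12345.1"))
```

The accession part is left untouched, which mirrors the rule that the accession number never changes while the version suffix increases with each change to the sequence data.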
Keywords are words or phrases that describe the sequence.
Source indicates free-format information including an abbreviated form of the organism name, sometimes
followed by a molecule type.
Organism describes the formal scientific name for the source organism and its lineage.
Reference includes publications by the authors of the sequence that discuss the data reported in the record.
Authors contains the list of authors in the order in which they appear in the cited article.
Entrez Search Field: Author [AUTH]
Title represents the title of the published work or the tentative title of an unpublished work.
Entrez Search Field: Text Word [WORD]
Journal: MEDLINE abbreviation of the journal name.
Entrez Search Field: Journal Name [JOUR]
Pubmed: PubMed Identifier (PMID)
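The search-field tags above can be combined into a single query string. The sketch below builds such a string and a search URL for NCBI's E-utilities ESearch service without sending any request. The helper names are hypothetical, the database name "nucleotide" is an assumption, and whether a given server accepts exactly these tags should be checked against the current Entrez documentation.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service (ESearch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def fielded_term(value, tag):
    """Attach an Entrez search-field tag, e.g. 'Haas A[AUTH]'."""
    return f"{value}[{tag}]"

def build_query(author=None, word=None, journal=None):
    """Combine the supplied parts with AND into one query string."""
    parts = []
    if author:
        parts.append(fielded_term(author, "AUTH"))
    if word:
        parts.append(fielded_term(word, "WORD"))
    if journal:
        parts.append(fielded_term(journal, "JOUR"))
    return " AND ".join(parts)

def esearch_url(term, db="nucleotide"):
    """Build (but do not fetch) an ESearch URL for the given query."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})

query = build_query(author="Haas A", journal="Mol Gen Genet")
print(query)
print(esearch_url(query))
```

Only the URL is constructed here; fetching it and parsing the response are left out so that the example stays independent of network access.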
Features shows information about genes and gene products, as well as regions of biological significance
reported in the sequence.
Source is a mandatory feature in each record that summarizes the length of the sequence, the scientific name
of the source organism, and the Taxon ID number. It can also include other information such as map location,
strain, clone, tissue type, etc., if provided by the submitter.
Taxon is a stable unique identification number for the taxon of the source organism.
CDS (coding sequence) represents the region of nucleotides that corresponds to the sequence of amino
acids in a protein.
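To make the link between a CDS location and the encoded protein concrete, the sketch below takes the location 109..717 from the EMBL example shown earlier and works out its length in nucleotides and codons. It handles only the plain "start..end" form; the function names are hypothetical.

```python
def parse_simple_location(location):
    """Parse a 'start..end' feature location into 1-based integers.

    Only this plain form is handled; operators such as join() or
    complement() are outside the scope of this sketch.
    """
    start_text, sep, end_text = location.partition("..")
    if not sep or not start_text.isdigit() or not end_text.isdigit():
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start after end in location: {location!r}")
    return start, end

def cds_summary(location):
    """Return the CDS length and how many codons it spans."""
    start, end = parse_simple_location(location)
    length = end - start + 1  # both ends are inclusive
    codons, remainder = divmod(length, 3)
    return {"length": length, "codons": codons,
            "whole_codons": remainder == 0}

print(cds_summary("109..717"))
```

A length that is a multiple of three is what one would expect for a complete coding region; if the final codon is a stop codon, the protein is one residue shorter than the codon count.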
Protein sequence databases
Introduction:
The Protein database is a collection of sequences from several sources, including translations from
annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF,
and PDB. Protein sequences are the fundamental determinants of biological structure and function.
SWISS-PROT
– Manually curated
– high-quality annotations, less data
GenPept/TREMBL
– Translated coding sequences from GenBank/EMBL
– Few annotations, more up to date
PIR
– Phylogenetic-based annotations
All three are now combining efforts to form UniProt (http://www.uniprot.org)
PDB (Protein Data Bank)
– Stores three-dimensional atomic coordinates for biological molecules, including proteins and
nucleic acids
– Data obtained by X-ray crystallography, NMR, or computer modelling (http://www.rcsb.org/pdb/)
MMDB (Molecular Modelling Database)
– Over 28,000 3D macromolecular structures, including proteins and polynucleotides
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
SCOP (Structural Classification of Proteins)
– Classification of proteins according to structural and evolutionary relationships
SWISS-PROT
Introduction:
SWISS-PROT is an annotated protein sequence database that was created at the Department of
Medical Biochemistry of the University of Geneva and has been a collaborative effort of the Department
and the European Molecular Biology Laboratory (EMBL) since 1987. SWISS-PROT is now an equal
partnership between the EMBL and the Swiss Institute of Bioinformatics (SIB). The EMBL activities are
carried out by its Hinxton Outstation, the European Bioinformatics Institute (EBI). The SWISS-PROT
protein sequence database consists of sequence entries. Sequence entries are composed of different line
types, each with its own format.
The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct
criteria:
(i) annotations,
(ii) minimal redundancy, and
(iii) integration with other databases.
Annotations
CORE DATA
• The sequence data
• The citation information (bibliographical references)
• The taxonomic data (description of the biological source of the protein)
Annotation - Additional Data
• Descriptions include:
• Function(s) of the protein
• Posttranslational modification(s) such as carbohydrates, phosphorylation, acetylation and GPI-anchor
• Domains and sites, for example, calcium-binding regions, ATP-binding sites, zinc fingers,
homeoboxes, and SH2 and SH3 domains
• Secondary structure, e.g. alpha helix, beta sheet
• Quaternary structure, e.g. homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with any number of deficiencies in the protein
• Sequence conflicts, variants, etc.
Minimal Redundancy
• Much of the data comes from more than one literature report
• Data are condensed and merged to appear more concise and coherent
• Conflicts in data are listed for each entry
Integration with other databases
• 50+ databases for cross-reference
• Nucleic acid sequences, protein tertiary structure, protein 3-D models, etc.
• Allows SWISS-PROT to play a major role as the focal point for biomolecular interconnectivity
Documentation
• All files documented and indexed
• Documentation kept up-to-date
Applications for the Knowledgebase
• Provides highly organized data and information on a wide variety of proteins
• Can be used as a starting point for protein research
• Allows searches to be conducted starting with various search strings
• Biochemical encyclopedia
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
(blanks) - Sequence data.
// - Termination line.
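The line codes above can be used as a simple lookup table when reading an entry line by line. The Python below is a minimal sketch: the table restates the list, while the sample lines are hypothetical placeholders (the sequence line in particular is not real data) and describe_line is an illustrative name.

```python
# Line codes and meanings copied from the list above.
LINE_TYPES = {
    "ID": "Identification",
    "AC": "Accession number(s)",
    "DT": "Date",
    "DE": "Description",
    "GN": "Gene name(s)",
    "OS": "Organism species",
    "OG": "Organelle",
    "OC": "Organism classification",
    "RN": "Reference number",
    "RP": "Reference position",
    "RC": "Reference comments",
    "RX": "Reference cross-references",
    "RA": "Reference authors",
    "RL": "Reference location",
    "CC": "Comments or notes",
    "DR": "Database cross-references",
    "KW": "Keywords",
    "FT": "Feature table data",
    "SQ": "Sequence header",
    "//": "Termination line",
}

def describe_line(line):
    """Return the meaning of a line based on its two-character code."""
    if not line.strip():
        return "Blank line"
    code = line[:2]
    if code == "  ":  # sequence data lines start with blanks
        return "Sequence data"
    return LINE_TYPES.get(code, "Unknown line type")

# Hypothetical lines for illustration only.
SAMPLE = [
    "AC   P28763;",
    "OS   Listeria ivanovii.",
    "SQ   SEQUENCE (header details omitted)",
    "     AAAAAAAAAA CCCCCCCCCC",
    "//",
]

for line in SAMPLE:
    print(f"{describe_line(line):28} | {line}")
```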
Data retrieval tools
These tools are dedicated to giving molecular biologists access to information.
The most widely used are:
1. Entrez
2. DBGET
3. SRS
Each of these allows:
- Text-based searching of a number of linked databases (DBs).
- Sequence searching.
They differ in:
- The DBs they cover.
- How the retrieved information is accessed and presented.
Entrez
- WWW-based data retrieval system.
- Developed by NCBI (National Center for Biotechnology Information).
- Integrates information held in different DBs.
Databases covered by Entrez are:
Nucleic acid - GenBank, RefSeq, PDB.
Protein seqs - SWISS-PROT, PIR.
3D structures – MMDB
Genomes – Many sources
PopSet – From GenBank
OMIM – OMIM
Taxonomy – NCBI taxonomy database
Books- Bookshelf
ProbeSet – GEO (Gene Expression Omnibus)
Literature - PubMed
SRS (Sequence Retrieval System)
- Data retrieval tool developed by EBI
- Integrates 80 molecular biology DBs
- Open-source software (can be installed locally)
SRS has an associated scripting language called Icarus
Central resource for molecular biology data
- More than 250 databanks have been indexed. More than 35 SRS servers are available over the WWW (worldwide)
Data analysis applications server
- 11 protein applications
- 6 nucleic acid applications
- Uniform query interface on the web
History of SRS
1990 - Main author Dr. Thure Etzold
– Development started in EMBL, Heidelberg
1997
– Moved to EBI in Cambridge. Development work was supported by various grants amongst
others from the EMBnet.
1998
– Etzold and his group joined LION Bioscience
Information retrieval
– Easy way to retrieve information from sequence and sequence-related databases
– Possibility to search for multiple words/other criteria
Linkage between different databases
– E.g. Find all primary structures with known three-dimensional structure.
Different types of database in SRS
Sequence & structure
– DNA, protein, three-dimensional structures
Sequence-related
Gene-related
– Genome, mapping, mutations, transcription factors
– SNP
Bibliographic
– Medline, enzyme
User-defined
SRS main toolbar tabs:
Top Page: displays databases in different database groups
Query: displays either the standard or extended query form
Results or “the query manager”: maintains a history of all the results obtained during a session
Projects or “the project manager”: maintains a history of all queries and views used during a
session
Views: allows a user to define a user specific view for one or more databases
Databanks: contains a list and some facts about the databases available in the system
Search terms in SRS
SRS indexed fields can be searched using any of the following:
– Single word search
– Multiple word phrases
– Numbers and dates
– Regular expressions
– Wildcards
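The difference between a wildcard search and a regular-expression search can be illustrated outside SRS. The sketch below uses Python's fnmatch and re modules on a made-up list of index terms; it shows the general idea only and does not reproduce SRS's own query syntax.

```python
import fnmatch
import re

# Hypothetical index terms, standing in for words in an indexed field.
TERMS = ["kinase", "kinesin", "superoxide", "dismutase", "dehydrogenase"]

def wildcard_search(terms, pattern):
    """Match terms against a shell-style wildcard (* and ?)."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]

def regex_search(terms, pattern):
    """Match whole terms against a regular expression."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]

# Wildcard: every term starting with "kin".
print(wildcard_search(TERMS, "kin*"))
# Regular expression: terms starting with "d" and ending in "ase".
print(regex_search(TERMS, r"d\w+ase"))
```

Wildcards cover the common "starts with" or "ends with" cases compactly, while regular expressions allow constraints on both ends or on the characters in between.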
LocusLink
Introduction:
LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) is a National Center for Biotechnology
Information (NCBI) online resource. It is principally intended for use by graduate students and
professional researchers in the biomedical sciences. It is designed to bring together related information on
genetic loci and gene products from several sources. LocusLink provides a central point of access for basic
biomedical information and molecular data for genes, transcripts, and proteins from model organisms,
currently including human, rat, mouse, fruit fly, and zebrafish.
How LocusLink relates to PubMed, RefSeq, and other NCBI databases
NCBI has a large and growing number of search tools for biologists to obtain information. A few
of these include:
PubMed: a searchable biomedical literature citation index. For a given genetic locus, LocusLink
leads directly to a short list of PubMed citations for that gene. (This list usually includes reports pertaining
to central genetic or molecular biological discoveries, and to reports on disease-causing alleles, for the gene
in question.)
RefSeq: another NCBI database. RefSeq (Reference Sequence) entries are intended to serve
as "authority files" for genetic sequence information. For a given genetic open reading frame, RefSeq
provides a curated file on the gene sequence and its transcriptional and translational processing (where
available). A professional review process helps to ensure the biological accuracy of these authority files.
RefSeq files are accessible directly from the LocusLink entry for the genetic locus in question.
OMIM (Online Mendelian Inheritance in Man): a database of human genes and genetic diseases,
including knowledge of their molecular and physiological roles and causes. The writeups for genetic loci
and their roles in physiology are often extensive and are frequently updated. OMIM files are accessible
directly from the LocusLink entry for the genetic locus in question.
GenBank, Protein Database, Homologene, UniGene, genetic variations database (single
nucleotide polymorphisms): links to gene-specific information from each of these databases are directly
available from the LocusLink entry for the genetic locus in question.
Steps involved in using LocusLink
Go to the LocusLink home page: http://www.ncbi.nlm.nih.gov/LocusLink.
Although an alphabetical list of entries is available, LocusLink can be most easily searched using
the query box at the top of the page.
Users can enter a wide variety of terms, for example: gene name or gene symbol (e.g., SDHA),
protein name (succinate dehydrogenase flavoprotein), protein symbol (SDH), EC (Enzyme
Commission) number (1.3.5.1), and disease states (Leigh syndrome).
Type your search query into the "Query:" box, then press "go".
If multiple terms are entered (e.g., succinate dehydrogenase), the search engine automatically
searches for files containing both words (succinate and dehydrogenase). Searches can
also be constructed using the terms AND (to find files containing both search terms), OR (to find
files containing either search term), and NOT (to find files containing the first but not the second term).
On the results page, first note that the number of entries returned is given. If you get no results,
refer to the "help" section, linked in the left-hand bar on the page.
"Description" is a brief explanation of the function of the locus.
The "Position" column gives the chromosomal map location of the genetic locus. Clicking the blue
entry links to a visual chromosomal map with the gene marked on it.
The rainbow-colored "Links" column gives links to several other NCBI databases:
“P” PubMed
“O” Online Mendelian Inheritance in Man (OMIM)
“R” RefSeq database
“G” GenBank database
“P” Protein database
“H” Homologene database
“U” Unigene database
“V” Variation data: single nucleotide polymorphism (SNP) database
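The AND, OR and NOT searches described in the steps above correspond to set intersection, union and difference. The sketch below shows this on three made-up records; it illustrates the logic only and says nothing about how LocusLink implements its search.

```python
# Hypothetical records: an identifier mapped to its descriptive text.
RECORDS = {
    "rec1": "succinate dehydrogenase flavoprotein",
    "rec2": "lactate dehydrogenase",
    "rec3": "succinate receptor",
}

def matching(term):
    """Return the ids of records whose text contains the term as a word."""
    wanted = term.lower()
    return {rid for rid, text in RECORDS.items()
            if wanted in text.lower().split()}

succinate = matching("succinate")
dehydrogenase = matching("dehydrogenase")

both = succinate & dehydrogenase        # AND: records with both terms
either = succinate | dehydrogenase      # OR: records with either term
first_only = succinate - dehydrogenase  # NOT: first term but not second

print(sorted(both), sorted(either), sorted(first_only))
```

Entering two words without an operator behaves like the AND case here, which matches the default described for multi-word queries.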