Major biological nucleotide databases

Department of Zoology, GACW (2018-2019) Page 1
Major Biological databases
Introduction:
A database is a convenient system for storing, searching, and retrieving any type of data.
It makes large amounts of data easy to handle and share. Biological databases are libraries of life-science
information collected from scientific experiments, published literature, high-throughput experimental
technology, and computational analysis. They contain information from genomics, proteomics,
microarray gene expression, etc.
Variants of Biological Database
1. Primary Database
2. Secondary database
3. Composite Database
Primary databases:
 Contains original data from the researchers.
 Public or open access mostly.
[Figure: classification of biological databases into transcriptome databases, structure databases, genome databases, sequence databases (nucleotide and protein), and model organism databases (PlasmoDB, TAIR, etc.)]

E.g.: NCBI GenBank, EMBL, DDBJ
Secondary databases:
A secondary database contains additional information derived from the analysis of data
available in primary databases. Its entries may be created manually or generated automatically.
E.g.: TrEMBL, Pfam, Profiles, SCOP, CATH
GenBank (Genetic Sequence Databank)
Introduction:
 GenBank® is the genetic sequence database at the National Center for Biotechnology Information
(NCBI).
 It was established in 1982 and is now maintained by the National Center for Biotechnology
Information (NCBI).
 DNA sequences can be submitted to GenBank using several different methods.
 It contains publicly available nucleotide sequences for more than 240,000 named organisms,
obtained primarily through submissions from individual laboratories and batch submissions from
large-scale sequencing projects.
 Records use a flat-file structure: an ASCII text file that both humans and computers can read
and download.
Sequence Submission:
 GenBank is built by direct submissions from individual laboratories, as well as from bulk
submissions from large-scale sequencing centers.
 Only original sequences can be submitted to GenBank.
 Direct submissions are made to GenBank using BankIt, which is a Web-based form, or the
stand-alone submission program, Sequin.
 Upon receipt of a sequence submission, the GenBank staff checks the originality of the data,
assigns an accession number to the sequence, and performs quality assurance checks.
 The submissions are then released to the public database, where the entries are retrievable
by Entrez or downloadable by FTP.
 Bulk submissions of Expressed Sequence Tag (EST), Sequence-tagged site (STS), Genome
Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are most often
submitted by large-scale sequencing centers.
 The GenBank direct submissions group also processes complete microbial genome sequences.

GenBank flat file Format
GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence section.
The start of the annotation section is marked by a line beginning with the word "LOCUS". The start of
the sequence section is marked by a line beginning with the word "ORIGIN", and the end of the section is
marked by a line containing only "//".
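As a minimal sketch of that layout, the Python function below splits one flat-file record into its annotation lines and its sequence string, using only the three markers named above (LOCUS, ORIGIN and //). The function name split_genbank_record and the shortened sample record are illustrative assumptions, not part of GenBank's own tooling.

```python
def split_genbank_record(text):
    """Split one GenBank flat-file record into (annotation_lines, sequence)."""
    annotation = []
    sequence_parts = []
    section = None  # None until LOCUS is seen, then "annotation" or "sequence"
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            section = "annotation"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN line itself carries no sequence data
        elif line.strip() == "//":
            break  # end of the record
        if section == "annotation":
            annotation.append(line)
        elif section == "sequence":
            # Keep only letters: this drops position numbers and spaces.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
    return annotation, "".join(sequence_parts)


# Shortened, made-up record used only to exercise the function.
sample = """LOCUS       EXAMPLE1      20 bp    DNA     PRI       23-JUL-1999
DEFINITION  Hypothetical example record.
ORIGIN
        1 gatcctccat atacaacggt
//
"""

annotation, sequence = split_genbank_record(sample)
print(len(annotation), sequence)
```

Real records contain many more annotation lines, but the same three markers delimit them.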

1. The LOCUS field:
It consists of five different subfields, namely:
 1a Locus Name (e.g. HSHFE) - It is a tag for grouping similar sequences.
 The first two or three letters usually designate the organism.
 In this case HS stands for Homo sapiens. The last several characters are associated with another
group designation, such as gene product. In this example, the last three characters represent the gene
symbol, HFE.
 1b Sequence Length (12146 bp) – It is the total number of nucleotide base pairs (or amino acid
residues) in the sequence record.
 1c Molecule Type (e.g. DNA) - Type of molecule that was sequenced.
 1d GenBank Division (PRI) - GenBank has different divisions.
 In this example, PRI stands for primate sequences.
 Other divisions include ROD (rodent sequences), MAM (other mammal sequences), PLN (plant,
fungal, and algal sequences), and BCT (bacterial sequences).
 1e Modification Date (23-July-1999) - Date of most recent modification made to the record.
2. DEFINITION – It is a brief description of the sequence.
 The description may include source organism name, gene or protein name, or designation as
untranscribed or untranslated sequences (e.g., a promoter region).
 For sequences containing a coding region (CDS), the definition field may also contain a
“completeness” qualifier such as "complete CDS" or "exon 1."
3. ACCESSION (Z92910): – It is a unique identifier assigned to a complete sequence record.
 This number never changes, even if the record is modified.
4. VERSION (Z92910.1) – It is an identification number assigned to a single, specific sequence in
the database.
 This number is in the format “accession.version.”
 If any changes are made to the sequence data, the version part of the number will increase by one.
 E.g. U12345.1 becomes U12345.2.
5. GenInfo Identifier (GI) (1890179) - Also a sequence identification number.
 Whenever a sequence is changed, the version number is increased and a new GI is assigned.
6. KEYWORDS (haemochromatosis; HFE gene) – A keyword can be any word or phrase used
to describe the sequence.
7. SOURCE (human) - Usually contains an abbreviated or common name of the source organism.
8. ORGANISM (Homo sapiens) - The scientific name (usually genus and species) of the source organism.
9. REFERENCE – Citations of publications by the sequence authors that support the information
presented in the sequence record.
 Several references may be included in one record.
 References are automatically sorted from the oldest to the newest.
 Cited publications are searchable by author, article or publication title, journal title, or MEDLINE
unique identifier (UID).

10. FEATURES - The feature table, which lists the locations of coding regions and other
biologically significant sites.
11. BASE COUNT - Gives the total number of adenine (A), cytosine (C), guanine (G), and
thymine (T) bases in the sequence.
12. ORIGIN - Contains the sequence data, which begins on the line immediately below the field
title.

//
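The fields described above lend themselves to simple string handling. The sketch below is a rough illustration only. It assumes a LOCUS line whose subfields are separated by whitespace in the order given in item 1 (name, length, "bp", molecule type, division, date); this fits the example values quoted above but not every record, since some LOCUS lines carry extra tokens. All helper names are hypothetical.

```python
from collections import Counter


def parse_locus(line):
    """Split a simple LOCUS line into the five subfields described above."""
    tokens = line.split()
    # Expected tokens: LOCUS, name, length, "bp", molecule type, division, date
    if len(tokens) != 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("LOCUS line does not match the simple layout assumed here")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "molecule_type": tokens[4],
        "division": tokens[5],
        "date": tokens[6],
    }


def split_version(version):
    """Split an 'accession.version' identifier into its two parts."""
    accession, _, number = version.rpartition(".")
    if not accession or not number.isdigit():
        raise ValueError("expected the form accession.version")
    return accession, int(number)


def next_version(version):
    """Return the identifier a record would carry after one sequence change."""
    accession, number = split_version(version)
    return f"{accession}.{number + 1}"


def base_count(sequence):
    """Count A, C, G and T in a sequence, as the BASE COUNT field does."""
    counts = Counter(sequence.upper())
    return {base: counts[base] for base in "ACGT"}


locus = parse_locus("LOCUS       HSHFE      12146 bp    DNA     PRI       23-JUL-1999")
print(locus["name"], locus["length"], locus["division"])
print(next_version("U12345.1"))
print(base_count("gatcctccat"))
```

Note that the accession part is left untouched by next_version, which mirrors the rule that the accession number never changes while the version number increases.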
 The GenBank division field shows the division to which a record belongs and is indicated by a
three-letter abbreviation.
1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTG sequences (high-throughput genomic sequences)
17. HTC - unfinished high-throughput cDNA sequencing
18. ENV - environmental sampling sequences
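For scripts that read the division code from a record, the list above can be held in a small lookup table. The sketch below copies the abbreviations and descriptions from the list; the names GENBANK_DIVISIONS and describe_division are illustrative only.

```python
# Division codes and descriptions, copied from the list above.
GENBANK_DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTG sequences (high-throughput genomic sequences)",
    "HTC": "unfinished high-throughput cDNA sequencing",
    "ENV": "environmental sampling sequences",
}


def describe_division(code):
    """Return the description for a three-letter division code."""
    return GENBANK_DIVISIONS.get(code.upper(), "unknown division")


print(describe_division("PRI"))
```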
European Molecular Biology Laboratory (EMBL)
 The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution
supported by 22 member states, four prospect and two associate member states.
 EMBL was created in 1974 and is an inter-governmental organization funded by public research
money from its member states.
 The Laboratory operates from five sites: the main laboratory in Heidelberg, and outstations in
Hinxton (the European Bioinformatics Institute (EBI), in England), Grenoble (France), Hamburg
(Germany), and Monterotondo (near Rome).

 EMBL groups and laboratories perform basic research in molecular biology and molecular
medicine as well as training for scientists, students and visitors.
 Israel is the only Asian state that has full membership.
 The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), maintained at the
European Bioinformatics Institute (EBI), incorporates and distributes nucleotide sequences
from public sources.
 The database is a part of an international collaboration with DDBJ (Japan) and GenBank (USA).
 Data are exchanged between the collaborating databases on a daily basis.
 The web-based tool, Webin, is the preferred system for individual submission of nucleotide
sequences, including Third Party Annotation (TPA) and alignment data.
 Automatic submission procedures are used for submission of data from large-scale genome
sequencing.
 The latest data collection can be accessed via FTP, email and WWW interfaces.
 The EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and
protein databases as well as many other specialist molecular biology databases.
 For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are available that
allow external users to compare their own sequences against the data in the EMBL Nucleotide
Sequence Database and other databases.
 All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk.

The EMBL Nucleotide Sequence database
 The main activity of the group is the development, maintenance and distribution of a
comprehensive database of nucleotide sequences.
 The EMBL nucleotide sequence database, produced in collaboration with GenBank and the DNA
Data Bank of Japan (DDBJ), is Europe’s primary nucleotide sequence data resource.
 Each of these three groups collects a portion of the total sequence data reported worldwide. All
new and updated database entries are exchanged between the groups on a daily basis.
 Important sources of data have been secured through collaborations with genomic sequencing
projects and other groups, such as phylogenetic research groups, who produce large quantities of
new nucleotide sequence data.
 A typical entry (Flat File) contains a sequence, a brief description for cataloging purposes, the
taxonomic description of the source organism, bibliographic information, and the feature table,
containing locations of coding regions and other biologically significant sites.
EMBL flat file format
ID LISOD standard; DNA; PRO; 756 BP.
XX
AC X64011; S78972;
XX
SV X64011.1
XX
DT 28-APR-1992 (Rel. 31, Created)
DT 30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE L.ivanovii sod gene for superoxide dismutase
XX
KW sod gene; superoxide dismutase.
XX
OS Listeria ivanovii
OC Bacteria; Firmicutes; Bacillus/Clostridium group;
OC Bacillus/Staphylococcus group; Listeria.
XX
RN [1]
RX MEDLINE; 92140371.
RA Haas A., Goebel W.;
RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT functional complementation in Escherichia coli and characterization of the
RT gene product.";
RL Mol. Gen. Genet. 231:313-322(1992).
XX
RN [2]
RP 1-756
RA Kreft J.;
RT ;
RL Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL Hubland, 8700 Wuerzburg, FRG
XX
DR SWISS-PROT; P28763; SODM_LISIV.
XX
FH Key Location/Qualifiers
FH
FT source 1..756
FT /db_xref="taxon:1638"
FT /organism="Listeria ivanovii"
FT /strain="ATCC 19119"
FT RBS 95..100
FT /gene="sod"
FT terminator 723..746
FT /gene="sod"
FT CDS 109..717
FT /db_xref="SWISS-PROT:P28763"
FT /transl_table=11
FT /gene="sod"
FT /EC_number="1.15.1.1"
FT /product="superoxide dismutase"
FT /protein_id="CAA45406.1"
FT /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
cgttatttaaggtgttacatagttctatggaaatagggtctatacctttcgccttacaat 60
gtaatttcttttcacataaataataaacaatccgaggaggaatttttaatgacttacgaa 120
ttaccaaaattaccttatacttatgatgctttggagccgaattttgataaagaaacaatg 180
gaaattcactatacaaagcaccacaatatttatgtaacaaaactaaatgaagcagtctca 240
ggacacgcagaacttgcaagtaaacctggggaagaattagttgctaatctagatagcgtt 300
cctgaagaaattcgtggcgcagtacgtaaccacggtggtggacatgctaaccatacttta 360
ttctggtctagtcttagcccaaatggtggtggtgctccaactggtaacttaaaagcagca 420
atcgaaagcgaattcggcacatttgatgaattcaaagaaaaattcaatgcggcagctgcg 480
gctcgttttggttcaggatgggcatggctagtagtgaacaatggtaaactagaaattgtt 540
tccactgctaaccaagattctccacttagcgaaggtaaaactccagttcttggcttagat 600
gtttgggaacatgcttattatcttaaattccaaaaccgtcgtcctgaatacattgacaca 660
ttttggaatgtaattaactgggatgaacgaaataaacgctttgacgcagcaaaataatta 720
tcgaaaggctcacttaggtgggtctttttatttcta 756
//
Description of flat file information:
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
(blanks) - Sequence data.
// - Termination line.
Some entries do not contain all of the line types, and some line types occur many times in a single entry.
Each entry must begin with an identification line (ID) and end with a terminator line (//).
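Because every annotation line starts with a two-character line code, an entry can be read by grouping lines under their codes. The following sketch assumes that the line code is followed by whitespace and that everything between the SQ line and the // terminator is sequence data. The function name parse_embl_entry and the shortened sample entry (including its identifiers) are hypothetical and not an EBI tool.

```python
def parse_embl_entry(text):
    """Group the lines of one EMBL flat-file entry by line code."""
    fields = {}          # line code -> list of line contents, in file order
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break                      # termination line
        if in_sequence:
            # Sequence lines carry bases plus a running position count;
            # keep only the letters.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not code.strip():
            continue                   # spacer line or empty line
        fields.setdefault(code, []).append(content)
        if code == "SQ":
            in_sequence = True         # sequence data starts on the next line
    return fields, "".join(sequence_parts)


# Shortened, made-up entry used only to exercise the function.
sample = """ID   DEMO1      standard; DNA; PRO; 20 BP.
XX
AC   ZZ000001;
XX
DE   Hypothetical example entry
XX
OC   Bacteria; Firmicutes;
OC   Listeria.
XX
SQ   Sequence 20 BP; 5 A; 2 C; 4 G; 9 T; 0 other;
     cgttatttaa ggtgttacat        20
//
"""

fields, sequence = parse_embl_entry(sample)
print(fields["AC"], fields["OC"], sequence)
```

Storing each code's lines in a list reflects the point made above that some line types occur many times in a single entry.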
