2. Contents
• Introduction
• Classification of databases
• Primary databases
• Nucleic acid databases
GenBank
EMBL
DDBJ
• Protein sequence databases
SWISS-PROT
UNIPROT
PIR
• Protein structure database
PDB
• Conclusion
• References
3. Introduction
• Bioinformatics databases, or biological databases, are storehouses of
biological information.
• They can be defined as libraries containing data collected from
scientific experiments, published literature and computational
analysis.
• They provide users with an interface for easy and efficient
recording, storage, analysis and retrieval of biological data through
computer software.
• Biological data come in several different formats, such as text, sequence
data, structures and links, and these need to be taken into account
when creating the databases.
5. CLASSIFICATION OF DATABASES
The databases can be classified into 3 categories on the basis of the
information stored.
• Primary Database
• Secondary Database
• Composite Database
6. Primary Database
• Primary databases (also known as data repositories) are highly
organised, user-friendly gateways to the huge amount of biological data
produced by researchers around the world.
• The primary databases were first developed for the storage of
experimentally determined DNA and protein sequences in the 1980s and
90s.
• Nowadays, sequence submissions are made by individual laboratories,
as well as “in bulk” by sequencing centres around the world.
• Most protein sequences found in databases are the product of
conceptual translation of the genes and genomes determined using DNA
sequencing.
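Conceptual translation can be illustrated with a minimal Python sketch. It assumes the standard genetic code (NCBI translation table 1), reads the sequence in frame 1 only, and stops at the first stop codon; the function name translate is introduced here purely for illustration and is not part of any database tool.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate coding DNA in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Walk the sequence codon by codon, ignoring any trailing partial codon.
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        # Codons with ambiguous bases (e.g. N) are reported as X.
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA
```

Real annotation pipelines also handle alternative genetic codes, reverse strands and spliced genes, which this sketch leaves out.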
7. Primary databases
• Primary databases are also called archival databases.
• They are populated with experimentally derived data such as
nucleotide sequence, protein sequence or macromolecular structure.
• Experimental results are submitted directly into the database by
researchers, and the data are essentially archival in nature.
• Once given a database accession number, the data in primary
databases are never changed: they form part of the scientific record.
8. • Once data are deposited in primary databases, they can be accessed
freely by anyone around the world.
• For example, researchers are working on a Staphylococcus aureus strain
that was isolated from a patient.
• After some investigations, the researchers suspect that this strain might
be genetically different from previously identified strains.
• They decide to sequence it and, after comparing it with the DNA sequences
already placed in the public repository (“known” strains), they conclude
that their strain is indeed different.
• The research community will benefit from having this new sequence in
the public repository: the next time a researcher finds the same
strain, they will be able to tell whether their isolate is novel or
related to strains previously sequenced.
11. • There are three nucleotide repositories, or primary databases, for the
submission of nucleotide and genome sequences:
• GenBank, hosted by the National Center for Biotechnology Information
(NCBI).
• The European Nucleotide Archive (ENA), hosted by the European
Bioinformatics Institute (EMBL-EBI), part of the European Molecular
Biology Laboratory (EMBL).
• The DNA Data Bank of Japan (DDBJ), hosted by the National Institute of
Genetics (NIG).
12. GenBank
• The GenBank sequence database is an open access, annotated
collection of all publicly available nucleotide sequences and their
protein translations.
• It is produced and maintained by the National Center for
Biotechnology Information as part of the International Nucleotide
Sequence Database Collaboration.
• Data formats: XML; ASN.1; GenBank flat file format
• Data types captured: Nucleotide sequence; Protein sequence
• A GenBank release occurs every two months and is available from
the NCBI FTP site.
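As an illustration of the GenBank flat file format, the sketch below pulls the accession, definition and sequence out of a record. The record TOY_RECORD is made up for this example and is not a real GenBank entry, and parse_flat_file is a hypothetical helper that only handles single-line DEFINITION fields; for real work an established parser is the safer choice.

```python
# A made-up toy record laid out like a GenBank flat file (not a real entry).
TOY_RECORD = """\
LOCUS       TOY00001                  12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Made-up example record, not a real GenBank entry.
ACCESSION   TOY00001
VERSION     TOY00001.1
ORIGIN
        1 atggcctaat ga
//
"""


def parse_flat_file(text: str) -> dict:
    """Extract accession, definition and sequence from one flat file record."""
    record = {"accession": None, "definition": None, "sequence": ""}
    in_sequence = False
    chunks = []
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence:
            # Sequence lines carry a position number and spaces; keep letters only.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_sequence = True  # sequence data follows the ORIGIN line
    record["sequence"] = "".join(chunks).upper()
    return record


parsed = parse_flat_file(TOY_RECORD)
print(parsed["accession"], len(parsed["sequence"]))
```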
14. Access to GenBank
• There are several ways to search and retrieve data from GenBank.
• Search GenBank for sequence identifiers and annotations with Entrez
Nucleotide.
• Search and align GenBank sequences to a query sequence
using BLAST (Basic Local Alignment Search Tool). See BLAST info for
more information about the numerous BLAST databases.
• Search, link, and download sequences programmatically using NCBI
E-utilities.
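Programmatic access through E-utilities might look like the following Python sketch, which builds an EFetch request for a nucleotide record. The helper names build_efetch_url and fetch_record are hypothetical, and U00096.3 is used only as an example accession; NCBI's usage guidelines (request rate limits, API keys) are not handled here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI EFetch utility.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "fasta") -> str:
    """Build an EFetch URL for one nucleotide record.

    rettype="fasta" asks for FASTA; rettype="gb" asks for the flat file.
    """
    params = {
        "db": "nuccore",      # Entrez Nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_record(accession: str, rettype: str = "fasta") -> str:
    """Download the record as text (requires network access)."""
    with urlopen(build_efetch_url(accession, rettype), timeout=30) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_record(...) to actually download.
print(build_efetch_url("U00096.3"))
```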
GenBank Data Usage
• NCBI places no restrictions on the use or distribution of the GenBank
data.
• However, some submitters may claim patent, copyright, or other
intellectual property rights in all or a portion of the data they have
submitted.
15. EMBL
• The European Molecular Biology Laboratory (EMBL) Nucleotide
Sequence Database is maintained at the European Bioinformatics
Institute (EBI) in an international collaboration with the DNA Data Bank
of Japan (DDBJ) and GenBank (USA).
• EMBL itself was founded in 1974; the nucleotide sequence database
(originally the EMBL Data Library) was established in 1980.
• Data is exchanged amongst the collaborative databases on a daily basis.
• The major contributors to the EMBL database are individual authors and
genome project groups.
• WEBIN is the preferred web-based submission system for individual
submitters, while automatic procedures allow incorporation of sequence
data from large-scale genome sequencing centres and from the
European Patent Office (EPO).
17. • Database releases are produced quarterly. Network services allow free
access to the most up-to-date data collection via Internet and WWW
interfaces.
• EBI’s Sequence Retrieval System (SRS) is a network browser for
databanks in molecular biology, integrating and linking the main
nucleotide and protein databases plus many specialised databases.
• For sequence similarity searching a variety of tools (e.g., BLITZ, FASTA,
BLAST) are available which allow external users to compare their own
sequences against the most currently available data in the EMBL
Nucleotide Sequence Database and SWISS-PROT.
• Accessed through the URL http://www.ebi.ac.uk/embl
27. PIR database
• Protein Information Resource database
• Established in 1984 by the National Biomedical Research Foundation (NBRF)
• It is an integrated public bioinformatics resource that supports genomic
and proteomic research and scientific studies.
• It assists researchers in the identification and interpretation of protein
sequence information.
• PIR can be searched by entry or by sequence similarity.
• It can be accessed at http://www.pir.georgetown.edu/.
• PIR offers a variety of resources mainly oriented to assist the propagation
and standardization of protein annotation.
32. Conclusion
• Bioinformatics databases are storehouses of biological information.
• Primary databases are populated with experimentally derived data such as
nucleotide sequences, protein sequences and macromolecular structures.
• Experimental results are submitted directly into the database by
researchers, and the data are essentially archival in nature.
• Once given a database accession number, the data in primary
databases are never changed: they form part of the scientific record.
• Examples include GenBank, EMBL, DDBJ, PIR, SWISS-PROT, UNIPROT and
PDB.