SlideShare ist ein Scribd-Unternehmen logo
1 von 17
GenBank Databases
Hafiz.M.Zeeshan.Raza
Research Associate_HEC_NRPU
hafizraza26@gmail.com
COMSATS UNIVERSITY SAHIWAL
Overview
• Introduction
• Sections of Database
• Importance of GenBank
Historical background
• The first major bioinformatics project was undertaken by Margaret Dayhoff in 1965,
who developed a first protein sequence database called Atlas of Protein Sequence
and Structure.
• Subsequently, in the early 1970s, the Brookhaven National Laboratory established
the Protein Data Bank for archiving three-dimensional protein structures.
• The first sequence alignment algorithm was developed by Needleman and Wunsch
in 1970. This was a fundamental step in the development of the field of
bioinformatics, which paved the way for the routine sequence comparisons and
database searching practiced by modern biologists.
• The 1980s saw the establishment of GenBank and the development of fast database
searching algorithms such as FASTA by William Pearson and BLAST by Stephen
Altschul and coworkers.
Introduction
• GenBank is the most complete collection of annotated nucleic acid
sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high throughput
raw sequence data, and sequence polymorphisms.
• There is also a GenPept database for protein sequences, the majority of
which are conceptual translations from DNA sequences, although a small
number of the amino acid sequences are derived using peptide
sequencing techniques.
How to search GenBank
• There are two ways to search for
sequences in GenBank.
• One is using text-based keywords
similar to a PubMed search.
• The other is using molecular sequences
to search by sequence similarity using
BLAST.
GenBank Sequence Format
• To search GenBank effectively using the text-based method requires an
understanding of the GenBank sequence format.
• GenBank is a relational database. However, the search output for
sequence files is produced as flat files for easy reading.
• The resulting flat files contain three sections; Header, Features, and
Sequence entry.
• There are many fields in the Header and Features sections. Each field has
an unique identifier for easy indexing by computer software.
• Understanding the structure of the GenBank files helps in designing
effective search strategies.
1st section…Header Part
• The line, “DEFINITION,” provides the summary information for the
sequence record including the name of the sequence, the name and
taxonomy of the source organism if known, and whether the sequence is
complete or partial.
• This is followed by an accession number for the sequence, which is a
unique number assigned to a piece of DNA when it was first submitted to
GenBank and is permanently associated with that sequence.
• This is the number that should be cited in publications. It has two different
formats: two letters with five digits or one letter with six digits.
Continue…
• For a nucleotide sequence that has been translated into a protein sequence, a new
“accession number” is given in the form of a string of alphanumeric characters.
• In addition to the accession number, there is also a version number and a gene
index (gi) number. The purpose of these numbers is to identify the current version
of the sequence.
• If the sequence annotation is revised at a later date, the accession number
remains the same, but the version number is incremented as is the gi number.
• A translated protein sequence also has a different gi number from the DNA
sequence it is derived from.
Continue…
• The next line in the Header section is the “ORGANISM” field,
which includes the source of the organism with the scientific
name of the species and sometimes the tissue type.
• Along with the scientific name is the information of taxonomic
classification of the organism.
• Different levels of the classification are hyperlinked to the
NCBI taxonomy database with more detailed descriptions.
Continue…
• This is followed by the “REFERENCE” field, which provides the publication citation
related to the sequence entry.
• The REFERENCE part includes author and title information of the published work
(or tentative title for unpublished work).
• The “JOURNAL” field includes the citation information as well as the date of
sequence submission.
• The citation is often hyperlinked to the PubMed record for access to the original
literature information.
• The last part of the Header is the contact information of the sequence submitter.
2nd section…Features
• The “Features” section includes annotation information about the gene and gene product, as
well as regions of biological significance reported in the sequence, with identifiers and
qualifiers.
• The “Source” field provides the length of the sequence, the scientific name of the organism,
and the taxonomy identification number. Some optional information includes the clone
source, the tissue type and the cell line.
• The “gene” field is the information about the nucleotide coding sequence and its name. For
DNA entries, there is a “CDS” field, which is information about the boundaries of the
sequence that can be translated into amino acids.
• For eukaryotic DNA, this field also contains information of the locations of exons and
translated protein sequences is entered.
3rd section…Sequence
• The third section of the flat file is the sequence itself starting with the
label “ORIGIN”.
• The format of the sequence display can be changed by choosing options at
a Display pull-down menu at the upper left corner.
• For DNA entries, there is a BASE COUNT report that includes the numbers
of A, G, C, and T in the sequence.
• This section, for both DNA or protein sequences, ends with two forward
slashes (the “//” symbol).
Importance
• In retrieving DNA or protein sequences from GenBank, the search can be limited to
different fields of annotation such as “organism,” “accession number,” “authors,”
and “publication date.”
• One can use a combination of the “Limits” and “Preview/Index” options as
described. Alternatively, a number of search qualifiers can be used, each defining
one of the fields in a GenBank file.
• The qualifiers are similar to but not the same as the field tags in PubMed. For
example, in GenBank, [GENE] represents field for gene name, [AUTH] for author
name, and [ORGN] for organism name.
• Frequently used GenBank qualifiers, which have to be in uppercase and in brackets
Alternative sequence Formats
• In bioinformatics, FASTA format is a text-based format for representing
either nucleotide sequences or peptide sequences, in which nucleotides or amino
acids are represented using single-letter codes.
• FASTA is one of the simplest and the most popular sequence formats because it
contains plain sequence information that is readable by many bioinformatics
analysis programs.
• It has a single definition line that begins with a right angle bracket (>) followed by a
sequence name.
• Sometimes, extra information such as gi number or comments can be given, which
are separated from the sequence name by a “|” symbol.
FASTA Format Sequence
>E01306.1 DNA encoding human insulin-like growth factor I(IGFI)
GAATTCTAACGGTCCCGAAACTCTGTGCGGTG TGAATGGTTGACGCTCTGCAG
TTGTTTGCGGTGACCGTGGTTTTTATTTTAACAAACCCACTGGTTATGGTTCTT
TTCTCGTCGTGCTCCCCAGACTGGTATTGTTGA GAATGCTGCTTTCGTTCTTG
GACCTGCGTCGTCTGGAAATGTATTGCGCTCCCCTGAAACCCGC
• The extra information is considered optional and is ignored by sequence analysis
programs.
• The plain sequence in standard one-letter symbols starts in the second line.
• Each line of sequence data is limited to sixty to eighty characters in width.
• The drawback of this format is that much annotation information is lost.
Gen bank databases

Weitere ähnliche Inhalte

Was ist angesagt? (20)

UniProt
UniProtUniProt
UniProt
 
NCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology InformationNCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology Information
 
Proteins databases
Proteins databasesProteins databases
Proteins databases
 
Introduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbjIntroduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbj
 
Ddbj
DdbjDdbj
Ddbj
 
Fasta
FastaFasta
Fasta
 
Tools and database of NCBI
Tools and database of NCBITools and database of NCBI
Tools and database of NCBI
 
Primary and secondary database
Primary and secondary databasePrimary and secondary database
Primary and secondary database
 
Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahu
 
EMBL- European Molecular Biology Laboratory
EMBL- European Molecular Biology LaboratoryEMBL- European Molecular Biology Laboratory
EMBL- European Molecular Biology Laboratory
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
PIR- Protein Information Resource
 
MULTIPLE SEQUENCE ALIGNMENT
MULTIPLE  SEQUENCE  ALIGNMENTMULTIPLE  SEQUENCE  ALIGNMENT
MULTIPLE SEQUENCE ALIGNMENT
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databases
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Database in bioinformatics
Database in bioinformaticsDatabase in bioinformatics
Database in bioinformatics
 
EMBL
EMBLEMBL
EMBL
 
Scoring matrices
Scoring matricesScoring matrices
Scoring matrices
 
Cath
CathCath
Cath
 
Protein data bank
Protein data bankProtein data bank
Protein data bank
 

Ähnlich wie Gen bank databases

BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234alizain9604
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxRanjan Jyoti Sarma
 
Sequencedatabases
SequencedatabasesSequencedatabases
SequencedatabasesAbhik Seal
 
Data retreival system
Data retreival systemData retreival system
Data retreival systemShikha Thakur
 
BioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsBioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsAyeshaYousaf20
 
biological databases.pptx
biological databases.pptxbiological databases.pptx
biological databases.pptxscience lover
 
Rap db(rice annotation project data base)
Rap db(rice annotation project data base)Rap db(rice annotation project data base)
Rap db(rice annotation project data base)PrajaktaKale17
 
Apollo annotation guidelines for i5k projects Diaphorina citri
Apollo annotation guidelines for i5k projects Diaphorina citriApollo annotation guidelines for i5k projects Diaphorina citri
Apollo annotation guidelines for i5k projects Diaphorina citriMonica Munoz-Torres
 
database retrival.pdf
database retrival.pdfdatabase retrival.pdf
database retrival.pdfSrimathideviJ
 
Databases_CSS2.pptx
Databases_CSS2.pptxDatabases_CSS2.pptx
Databases_CSS2.pptxSilpa87
 
Submitting DNA sequences to the databases, SEQUIN.pptx
Submitting DNA sequences to the databases, SEQUIN.pptxSubmitting DNA sequences to the databases, SEQUIN.pptx
Submitting DNA sequences to the databases, SEQUIN.pptxVed Gharat
 

Ähnlich wie Gen bank databases (20)

Databases_L2.pptx
Databases_L2.pptxDatabases_L2.pptx
Databases_L2.pptx
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Genomic Databases-.pptx
Genomic Databases-.pptxGenomic Databases-.pptx
Genomic Databases-.pptx
 
Biological databases
Biological databasesBiological databases
Biological databases
 
BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptx
 
Data Retrieval Systems
Data Retrieval SystemsData Retrieval Systems
Data Retrieval Systems
 
Introduction to databases.pptx
Introduction to databases.pptxIntroduction to databases.pptx
Introduction to databases.pptx
 
Sequencedatabases
SequencedatabasesSequencedatabases
Sequencedatabases
 
Data retreival system
Data retreival systemData retreival system
Data retreival system
 
Intro to databases
Intro to databasesIntro to databases
Intro to databases
 
BioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsBioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomics
 
biological databases.pptx
biological databases.pptxbiological databases.pptx
biological databases.pptx
 
Rap db(rice annotation project data base)
Rap db(rice annotation project data base)Rap db(rice annotation project data base)
Rap db(rice annotation project data base)
 
Apollo annotation guidelines for i5k projects Diaphorina citri
Apollo annotation guidelines for i5k projects Diaphorina citriApollo annotation guidelines for i5k projects Diaphorina citri
Apollo annotation guidelines for i5k projects Diaphorina citri
 
database retrival.pdf
database retrival.pdfdatabase retrival.pdf
database retrival.pdf
 
Databases_CSS2.pptx
Databases_CSS2.pptxDatabases_CSS2.pptx
Databases_CSS2.pptx
 
Protein database
Protein  databaseProtein  database
Protein database
 
Important protein databases and proteomics softwares
Important protein databases and proteomics softwaresImportant protein databases and proteomics softwares
Important protein databases and proteomics softwares
 
Submitting DNA sequences to the databases, SEQUIN.pptx
Submitting DNA sequences to the databases, SEQUIN.pptxSubmitting DNA sequences to the databases, SEQUIN.pptx
Submitting DNA sequences to the databases, SEQUIN.pptx
 

Mehr von Hafiz Muhammad Zeeshan Raza

Car manufacturing is a complex and fascinating industry that plays a signific...
Car manufacturing is a complex and fascinating industry that plays a signific...Car manufacturing is a complex and fascinating industry that plays a signific...
Car manufacturing is a complex and fascinating industry that plays a signific...Hafiz Muhammad Zeeshan Raza
 
Experience of New Graduate Nurses Feeling Not Ready for Professional Role on ...
Experience of New Graduate Nurses Feeling Not Ready for Professional Role on ...Experience of New Graduate Nurses Feeling Not Ready for Professional Role on ...
Experience of New Graduate Nurses Feeling Not Ready for Professional Role on ...Hafiz Muhammad Zeeshan Raza
 
TO ANALYZE THE ROLE OF RURAL WOMAN'S TO ENSURE CHILD NUTRITION IN DISTRICT RA...
TO ANALYZE THE ROLE OF RURAL WOMAN'S TO ENSURE CHILD NUTRITION IN DISTRICT RA...TO ANALYZE THE ROLE OF RURAL WOMAN'S TO ENSURE CHILD NUTRITION IN DISTRICT RA...
TO ANALYZE THE ROLE OF RURAL WOMAN'S TO ENSURE CHILD NUTRITION IN DISTRICT RA...Hafiz Muhammad Zeeshan Raza
 
Quality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained withQuality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained withHafiz Muhammad Zeeshan Raza
 
DNA transcription & Post Transcriptional Modification
DNA transcription & Post Transcriptional ModificationDNA transcription & Post Transcriptional Modification
DNA transcription & Post Transcriptional ModificationHafiz Muhammad Zeeshan Raza
 

Mehr von Hafiz Muhammad Zeeshan Raza (13)

Car manufacturing is a complex and fascinating industry that plays a signific...
Car manufacturing is a complex and fascinating industry that plays a signific...Car manufacturing is a complex and fascinating industry that plays a signific...
Car manufacturing is a complex and fascinating industry that plays a signific...
 
Experience of New Graduate Nurses Feeling Not Ready for Professional Role on ...
Experience of New Graduate Nurses Feeling Not Ready for Professional Role on ...Experience of New Graduate Nurses Feeling Not Ready for Professional Role on ...
Experience of New Graduate Nurses Feeling Not Ready for Professional Role on ...
 
TO ANALYZE THE ROLE OF RURAL WOMAN'S TO ENSURE CHILD NUTRITION IN DISTRICT RA...
TO ANALYZE THE ROLE OF RURAL WOMAN'S TO ENSURE CHILD NUTRITION IN DISTRICT RA...TO ANALYZE THE ROLE OF RURAL WOMAN'S TO ENSURE CHILD NUTRITION IN DISTRICT RA...
TO ANALYZE THE ROLE OF RURAL WOMAN'S TO ENSURE CHILD NUTRITION IN DISTRICT RA...
 
OMANTEL
OMANTELOMANTEL
OMANTEL
 
Quality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained withQuality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained with
 
Cell organelles
Cell organellesCell organelles
Cell organelles
 
Human genome project
Human genome projectHuman genome project
Human genome project
 
Translation & Post Translational Modifications
Translation & Post Translational ModificationsTranslation & Post Translational Modifications
Translation & Post Translational Modifications
 
DNA transcription & Post Transcriptional Modification
DNA transcription & Post Transcriptional ModificationDNA transcription & Post Transcriptional Modification
DNA transcription & Post Transcriptional Modification
 
Recombinant DNA technology
Recombinant DNA technologyRecombinant DNA technology
Recombinant DNA technology
 
Restriction Fragment Length Polymorphism (RFLP)
Restriction Fragment Length Polymorphism (RFLP)Restriction Fragment Length Polymorphism (RFLP)
Restriction Fragment Length Polymorphism (RFLP)
 
Mendeley software beginers
Mendeley software beginersMendeley software beginers
Mendeley software beginers
 
Bioinformatics introduction
Bioinformatics introductionBioinformatics introduction
Bioinformatics introduction
 

Kürzlich hochgeladen

Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 

Kürzlich hochgeladen (20)

Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 

Gen bank databases

  • 2. Overview • Introduction • Sections of Database • Importance of GenBank
  • 3. Historical background • The first major bioinformatics project was undertaken by Margaret Dayhoff in 1965, who developed a first protein sequence database called Atlas of Protein Sequence and Structure. • Subsequently, in the early 1970s, the Brookhaven National Laboratory established the Protein Data Bank for archiving three-dimensional protein structures. • The first sequence alignment algorithm was developed by Needleman and Wunsch in 1970. This was a fundamental step in the development of the field of bioinformatics, which paved the way for the routine sequence comparisons and database searching practiced by modern biologists. • The 1980s saw the establishment of GenBank and the development of fast database searching algorithms such as FASTA by William Pearson and BLAST by Stephen Altschul and coworkers.
  • 4. Introduction • GenBank is the most complete collection of annotated nucleic acid sequence data for almost every organism. • The content includes genomic DNA, mRNA, cDNA, ESTs, high throughput raw sequence data, and sequence polymorphisms. • There is also a GenPept database for protein sequences, the majority of which are conceptual translations from DNA sequences, although a small number of the amino acid sequences are derived using peptide sequencing techniques.
  • 5. How to search GenBank • There are two ways to search for sequences in GenBank. • One is using text-based keywords similar to a PubMed search. • The other is using molecular sequences to search by sequence similarity using BLAST.
  • 6. GenBank Sequence Format • To search GenBank effectively using the text-based method requires an understanding of the GenBank sequence format. • GenBank is a relational database. However, the search output for sequence files is produced as flat files for easy reading. • The resulting flat files contain three sections; Header, Features, and Sequence entry. • There are many fields in the Header and Features sections. Each field has an unique identifier for easy indexing by computer software. • Understanding the structure of the GenBank files helps in designing effective search strategies.
  • 7.
  • 8. 1st section…Header Part • The line, “DEFINITION,” provides the summary information for the sequence record including the name of the sequence, the name and taxonomy of the source organism if known, and whether the sequence is complete or partial. • This is followed by an accession number for the sequence, which is a unique number assigned to a piece of DNA when it was first submitted to GenBank and is permanently associated with that sequence. • This is the number that should be cited in publications. It has two different formats: two letters with five digits or one letter with six digits.
  • 9. Continue… • For a nucleotide sequence that has been translated into a protein sequence, a new “accession number” is given in the form of a string of alphanumeric characters. • In addition to the accession number, there is also a version number and a gene index (gi) number. The purpose of these numbers is to identify the current version of the sequence. • If the sequence annotation is revised at a later date, the accession number remains the same, but the version number is incremented as is the gi number. • A translated protein sequence also has a different gi number from the DNA sequence it is derived from.
  • 10. Continue… • The next line in the Header section is the “ORGANISM” field, which includes the source of the organism with the scientific name of the species and sometimes the tissue type. • Along with the scientific name is the information of taxonomic classification of the organism. • Different levels of the classification are hyperlinked to the NCBI taxonomy database with more detailed descriptions.
  • 11. Continue… • This is followed by the “REFERENCE” field, which provides the publication citation related to the sequence entry. • The REFERENCE part includes author and title information of the published work (or tentative title for unpublished work). • The “JOURNAL” field includes the citation information as well as the date of sequence submission. • The citation is often hyperlinked to the PubMed record for access to the original literature information. • The last part of the Header is the contact information of the sequence submitter.
  • 12. 2nd section…Features • The “Features” section includes annotation information about the gene and gene product, as well as regions of biological significance reported in the sequence, with identifiers and qualifiers. • The “Source” field provides the length of the sequence, the scientific name of the organism, and the taxonomy identification number. Some optional information includes the clone source, the tissue type and the cell line. • The “gene” field is the information about the nucleotide coding sequence and its name. For DNA entries, there is a “CDS” field, which is information about the boundaries of the sequence that can be translated into amino acids. • For eukaryotic DNA, this field also contains information of the locations of exons and translated protein sequences is entered.
  • 13. 3rd section…Sequence • The third section of the flat file is the sequence itself starting with the label “ORIGIN”. • The format of the sequence display can be changed by choosing options at a Display pull-down menu at the upper left corner. • For DNA entries, there is a BASE COUNT report that includes the numbers of A, G, C, and T in the sequence. • This section, for both DNA or protein sequences, ends with two forward slashes (the “//” symbol).
  • 14. Importance • In retrieving DNA or protein sequences from GenBank, the search can be limited to different fields of annotation such as “organism,” “accession number,” “authors,” and “publication date.” • One can use a combination of the “Limits” and “Preview/Index” options as described. Alternatively, a number of search qualifiers can be used, each defining one of the fields in a GenBank file. • The qualifiers are similar to but not the same as the field tags in PubMed. For example, in GenBank, [GENE] represents field for gene name, [AUTH] for author name, and [ORGN] for organism name. • Frequently used GenBank qualifiers, which have to be in uppercase and in brackets
  • 15. Alternative sequence Formats • In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. • FASTA is one of the simplest and the most popular sequence formats because it contains plain sequence information that is readable by many bioinformatics analysis programs. • It has a single definition line that begins with a right angle bracket (>) followed by a sequence name. • Sometimes, extra information such as gi number or comments can be given, which are separated from the sequence name by a “|” symbol.
  • 16. FASTA Format Sequence >E01306.1 DNA encoding human insulin-like growth factor I(IGFI) GAATTCTAACGGTCCCGAAACTCTGTGCGGTG TGAATGGTTGACGCTCTGCAG TTGTTTGCGGTGACCGTGGTTTTTATTTTAACAAACCCACTGGTTATGGTTCTT TTCTCGTCGTGCTCCCCAGACTGGTATTGTTGA GAATGCTGCTTTCGTTCTTG GACCTGCGTCGTCTGGAAATGTATTGCGCTCCCCTGAAACCCGC • The extra information is considered optional and is ignored by sequence analysis programs. • The plain sequence in standard one-letter symbols starts in the second line. • Each line of sequence data is limited to sixty to eighty characters in width. • The drawback of this format is that much annotation information is lost.