BIOINFORMATICS
SEQUENCE FILE
FORMATS
Presented By: Alphy Joseph
Date: 03 March 2016
Important file formats
•GenBank
•FASTA
•PIR
•ALN/ClustalW2
•GCG/MSF
Early Data Formats
•Early sequence databases stored each
sequence in a separate file. The file held
the sequence as ASCII (plain) text and
had a descriptive filename.
•This method became limiting when
researchers wanted to include
annotations and information about the
source of the sequence.
•Searching for sequences was also
difficult.
Flat File Storage Data
Formats
•By the time GenBank, EMBL and DDBJ
formed a collaboration (1986),
sequence databases had moved to a
defined flat file format with a shared
feature table format and annotation
standards.
•The PIR also adopted a similar format
for protein sequences.
•The flat file formats from the
sequence databases are still used to
access and display sequence and
annotation. They are also convenient
for storage of local copies.
FASTA Format
•Bioinformaticians use a standard text
format for nucleotide and protein
sequences that can be read by a wide
range of programs. This format is
called FASTA format.
•In FASTA format, each nucleotide or
amino acid is represented by a
single letter.
•The first line of a FASTA record is the
description (header) line, identified by a
leading greater-than symbol '>'. This line
identifies the sequence and typically
includes the accession number from NCBI,
GenBank or another repository.
•The remaining lines contain the
sequence, usually wrapped at a fixed
width (for example, 80 or 120
characters per line).
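The FASTA layout described above (a '>' header line followed by wrapped sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a reference implementation: the function names `parse_fasta` and `format_fasta` and the two demo records are made up for illustration, and the default width of 80 is simply one of the widths mentioned on the slide.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if header is not None:
        yield header, "".join(chunks)


def format_fasta(header: str, sequence: str, width: int = 80) -> str:
    """Write one record, wrapping the sequence at `width` characters per line."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + rows)


# Hypothetical demo data, not real database entries.
example = """>demo_1 hypothetical nucleotide record
ACGTACGTAC
GTACGT
>demo_2 hypothetical protein record
MKTAYIAKQR
"""

records = list(parse_fasta(example.splitlines()))
for name, seq in records:
    print(name, len(seq))
```

The parser joins wrapped lines back into one string, so the line width used when the file was written does not affect the result.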
PIR Format
•A sequence in PIR format consists of:
–One line starting with
•a ">" (greater-than) sign, followed
by
•a two-letter code describing the
sequence type (P1, F1, DL, DC, RL,
RC, or XX), followed by
•a semicolon, followed by
•the sequence identification
–One line containing a textual
description of the sequence.
–One or more lines containing the
sequence itself. The end of the
sequence is marked by a "*"
(asterisk) character.
–Optionally, this can be followed by
one or more lines describing the
sequence. Software that is
supposed to read only the sequence
should ignore these.
•A file in PIR format may comprise
more than one sequence.
•The PIR format is also often referred
to as the NBRF format.
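The PIR record structure listed above (type code and identifier line, description line, sequence terminated by "*", optional trailing text) can be sketched as a small state machine. The function name `parse_pir`, the dictionary keys and the demo records are assumptions for illustration only; the sketch also assumes that trailing comment lines do not themselves look like a ">XX;" header.

```python
from typing import Dict, Iterable, List, Optional


def parse_pir(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse PIR/NBRF records into dicts with type, id, description, sequence."""
    records: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    state = "idle"  # idle -> description -> sequence -> idle
    for raw in lines:
        line = raw.rstrip("\n")
        # Header: '>' + two-letter type code + ';' + identifier
        if line.startswith(">") and len(line) > 3 and line[3] == ";":
            current = {
                "type": line[1:3],
                "id": line[4:].strip(),
                "description": "",
                "sequence": "",
            }
            records.append(current)
            state = "description"
        elif state == "description" and current is not None:
            current["description"] = line.strip()
            state = "sequence"
        elif state == "sequence" and current is not None:
            text = "".join(line.split())  # drop spaces inside sequence lines
            if "*" in text:
                # The asterisk marks the end of the sequence.
                current["sequence"] += text.split("*", 1)[0]
                state = "idle"
            else:
                current["sequence"] += text
        # Lines seen while idle are trailing comments and are ignored.
    return records


# Hypothetical demo data, not real PIR entries.
example = """>P1;DEMO1
hypothetical demo protein
MKTAYIAKQR QISFVKSHFS
RQLEERLGLI*
This trailing comment line is ignored.
>DL;DEMO2
hypothetical linear DNA fragment
ACGTACGT*
"""

pir_records = parse_pir(example.splitlines())
for rec in pir_records:
    print(rec["type"], rec["id"], len(rec["sequence"]))
```

Because each record carries its own header and terminator, the same loop handles files that contain more than one sequence.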
ALN/ClustalW
• The first line in the file must start with
"CLUSTAL W" or "CLUSTALW". Other
information in the first line is ignored.
• One or more empty lines.
• One or more blocks of sequence data. Each
block consists of:
– One line for each sequence in the alignment.
Each line consists of:
•the sequence name
•white space
•up to 60 sequence symbols.
•optional - white space followed by a cumulative
count of residues for the sequences
– A line showing the degree of
conservation for the columns of the
alignment in this block.
– One or more empty lines
•Some rules about representing
sequences:
•Case doesn't matter.
•Sequence symbols should be from a
valid alphabet.
•Gaps are represented using hyphens
("-").
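The block structure above can be read by concatenating, per sequence name, the symbols found in each block. The following is a rough sketch under stated assumptions: `parse_clustal` is a made-up name, the header check accepts any first line beginning with "CLUSTAL" (slightly more lenient than the rule above), and conservation lines are recognised only by the fact that they start with white space.

```python
from typing import Dict


def parse_clustal(text: str) -> Dict[str, str]:
    """Return {sequence name: full aligned sequence} from ALN/Clustal text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a Clustal alignment: missing CLUSTAL header")
    alignment: Dict[str, str] = {}
    for line in lines[1:]:
        # Blank lines separate blocks; lines starting with white space
        # are assumed to be conservation lines. Both are skipped.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, symbols = fields[0], fields[1]
        # An optional third field is the cumulative residue count; ignored here.
        alignment[name] = alignment.get(name, "") + symbols
    return alignment


# Hypothetical two-sequence alignment in two blocks.
example = """CLUSTAL W (1.83) multiple sequence alignment

seqA      ACGT-ACGTA 9
seqB      ACGTTACG-A 9
          **** *** *

seqA      GGC 12
seqB      GGA 12
          **
"""

aln = parse_clustal(example)
for name, seq in aln.items():
    print(name, seq)
```

Gaps are kept as hyphens in the returned strings, so every sequence in the result has the same length as the alignment.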
•The characters used to represent the
degree of conservation are:
* - all residues or nucleotides in that
column are identical
: - conserved substitutions have been
observed
. - semi-conserved substitutions have
been observed
(space) - no match.
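As a simplified illustration of the conservation line, the sketch below marks only the two cases that need no extra information: "*" for columns where every symbol is identical and a space otherwise. Deciding between ":" and "." requires residue similarity groups that these slides do not list, so those marks are deliberately left out. The name `conservation_line` is hypothetical, and treating a column of gaps as "no match" is an assumption.

```python
from typing import Sequence


def conservation_line(sequences: Sequence[str]) -> str:
    """Return '*' for fully identical columns and ' ' for all other columns."""
    if not sequences:
        return ""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*sequences):
        symbols = {c.upper() for c in column}  # case doesn't matter
        if len(symbols) == 1 and "-" not in symbols:
            marks.append("*")
        else:
            marks.append(" ")  # mismatch or gap: no match
    return "".join(marks)


# Hypothetical aligned block of two sequences.
block = ["ACGT-ACGTA", "ACGTTACG-A"]
line = conservation_line(block)
print(repr(line))
```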
GCG/MSF
•MSF-formatted multiple sequence files
are most often created by
programs of the GCG suite.
•MSF files include the sequence name
and the sequence itself, which is
usually aligned with other sequences
in the file.
•You can store a single sequence or
many sequences within an MSF file.
•Some of the hallmarks of an MSF-
formatted file are the same as for a
single-sequence GCG format file:
•It begins with the line (all uppercase)
!!NA_MULTIPLE_ALIGNMENT 1.0
for nucleic acid sequences or
!!AA_MULTIPLE_ALIGNMENT 1.0
for amino acid sequences.
•Do not edit or delete the file type line
if it is present.
•A description line which contains
informative text describing what is in
the file. You can add this information
to the top of the MSF file using a text
editor.
•A dividing line which contains the
number of bases or residues in the
sequence, when the file was created,
and importantly, two dots (..) which
act as a divider between the
descriptive information and the
sequence information that follows.
•MSF files contain some other
information as well:
•Name/Weight: The name of each
sequence included in the alignment, as
well as its length and checksum (both
non-editable) and weight (editable).
•Separating Line: Must include two
slashes (//) to divide the name/weight
information from the sequence
alignment.
•Multiple Sequence Alignment: Each
sequence named in the above
Name/Weight lines is included. The
alignment allows you to view the
relationship among sequences.
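The MSF parts described above (Name/Weight lines, the // separator, then the alignment) can be pulled apart as follows. This is a hedged sketch: `parse_msf` is a made-up name, the regular expression assumes Name lines of the form "Name: ... Len: ... Check: ... Weight: ...", the separator is assumed to sit on a line of its own, and the checksum values in the demo file are placeholders, not computed checksums.

```python
import re
from typing import Dict

# Assumed layout of a Name/Weight line.
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*(\S+)"
)


def parse_msf(text: str) -> Dict[str, dict]:
    """Return {name: {len, check, weight, sequence}} from MSF-formatted text."""
    lines = text.splitlines()
    split_at = next(
        (i for i, line in enumerate(lines) if line.strip() == "//"), None
    )
    if split_at is None:
        raise ValueError("missing // separator line")
    entries: Dict[str, dict] = {}
    # Header part: collect the Name/Weight lines.
    for line in lines[:split_at]:
        match = NAME_LINE.search(line)
        if match:
            name, length, check, weight = match.groups()
            entries[name] = {
                "len": int(length),
                "check": int(check),
                "weight": float(weight),
                "sequence": "",
            }
    # Alignment part: append each chunk to the sequence it belongs to.
    for line in lines[split_at + 1:]:
        fields = line.split()
        if fields and fields[0] in entries:
            entries[fields[0]]["sequence"] += "".join(fields[1:])
    return entries


# Hypothetical demo file; Check values are placeholders.
example = """!!AA_MULTIPLE_ALIGNMENT 1.0
 demo.msf  MSF: 12  Type: P  Check: 0  ..

 Name: seqA  Len: 12  Check: 1234  Weight: 1.00
 Name: seqB  Len: 12  Check: 5678  Weight: 1.00

//

seqA  MKTAYIAKQR
seqB  MKTAYLAKQR

seqA  QI
seqB  QI
"""

msf = parse_msf(example)
for name, entry in msf.items():
    print(name, entry["len"], entry["sequence"])
```

Only names declared in the Name/Weight lines are collected from the alignment part, which keeps ruler or position lines out of the result.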
THANK YOU
