BITS: Basics of sequence databases

Basic bioinformatics concepts,
databases and tools

Introduction to the training
and Sequence databases

Joachim Jacob
http://www.bits.vib.be

Updated 22 February 2012
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod1-intro_H1_2012_SeqDBs.pdf

Scope
Introductory training to Bioinformatics

Exploring and understanding
databases and software
for everyday bioinformatics use

If there is any term which is unclear,
please stop me and ask me!

Bioinformatics ...

Bio
all data is derived from living samples

Informatics
that data is stored and analyzed in and with computers to obtain
understanding

An extremely broad description, from which we will
nevertheless extract common principles during the course

Bioinformatics is present in every aspect
of life sciences research

Bioinformatics ...

Bio
- different types of living samples
Informatics
- storing and categorizing the information
and making it easily accessible
- interpreting that information reliably

Bioinformatics … and its companion

Bio
- different types of living samples
Informatics
- storing and categorizing the information
and making it easily accessible
- interpreting that information reliably
Statistics
- large numbers, observational data

The siblings of Bioinformatics
Based on the biological component extracted from life, the
measured properties and the ultimate goal of the
analysis, different sub-disciplines of bioinformatics exist.

DNA RNA proteins metabolites
Genomics
Transcriptomics
Proteomics
Metabolomics

Epigenomics Structural bioinformatics
Systems biology Microbiomics Interactomics
Metagenomics Functional genomics Comparative gx

Mere data is worth nothing

Data = symbols
CGCTACGCATATCGCT

Information = data that are processed to be useful; provides answers to
"who", "what", "where", and "when" questions. Also called metadata.
- Dasypus novemcinctus
- found in my garden
- Part of genome
- sequenced in June 2010

Knowledge: application of data and information; answers "how" questions
This species seems to be related to my neighbor's pet,
because it also has this sequence

Understanding: appreciation of "why"
Has the same mother

Wisdom

http://www.systems-thinking.org/dikw/dikw.htm

Life sciences research as major 'end user' for the bioinformatics tools
('tool user'): from data (?) to knowledge and conclusions (!)

Tools and approaches

Bioinformatics research, as a specific branch on the boundary of life
science, mathematics and computer science ('tool manufacturer'):
Biology, Computer, Statistics

This course is organised in several modules

Module 1: Sequence databases: what, where, how
Module 2: Sequence comparisons: searching, aligning
Module 3: Sequence analysis – domains in protein sequences and
predicting functionality, standardisation and useful links
Module 4: Beyond sequences - additional important data sources
Module 5: Genome Browsers - integrating biological data and performing
reproducible bioinformatics research in the Galaxy

One tip for the future

Be prepared for change...
Information is fluid
So are bioinfo tools

Learn how to accommodate for change
Major resources are more stable
Important concepts do not change often

Module 1

Sequence databases

Module 1: Sequence databases

Sequence databases store DNA and RNA sequences. In
bioinformatics, they are (still) by far the largest
collections of biological data, and they are used by many
subdisciplines of bioinformatics.

http://www.ebi.ac.uk/embl/Services/DBStats/

... and growing

http://www.ebi.ac.uk/embl/Services/DBStats/

Three major nucleotide databanks host primary
sequence data
European Nucleotide Archive (ENA) at EBI - http://www.ebi.ac.uk/
Division EMBL-bank (European Molecular Biology Laboratory) (single)
Trace Archive
SRA Archive

GenBank at NCBI - http://www.ncbi.nlm.nih.gov/
maintained at NCBI (National Center for Biotechnology Information, USA)

DDBJ (DNA Data Bank of Japan) - http://www.ddbj.nig.ac.jp/
maintained at NIG/CIB (National Institute of Genetics, Center for
Information Biology, Mishima, Japan)

These databases are filled with NA sequence
information by scientists and consortia
Large-scale sequencing projects, individual scientists and patent offices
submit sequences (ACTGCTGCTA GCTAGCTGAT CTATGCTAGC TGTAGCTGAG)
to the primary sequence database

Primary sequence data: each primary sequence = one experiment

Basically, all 'source' nucleotide material
Jennifer McDowall - http://www.biotnet.org/training-materials/nucleotide-sequence-databases-ena

Primary NA sequence can be produced by
Sanger-based technologies or NGS technologies

Sanger
Low output in number of seqs, high quality, 400-850 bp.
Read profiles in .abi format. Stored in Trace Archive.

NGS
Different technologies. Extremely high output rate, low
quality, 30 bp - 600 bp. Reads in .fastq format, stored in
the SRA.

sample → DNA, or sample → RNA → (RT) → cDNA

These techniques can only read DNA strands,
so RNA first needs to be converted to cDNA
with reverse transcriptases prior to loading onto
the machines.
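The .fastq format mentioned above stores each read as four lines: an identifier, the sequence, a separator and one quality character per base. The following is a minimal sketch of how such records could be read, assuming the common Phred+33 quality encoding (an assumption, not stated on the slide); the function name and the example read are illustrative only.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality_scores) from four-line FASTQ records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Assumption: Phred+33 encoding, so score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


# Made-up example read, for illustration only
example = """@read1
ACGT
+
II5!
"""

reads = list(parse_fastq(example))
for read_id, seq, scores in reads:
    print(read_id, seq, scores)
```

A higher score means a more reliable base call, which is why the quality line matters for the lower-quality NGS reads.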

Sanger overview: http://www.bio.davidson.edu/Courses/Molbio/MolStudents/spring2003/Obenrader
NGS overview: http://seqanswers.com/forums/showthread.php?t=3561

Overview major DNA reading technologies

Dennis Wall, NGS Data Analysis and Computation I course, Wall Lab

In the primary sequence dbs a distinction
can be made between two major categories
High quality single submission (Sanger)
- gene sequence (genomic - 'STD' data class)
- mRNA sequence (via cDNA - 'STD')
- BAC/YAC/cosmid sequences
- genome sequencing projects (contigs,
assemblies, WGS)
- genome markers, STS (sequence tagged
sites, unique short sequences from a
genome)

Low quality batch submissions
- Expressed Sequence Tags (EST)
- Genome Survey Sequences (GSS)
- high-throughput sequence data (e.g. NGS)
http://www.ebi.ac.uk/ena/about/formats

The batch submissions originate mostly from
sequencing centers
Large-scale sequencing projects:
chromosome → fragment → sequencing library
→ sequence reads, e.g. whole genome shotgun (submission)
→ assemble sequence (submission)
→ annotation, e.g. cyp30, cyp309, insv, cg343 (submission)

Each primary database stores their sequences
and batch submissions in their own way...
- NCBI: ESTs are stored in dbEST (separate database)
- ENA: ESTs are part of EMBL-bank in 'EST' data class

Similar for GSS (see dbGSS at NCBI)

ESTs: expressed sequence tags, often partial sequences
derived from RNA in batch. See example:
sample → RNA → RNA-seq →
>est1
ATCGACTAGCATCA
>est2
TCGACTAGCGACTA
>est3
CAGCATCATCGAC

http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt

Batch submissions are marked and/or stored
differently than single submissions
Data class ESTs are also batch submissions

ENA structure
TIER            CLASS                                 TYPE
ENA-Annotation  Feature annotation                    1) EMBL-Bank
ENA-Assembly    Assembly information
ENA-Reads       Sequencing and sampling information   2) Trace Archive - Raw data (capillary sequencing)
                                                      3) Sequence Read Archive - Raw data (Next Gen sequencing)
Batch submissions: Trace Archive and Sequence Read Archive

The 'normal' submissions are a minority in
primary sequence databases

http://www.ebi.ac.uk/ena/about/statistics#embl_bases_per_dataclass

Primary sequence dbs are synchronised and
every sequence receives a unique identifier
All database maintainers assign and share a unique accession number (AC) for each
sequence, in addition to their own ID number (info at NCBI). Sequences can get updated;
the accession number is then extended with a version number that increases with each update, e.g. .1 (see SVA)
Example of acc number: BC010109.2
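As a small illustration of the accession.version notation, the sketch below splits the example accession number into its stable accession and its version. The function name is hypothetical.

```python
def split_accession(versioned):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when no version suffix is present.
    """
    accession, sep, version = versioned.partition(".")
    return accession, (int(version) if sep else None)


print(split_accession("BC010109.2"))
print(split_accession("BC010109"))
```

Searching with the bare accession is useful when any version will do, while the versioned form pins down one exact sequence.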

http://www.insdc.org/
International Nucleotide Sequence Database Collaboration:
GenBank, DDBJ and ENA (+ SRA), synchronized daily
Collaboration on features, taxonomy, ...

All use the same
- Accession Ids
- Project Ids
- Feature tables (see later)

http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)

One sequence entry contains three categories
of different types of information

1. Info about sequence, submitters and literature (metadata)
2. Annotations of the sequence (metadata related to the seq)
3. Stretch of ATGC / AUGC sequence (the 'data', at the bottom)
- A sequence record is called 'annotated' when biological information is
added and linked to a position in the sequence
- Annotations, also called 'features', are abbreviated as codes, which
can be found in the Feature Tables

http://www.ebi.ac.uk/embl/Documentation/FT_d

This sequence information can be written in
different formats
(plain) Text format, e.g. GenBank
1. General info
- Official shared accession
- GenBank-specific identifier (simply incremented with each new record)
- A lot of different identifiers! (roughly as many as there are databases)
→ conversion tools to translate identifiers are needed (see exercises)

*In humans: the HUGO Nomenclature Committee determines the right gene
name
http://mobyle.pasteur.fr/cgi-bin/portal.py#tutorials::seqfmt

2. Annotation
Labels in the record: Feature name, Qualifier name
db_xref = cross references = links to records of other
databases which are related to this record (see later).
The format is dbname:identifier
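The dbname:identifier format of a db_xref is easy to take apart in a script. A minimal sketch follows; the function name is hypothetical and the two cross-reference values are illustrative examples, not taken from a specific record.

```python
def parse_db_xref(value):
    """Split a db_xref value of the form 'dbname:identifier'."""
    dbname, _, identifier = value.partition(":")
    return dbname, identifier


# Illustrative values only
xrefs = ["taxon:9606", "GeneID:1234"]
parsed = [parse_db_xref(x) for x in xrefs]
print(parsed)
```

Splitting on the first colon only keeps identifiers that themselves contain a colon intact.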

3. Sequence

Each protein sequence receives also an
accession number

Other sequence formats
Fasta (minimal metadata, basically only sequence)
>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGC
CGGACAGCTCTCGAGCGCATCGACGACGAC
ASN.1: Abstract Syntax Notation One
EMBL: all info as in gb, online referred to as 'plain text'
XML
Fastq: sequence info and base 'call' quality

Important
'Format' has nothing to do with the program in which you save your file! You don't
have a choice: it needs to be 'plain text format' (.txt - not a word processor file
such as .doc or .rtf). Wordpad is a good choice for this, provided you save as
plain text. 'Format' in bioinfo is all about how the information is structured and
written down in the plain text file.
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
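Because a Fasta file is plain text with a fixed structure, reading it takes only a few lines of code. The sketch below is one possible reader, using the Fasta example from this slide; the function name and the returned dictionary layout are illustrative choices.

```python
def parse_fasta(text):
    """Return {name: {"description": ..., "sequence": ...}} from Fasta text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: first word is the name, the rest is the description
            name, _, description = line[1:].partition(" ")
            records[name] = {"description": description, "sequence": ""}
        elif name is not None:
            # Sequence lines are concatenated until the next header
            records[name]["sequence"] += line
    return records


fasta_text = """>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGC
CGGACAGCTCTCGAGCGCATCGACGACGAC
"""

records = parse_fasta(fasta_text)
print(records["genename"]["description"])
print(len(records["genename"]["sequence"]))
```

The same idea (a header line followed by data lines) applies to the richer formats, only with more fields to recognise.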

http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt

Degree of annotation differs between entries
Batch submitted sequences are annotated poorly, single
submissions are annotated better

ENA structure (TIER: CLASS, TYPE)
ENA-Annotation: Feature annotation
1) EMBL-Bank: good seq annotations

ENA-Assembly: Assembly information

ENA-Reads: Sequencing and sampling information
2) Trace Archive - Raw data (capillary sequencing)
3) Sequence Read Archive - Raw data (Next Gen sequencing)
Experiment information is of most importance in batch submissions
(e.g. which species, which technique, ...)

SRA contains batch submitted records of which
experiment information is of most importance

Since the sequences are barely (or not) annotated, the
experiment description is important: which machine, which
organism, which tissue, which developmental stage,
disease, treatment, …

How to get sequences into the db, and back out

Submit
Always submit your sequence data (usually
required by journals) and include your ACC
number in articles (not any other number).
- Sequin (GenBank stand alone)
- Bankit (GenBank web tool)
- Webin (EMBL online submission)

Retrieve
One or few sequences
→ Use one of the numerous webbased tools
GenBank: Entrez
EMBL: EB-eye
MRS: developed for easy retrieval
Many sequences (batch retrieval)
→ use ftp (file transfer protocol)
→ use perl (flexible programming language)
→ BioMart
http://www.biomart.org/
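Batch retrieval can be scripted in any language. The slide suggests perl; the sketch below uses Python instead and assumes NCBI's E-utilities `efetch` service, which is not covered in this course text. It only builds the request URL and does not contact the server; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities efetch endpoint (not mentioned on the slide)
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Build an efetch URL that requests several records in one call."""
    params = {
        "db": db,                      # database to query
        "id": ",".join(accessions),    # comma-separated accession numbers
        "rettype": rettype,            # record format to return
        "retmode": "text",             # plain text output
    }
    return EFETCH + "?" + urlencode(params)


url = efetch_url(["BC010109.2"])
print(url)
```

Opening such a URL in a browser or with a download tool would return the record as plain text; for large batches, check the provider's usage guidelines first.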

Example of a primary NA sequence record (ENA)
Text format

Code usable for searching | Data linked to that code


Primary sequence data contains a lot of
redundancy!

Chromosome sequence

Several gene sequences
from different labs

EST sequences
from transcripts

cDNA sequence

All match to the same gene. Often your database search
returns all of these sequences...
A lot of redundancy!

The primary sequences are the basis for
analyses that generate derived sequence data
Scientists/Consortia → primary databases
– Source for further analyses. Which?
• Create protein sequences
• Curate the sequence database
• Assemble genomes
• Searching similarities
• Aggregate information about one gene
• …

Results stored in derived databases

Protein databases come in two kinds

The most important protein db is UniProt and
contains 'automatic' and manual entries
UniProt Knowledge Base - 'the best annotated protein
database of the world'
http://www.uniprot.org/

Refseq - The NCBI way to reduce redundancy in
primary sequence data
RefSeq is NCBI 'Reference Sequences' (prot and nuc)
Redundancy from primary sequence data is reduced both
automatically and by manual annotation of NA and protein
sequences. 'one natural biological molecule = one entry'. Links
back to the original primary sequences. Hugely popular and a
basis for a lot of analyses.

Click to apply
refseq filter in
entrez search

http://www.ncbi.nlm.nih.gov/RefSeq/

RefSeq has its own identifiers, not to be mixed
up with accession numbers
RefSeq entry codes look similar to ACC numbers (but are not ACC numbers: note the
underscore!), and RefSeq records are also in GenBank format. Note: in the 'Features'
section one can find the raw sequences from which the entry was derived. (Typical
mistake: searching with a RefSeq code in UniProt.)
NC_* (curated) complete genomic element (chromosome, plasmid,...)
NT_* (automated) intermediate assembly from BAC
NZ_* (automated) incomplete genomic sequence from WGS
NW_* (automated) intermediate assembly from WGS
NG_* (curated) incomplete genomic element corresponding to gene
NM_* (curated) mRNA
NR_* (curated) non-coding RNA or predicted transcript of pseudogene
NP_* (curated) protein
ZP_* (automated) protein predicted from WGS sequence (NZ_*)
YP_* (curated) other predicted protein sequences from NCBI Genome Annotation Pipeline
XM_* (automated) mRNA
XR_* (automated) non-coding RNA or predicted transcript of pseudogene
XP_* (automated) protein
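The prefix table above can be turned into a small lookup, which also shows how the underscore separates RefSeq codes from ordinary accession numbers. This is a minimal sketch; the function name is hypothetical and the RefSeq identifier in the example is made up for illustration.

```python
# Prefix -> (curation status, molecule type), copied from the list above
REFSEQ_PREFIXES = {
    "NC": ("curated", "complete genomic element"),
    "NT": ("automated", "intermediate assembly from BAC"),
    "NZ": ("automated", "incomplete genomic sequence from WGS"),
    "NW": ("automated", "intermediate assembly from WGS"),
    "NG": ("curated", "incomplete genomic element corresponding to gene"),
    "NM": ("curated", "mRNA"),
    "NR": ("curated", "non-coding RNA or predicted transcript of pseudogene"),
    "NP": ("curated", "protein"),
    "ZP": ("automated", "protein predicted from WGS sequence"),
    "YP": ("curated", "other predicted protein sequences"),
    "XM": ("automated", "mRNA"),
    "XR": ("automated", "non-coding RNA or predicted transcript of pseudogene"),
    "XP": ("automated", "protein"),
}


def classify_refseq(identifier):
    """Return (status, type) for a RefSeq code, or None if it is not one."""
    prefix, sep, _rest = identifier.partition("_")
    if not sep or prefix not in REFSEQ_PREFIXES:
        return None
    return REFSEQ_PREFIXES[prefix]


print(classify_refseq("NM_123456.1"))   # made-up RefSeq-style identifier
print(classify_refseq("BC010109.2"))    # ordinary accession number, no underscore
```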

http://www.ncbi.nlm.nih.gov/RefSeq/key.html
http://www.ncbi.nlm.nih.gov/RefSeq/

UniRef – UniProt redundancy reducing system for
protein sequences

Non redundant protein sequences from
UniProt (~ refseq)
Hiding redundant sequences by clustering them
- UniRef100 = completely identical sequences
- UniRef90 = 90% identical sequences
- UniRef50 = 50% identical sequences
See http://www.uniprot.org/help/uniref
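To make the clustering idea concrete, the toy sketch below groups entries whose sequences are completely identical, which is the principle behind the 100% level. It is not the UniRef procedure itself, which is more involved; the entry names and sequences are made up.

```python
from collections import defaultdict


def cluster_identical(sequences):
    """Group entry names that share exactly the same sequence."""
    clusters = defaultdict(list)
    for name, seq in sequences.items():
        clusters[seq.upper()].append(name)
    # Sort for a stable, readable result
    return sorted(sorted(names) for names in clusters.values())


# Made-up entries: 'a' and 'b' carry the same sequence
entries = {"a": "MKT", "b": "mkt", "c": "MKV"}
clusters = cluster_identical(entries)
print(clusters)
```

Clustering at 90% or 50% identity additionally requires a sequence comparison step, which is the topic of the module on sequence comparisons.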

NCBI's Gene – summarizes gene information
including sequence information from primary dbs
Example of the gene NPR1 from A. thaliana

UniGene – summarizes transcriptomic
information around genes

And a lot more derived databases with
sequence information exist
Repbase :
repeats (Alu, …), maintained by Jerzy Jurka at the Genetic
Information Research Institute (Mountain View CA, USA).
CENSOR server allows to "clean" sequences.
http://www.girinst.org/repbase
MiRBase → published miRNA sequences
http://www.mirbase.org/
Eukaryotic promoter database
http://www.epd.isb-sib.ch/
UniVec
GenBank subset + some sequences from commercial sources -
ftp://ftp.ncbi.nih.gov/pub/UniVec/

The most important sequence databases
overview

Prim seq data   Derived    Curated     Integrated search portals
GB              GenPept    RefSeq      Entrez
ENA             trEMBL     SwissProt   ENA search, EB-eye, UniProt
DDBJ
(trEMBL and SwissProt are both part of UNIPROT)

Common gene annotations on sequences

Genome sequence: e.g. Chr6
Gene sequence: enhancers/promoters, exons, introns, terminator
mRNA: 5'UTR, CDS, 3'UTR, poly(A) tail (AAAAAAAAAAAAA)
protein: translated from the CDS using genetic code tables
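The step from CDS to protein can be sketched with a genetic code table. The example below assumes the standard genetic code (NCBI translation table 1); other organisms and organelles use different tables. The function name and the short CDS are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Assumption: standard genetic code, amino acids listed in TCAG codon order
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a CDS codon by codon; '*' marks a stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA letters
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        protein.append(CODON_TABLE[cds[i:i + 3]])
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Only the CDS is translated; the UTRs and the poly(A) tail are part of the mRNA but not of the protein.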

Searching the database for your gene of interest

First you have to determine for yourself
which information you want

- NA sequences vs. protein sequences
- If NA, genomic sequences, or RNA derived
- All possible sequences that exist, or curated ones
- Protein sequences of which quality
- ...

Entrez is a starting point for searches at NCBI
http://www.ncbi.nlm.nih.gov/sites/gquery

Visualising the db_xrefs in records at NCBI

ENA has its text-search portal
http://www.ebi.ac.uk/ena/

Results from an ENA search are organised
following the ENA database structure

UniProt has a simple search box leading to a
sophisticated search results page

Complex searches can be achieved by using the
index codes in the database
e.g.

“oc=Primates and
de=complete and
de=cds and
de=MHC”

Code usable for searching: oc, de
Could answer: give me all coding sequences
of MHC available in primates.
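What such a query does can be mimicked on a handful of records in memory. In the sketch below every condition must hold (logical AND); the three records are hypothetical, and a simple substring test stands in for the indexed word search that a real portal performs.

```python
# Hypothetical records with an organism classification (oc) and description (de)
records = [
    {"oc": "Eukaryota; Primates; Hominidae", "de": "MHC class I antigen, complete cds"},
    {"oc": "Eukaryota; Rodentia; Muridae", "de": "MHC class II antigen, complete cds"},
    {"oc": "Eukaryota; Primates; Hominidae", "de": "MHC class I antigen, partial cds"},
]


def matches(record, conditions):
    """True if every (field, term) condition is found in the record."""
    return all(term.lower() in record[field].lower() for field, term in conditions)


# The query from the slide: oc=Primates and de=complete and de=cds and de=MHC
query = [("oc", "Primates"), ("de", "complete"), ("de", "cds"), ("de", "MHC")]
hits = [r for r in records if matches(r, query)]
print(len(hits))
```

Each added condition can only narrow the result, which is why combining index codes is an effective way to cut down a long hit list.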

Meta-search tools can search different
sequence databases at once.
MRS
Open Source, developed by Maarten Hekkelman at Radboud U.
(Nijmegen, the Netherlands). Allows searching in different databases at
once, and provides also statistics on the databases.

Alternatives: ACNUC, SRS

Logical operators
Searching involves making combinations of conditions.
The difference between a logical AND, OR and NOT is explained here with
Venn diagrams.

Q1 AND Q2 (&)
Q1 NOT Q2 (!)
Q1 OR Q2 (|)
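The three operators correspond to set operations on the result lists of two queries. A minimal sketch with made-up hit lists, using Python sets:

```python
# Made-up result sets of two queries
q1 = {"seqA", "seqB", "seqC"}
q2 = {"seqB", "seqC", "seqD"}

both = q1 & q2       # Q1 AND Q2: hits found by both queries
either = q1 | q2     # Q1 OR Q2: hits found by at least one query
only_q1 = q1 - q2    # Q1 NOT Q2: hits of Q1 that Q2 does not return

print(sorted(both), sorted(either), sorted(only_q1))
```

AND narrows a search, OR widens it, and NOT removes unwanted hits.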

Hands-on!

Every module ends with an exercise
session.

We will now explore how data is stored in different
sequence databases. You get …. minutes for this
exercise.
Afterwards, we will summarize some of the difficulties
some of you might have experienced.

Summary
This course is organised in several modules
Module 1: Sequence databases
Three major nucleotide databanks host primary sequence data
These databases are filled with NA sequence information by scientists and consortia
The batch submissions originate mostly from sequencing centers
Each primary database stores their sequences and batch submissions in their own way...
Batch submissions are marked and/or stored differently than single submissions
The 'normal' submissions are a minority in primary sequence databases
Primary sequence dbs are synchronised and every sequence receives a unique identifier
One sequence entry contains three categories of different types of information
This sequence information can be written in different formats
Degree of annotation differs between entries
SRA contains batch submitted records of which experiment information is of most importance
How to get sequences into the db, and back out
Primary sequence data contains a lot of redundancy!
The primary sequences are the basis for analyses that generate derived sequence data
Protein databases come in two kinds
The most important protein db is UniProt and contains 'automatic' and manual entries
Refseq - The NCBI way to reduce redundancy in primary sequence data
RefSeq has its own identifiers, not to be mixed up with accession numbers
UniRef – UniProt redundancy reducing system for protein sequences
NCBI's Gene – summarizes gene information including sequence information from primary dbs
UniGene – summarizes transcriptomic information around genes
And a lot more derived databases with sequence information exist
Searching the database for your gene of interest
Entrez is a starting point for searches at NCBI
Visualising the db_xrefs in records at NCBI
ENA has its text-search portal
Results from an ENA search are organised following the ENA database structure
UniProt has a simple search box leading to a sophisticated search results page
Complex searches can be achieved by using the index codes in the database
Meta-search tools can search different sequence databases at once.
Hands-on!
