1. Basic bioinformatics concepts,
databases and tools
Introduction to the training
and Sequence databases
Joachim Jacob
http://www.bits.vib.be
Updated 22 February 2012
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod1-intro_H1_2012_SeqDBs.pdf
2. Scope
Introductory training to Bioinformatics
Exploring and understanding
databases and software
for everyday bioinformatics use
If there is any term which is unclear,
please stop me and ask me!
3. Bioinformatics ...
Bio
all data is derived from living samples
Informatics
that data is stored and analyzed in and with computers to obtain
understanding
An extremely broad description, from which we will
nevertheless extract common principles during the course
13. Bioinformatics ...
Bio
- different types of living samples
Informatics
- storing and categorizing the information
and making it easily accessible
- interpreting that information reliably
14. Bioinformatics … and its companion
Bio
- different types of living samples
Informatics
- storing and categorizing the information
and making it easily accessible
- interpreting that information reliably
Statistics
- large numbers, observational data
15. The siblings of Bioinformatics
Based on the biological component extracted from life, the
measured properties and the ultimate goal of the
analysis, different sub-disciplines of bioinformatics exist.
DNA RNA proteins metabolites
Genomics
Transcriptomics
Proteomics
Metabolomics
Epigenomics Structural bioinformatics
Systems biology Microbiomics Interactomics
Metagenomics Functional genomics Comparative genomics
16. Mere data is worth nothing
Data = symbols
- CGCTACGCATATCGCT
Information = data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions. Also called metadata.
- Dasypus novemcinctus
- found in my garden
- part of genome
- sequenced in June 2010
Knowledge: application of data and information; answers "how" questions
- This species seems to be related to my neighbor's pet, because it also has this sequence
Understanding: appreciation of "why"
- Has the same mother
Wisdom
http://www.systems-thinking.org/dikw/dikw.htm
17. ? !
Life sciences research as major 'end user' for the bioinformatics tools ('tool user')
- data → knowledge and conclusions
Tools and approaches
Bioinformatics research, as a specific branch on the boundary of life science, mathematics and computer science ('tool manufacturer')
- Biology, Computer, Statistics
18. This course is organised in several modules
Module 1: Sequence databases: what, where, how
Module 2: Sequence comparisons: searching, aligning
Module 3: Sequence analysis – domains in protein sequences and
predicting functionality, standardisation and useful links
Module 4: Beyond sequences - additional important data sources
Module 5: Genome Browsers - integrating biological data and performing
reproducible bioinformatics research in the Galaxy
20. One tip for the future
Be prepared for change...
Information is fluid
So are bioinfo tools
Learn how to accommodate change
Major resources are more stable
Important concepts do not change often
22. Module 1: Sequence databases
Sequence databases store DNA and RNA sequences. In
Bioinformatics, they are by far (still) the largest
collections of biological data, and used by many
subdisciplines of bioinformatics.
http://www.ebi.ac.uk/embl/Services/DBStats/
24. Three major nucleotide databanks host primary
sequence data
European Nucleotide Archive (ENA) at EBI - http://www.ebi.ac.uk/
Division EMBL-bank (European Molecular Biology Laboratory) (single)
Trace Archive
SRA Archive
GenBank at NCBI - http://www.ncbi.nlm.nih.gov/
maintained at NCBI (National Center for Biotechnology Information, USA)
DDBJ (DNA Data Bank of Japan) - http://www.ddbj.nig.ac.jp/
maintained at NIG/CIB (National Institute of Genetics, Center for
Information Biology, Mishima, Japan)
25. These databases are filled with NA sequence
information by scientists and consortia
Submitters: large-scale sequencing projects, individual scientists, patent offices
→ primary sequence data (e.g. ACTGCTGCTA GCTAGCTGAT CTATGCTAGC TGTAGCTGAG)
→ primary sequence database
Each primary sequence = one experiment
Basically, all 'source' nucleotide material
Jennifer McDowall - http://www.biotnet.org/training-materials/nucleotide-sequence-databases-ena
26. Primary NA sequence can be produced by
Sanger-based technologies or NGS technologies
Sanger
Low output in number of seqs, high quality, 400-850 bp.
Read profiles in .abi format. Stored in Trace Archive.
NGS
Different technologies. Extremely high output rate, low quality, 30 bp – 600 bp.
Reads in .fastq format, stored in the SRA.
sample → DNA, or sample → RNA → (RT) → cDNA
These techniques can only read DNA strands, so RNA first needs to be converted
to cDNA with reverse transcriptases before it is loaded onto the machines.
Sanger overview: http://www.bio.davidson.edu/Courses/Molbio/MolStudents/spring2003/Obenrader
NGS overview: http://seqanswers.com/forums/showthread.php?t=3561
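To make the .fastq read format a little more concrete, here is a minimal Python sketch that reads 4-line FASTQ records and converts the quality string into per-base Phred scores. The read shown is invented for illustration, and the sketch assumes the common Phred+33 ASCII encoding; older platforms used other offsets, so treat `offset=33` as an assumption.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        yield header[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Convert a quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


# Hypothetical read, for illustration only
example = "@read1\nACGT\n+\nII5!\n"

for read_id, sequence, quality in parse_fastq(example):
    print(read_id, sequence, phred_scores(quality))
```

A higher score means a more reliable base call; real pipelines usually rely on an established parser rather than hand-written code like this.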
27. Overview major DNA reading technologies
Dennis Wall, NGS Data Analysis and Computation I course, Wall Lab
28. In the primary sequence dbs a distinction
can be made between two major categories
High quality single submissions (Sanger)
- gene sequence (genomic – 'STD' data class)
- mRNA sequence (via cDNA – 'STD')
- BAC/YAC/cosmid sequences
- genome sequencing projects (contigs, assemblies, WGS)
- genome markers, STS (sequence tagged sites, unique short sequences from a genome)
Low quality batch submissions
- Expressed Sequence Tags (EST)
- Genome Survey Sequences (GSS)
- high-throughput sequence data (e.g. NGS)
(Diagram: DNA; RNA → cDNA)
http://www.ebi.ac.uk/ena/about/formats
29. The batch submissions originate mostly from
sequencing centers
Large-scale sequencing projects:
chromosome → fragment → sequencing library → sequence reads (submission, e.g. whole genome shotgun)
→ assemble → sequence (submission)
→ annotation (submission), e.g. cyp30, cyp309, insv, cg343
30. Each primary database stores its sequences
and batch submissions in its own way...
- NCBI: ESTs are stored in dbEST (separate database)
- ENA: ESTs are part of EMBL-bank in 'EST' data class
Similar for GSS (see dbGSS at NCBI)
ESTs: expressed sequence tags, often partial sequences
derived from RNA in batch (sample → RNA → RNA-seq). See example:
>est1
ATCGACTAGCATCA
>est2
TCGACTAGCGACTA
>est3
CAGCATCATCGAC
31. Batch submissions are marked and/or stored
differently than single submissions
http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt
ENA structure (TIER: CLASS → TYPE)
ENA-Annotation: Feature annotation → 1) EMBL-Bank (data class ESTs are also batch submissions)
ENA-Assembly: Assembly information
ENA-Reads: Sequencing and sampling information → batch submissions:
  2) Trace Archive - Raw data (capillary sequencing)
  3) Sequence Read Archive - Raw data (Next Gen sequencing)
32. The 'normal' submissions are a minority in
primary sequence databases
http://www.ebi.ac.uk/ena/about/statistics#embl_bases_per_dataclass
33. Primary sequence dbs are synchronised and
every sequence receives a unique identifier
The database maintainers assign each sequence a unique accession number (AC) that is shared
across all three databases, in addition to their own ID number (info at NCBI). When a sequence is
updated, the version number appended to the accession number (e.g. .1) is incremented (see SVA)
Example of acc number: BC010109.2
http://www.insdc.org/
International Nucleotide Sequence Database Collaboration: GenBank, DDBJ and ENA (+ SRA)
Synchronized daily
Collaboration on features, taxonomy, ...
All use the same
- Accession IDs
- Project IDs
- Feature tables (see later)
http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)
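As a small illustration of the accession.version convention, the following Python sketch splits an identifier such as BC010109.2 into its accession number and version. The function name `split_accession` is made up for this example.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier carries no version suffix.
    """
    accession, dot, version = identifier.partition(".")
    return accession, int(version) if dot else None


print(split_accession("BC010109.2"))  # accession with version
print(split_accession("BC010109"))    # accession without version
```

Searching with the bare accession typically leads to the latest version, whereas the versioned form pins one specific sequence.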
34. One sequence entry contains three categories
of different types of information
1. Info about sequence, submitters and literature (metadata)
2. Annotations of the sequence (metadata related to the seq)
3. Stretch of ATGC / AUGC sequence (the 'data', at the bottom)
• A sequence record is called 'annotated' when biological information is
added and linked to a position in the sequence
• Annotations, also called 'features', are abbreviated as codes, which
can be found in the Feature Tables
http://www.ebi.ac.uk/embl/Documentation/FT_d
35. This sequence information can be written in
different formats
(plain) Text format, e.g. GenBank
1. General info
- Official shared accession
- GenBank-specific identifier (simply counts up with each new record)
A lot of different identifiers! Roughly as many as there are databases
→ conversion tools that translate identifiers are needed (see exercises)
*In humans: the HUGO Gene Nomenclature Committee determines the right gene name
http://mobyle.pasteur.fr/cgi-bin/portal.py#tutorials::seqfmt
36. 2. Annotation
db_xref = cross references
= links to records of other databases which are related
to this record (see later). The format is dbname:identifier
Feature name Qualifier name
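The dbname:identifier layout of a db_xref value is easy to take apart programmatically. Below is a minimal Python sketch; the function name and the example values are illustrative assumptions, not taken from a specific record.

```python
def parse_db_xref(value):
    """Split a db_xref value of the form 'dbname:identifier'.

    Only the first colon separates the two parts, because some
    identifiers contain colons themselves.
    """
    dbname, colon, identifier = value.partition(":")
    if not colon or not dbname or not identifier:
        raise ValueError(f"not a dbname:identifier value: {value!r}")
    return dbname, identifier


# Illustrative values (assumed, not copied from a real entry)
for xref in ["taxon:9606", "GeneID:12345"]:
    print(parse_db_xref(xref))
```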
37. 3. Sequence
Each protein sequence also receives an
accession number
38. Other sequence formats
Fasta (minimal metadata, basically only sequence)
>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGC
CGGACAGCTCTCGAGCGCATCGACGACGAC
ASN.1 : Abstract Syntax Notation One
EMBL : all info as in GenBank (gb), online referred to as 'plain text'
XML
Fastq : sequence info and base 'call' quality
Important
'Format' has nothing to do with the program in which you save your file! You don't
have a choice: it needs to be 'plain text format' (.txt, not a file type meant for
MS Word such as .doc or .rtf). Wordpad is a good choice for this, provided you save
as plain text. 'Format' in bioinfo is all about how the information is structured and
written down in the plain text file.
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
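Because a format is only a convention for structuring plain text, a few lines of code are enough to read one. The Python sketch below parses the Fasta example from this slide; `parse_fasta` is a hypothetical helper written for illustration, and established libraries are preferable for real work.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


fasta_text = """>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGC
CGGACAGCTCTCGAGCGCATCGACGACGAC
"""

for header, sequence in parse_fasta(fasta_text):
    print(header, len(sequence))
```

The header line carries the only metadata; everything else is joined into one sequence string.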
39. Degree of annotation differs between entries
http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt
Batch submitted sequences are annotated poorly, single submissions are annotated better
ENA structure (TIER: CLASS → TYPE)
ENA-Annotation: Feature annotation → 1) EMBL-Bank (good seq annotations)
ENA-Assembly: Assembly information
ENA-Reads: Sequencing and sampling information →
  2) Trace Archive - Raw data (capillary sequencing)
  3) Sequence Read Archive - Raw data (Next Gen sequencing)
Experiment information is of most importance in batch submissions (e.g. which species, which technique, ...)
40. SRA contains batch submitted records of which
experiment information is of most importance
Since the sequences are barely (if at all) annotated, the
experiment description is important: which machine, which
organism, which tissue, which developmental stage,
disease, treatment, …
41. How to get sequences into the db, and back out
Submit
Always submit your sequence data (mostly obliged by journals) and include your ACC
number in articles (not any other number).
- Sequin (GenBank, stand-alone)
- BankIt (GenBank, web tool)
- Webin (EMBL, online submission)
Retrieve
One or few sequences → use one of the numerous web-based tools
- GenBank: Entrez
- EMBL: EB-eye
- MRS: developed for easy retrieval
Many sequences (batch retrieval)
→ use ftp (file transfer protocol)
→ use perl (flexible programming language)
→ BioMart
http://www.biomart.org/
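Scripted batch retrieval can be sketched in any scripting language; Python is used here instead of the perl mentioned on the slide. The sketch only builds a request URL for NCBI's E-utilities EFetch service, which is not covered in this slide, so the choice of service and of the `db`, `rettype` and `retmode` values are assumptions made for the example. Nothing is downloaded.

```python
from urllib.parse import urlencode

# NCBI E-utilities EFetch endpoint (an assumption of this sketch; not from the slide)
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Build an EFetch URL that requests several records in one call."""
    params = {
        "db": db,                     # nucleotide records
        "id": ",".join(accessions),   # comma-separated accession numbers
        "rettype": rettype,           # "fasta" or "gb" (GenBank flat file)
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


print(efetch_url(["BC010109.2"]))
# To actually download, pass the URL to urllib.request.urlopen(...)
```

When retrieving many records, keep the request rate low and follow the usage guidelines of the service.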
42. Example of a primary NA sequence record (ENA)
http://www.ebi.ac.uk/ena/about/formats
43. Example of a primary NA sequence record (ENA)
Text format
Code usable for searching | Data linked to that code
http://www.ebi.ac.uk/ena/about/formats
44. Primary sequence data contains a lot of
redundancy!
Chromosome sequence
Several gene sequences
from different labs
EST sequences
from transcripts
cDNA sequence
All match the same gene. Often you end up with all these
sequences in your database search...
A lot of redundancy!
45. The primary sequences are the basis for
analyses that generate derived sequence data
Scientists/Consortia → primary databases
– Source for further analyses. Which?
• Create protein sequences
• Curate the sequence database
• Assemble genomes
• Searching similarities
• Aggregate information about one gene
• …
Results stored in derived databases
47. The most important protein db is UniProt and
contains 'automatic' and manual entries
UniProt Knowledge Base - 'the best annotated protein
database of the world'
http://www.uniprot.org/
48. The most important protein db is UniProt and
contains 'automatic' and manual entries
49. Refseq - The NCBI way to reduce redundancy in
primary sequence data
RefSeq is NCBI 'Reference Sequences' (prot and nuc)
Redundancy from primary sequence data is reduced both
automatically and by manual annotation of NA and protein
sequences. 'one natural biological molecule = one entry'. Links
back to the original primary sequences. Hugely popular and a
basis for a lot of analyses.
Click to apply
refseq filter in
entrez search
http://www.ncbi.nlm.nih.gov/RefSeq/
50. RefSeq has its own identifiers, not to be mixed
up with accession numbers
RefSeq entry codes look similar to ACC numbers (but are not ACC numbers: note the
underscore!), and RefSeq records are also in GenBank format. Note: in the 'Features'
section one can find the raw sequences from which the entry was derived. (Typical
mistake: searching with a RefSeq code in UniProt.)
NC_* (curated) complete genomic element (chromosome, plasmid,...)
NT_* (automated) intermediate assembly from BAC
NZ_* (automated) incomplete genomic sequence from WGS
NW_* (automated) intermediate assembly from WGS
NG_* (curated) incomplete genomic element corresponding to gene
NM_* (curated) mRNA
NR_* (curated) non-coding RNA or predicted transcript of pseudogene
NP_* (curated) protein
ZP_* (automated) protein predicted from WGS sequence (NZ_*)
YP_* (curated) other predicted protein sequences from NCBI Genome Annotation Pipeline
XM_* (automated) mRNA
XR_* (automated) non-coding RNA or predicted transcript of pseudogene
XP_* (automated) protein
http://www.ncbi.nlm.nih.gov/RefSeq/key.html
http://www.ncbi.nlm.nih.gov/RefSeq/
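The prefix table above can be turned into a simple lookup. The Python sketch below copies the prefixes listed on this slide into a dictionary and classifies an identifier by the part before the underscore; the function name is hypothetical and NM_000546.5 serves only as an example identifier.

```python
# Prefix -> (curation level, molecule type), copied from the table on this slide
REFSEQ_PREFIXES = {
    "NC": ("curated", "complete genomic element"),
    "NT": ("automated", "intermediate assembly from BAC"),
    "NZ": ("automated", "incomplete genomic sequence from WGS"),
    "NW": ("automated", "intermediate assembly from WGS"),
    "NG": ("curated", "incomplete genomic element corresponding to gene"),
    "NM": ("curated", "mRNA"),
    "NR": ("curated", "non-coding RNA or predicted transcript of pseudogene"),
    "NP": ("curated", "protein"),
    "ZP": ("automated", "protein predicted from WGS sequence"),
    "YP": ("curated", "other predicted protein sequences"),
    "XM": ("automated", "mRNA"),
    "XR": ("automated", "non-coding RNA or predicted transcript of pseudogene"),
    "XP": ("automated", "protein"),
}


def classify_refseq(identifier):
    """Return (curation, molecule) for a RefSeq identifier, or None otherwise."""
    prefix, underscore, _rest = identifier.partition("_")
    if not underscore:
        return None  # no underscore: probably a primary accession number
    return REFSEQ_PREFIXES.get(prefix.upper())


print(classify_refseq("NM_000546.5"))
print(classify_refseq("BC010109.2"))
```

The underscore test alone already separates RefSeq identifiers from primary accession numbers.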
51. UniRef – UniProt redundancy reducing system for
protein sequences
Non-redundant protein sequences from UniProt (~ RefSeq)
Hiding redundant sequences by clustering them
• UniRef100 = completely identical sequences
• UniRef90 = sequences at least 90% identical
• UniRef50 = sequences at least 50% identical
See http://www.uniprot.org/help/uniref
52. NCBI's Gene – summarizes gene information
including sequence information from primary dbs
Example of the gene NPR1 from A. thaliana
54. And a lot more derived databases with
sequence information exist
Repbase:
repeats (Alu, …), maintained by Jerzy Jurka at the Genetic
Information Research Institute (Mountain View CA, USA).
The CENSOR server allows you to "clean" sequences.
http://www.girinst.org/repbase
MiRBase → published miRNA sequences
http://www.mirbase.org/
Eukaryotic promoter database
http://www.epd.isb-sib.ch/
UniVec
GenBank subset + some sequences from commercial sources -
ftp://ftp.ncbi.nih.gov/pub/UniVec/
55. The most important sequence databases
overview
Prim seq data: GB, ENA, DDBJ
Derived: GenPept, trEMBL
Curated: RefSeq, SwissProt
Integrated search portals: Entrez, ENA search, EB-eye, UniProt
56. Common gene annotations on sequences
Genome sequence (e.g. Chr6): enhancers/promoters, gene sequence, terminator
Gene sequence: exons, introns
mRNA: 5'UTR, CDS, 3'UTR, poly(A) tail (AAAAAAAAAAAAA)
Protein: derived from the CDS via genetic code tables
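The last step in the diagram, from CDS to protein via a genetic code table, can be sketched in Python as follows. The sketch assumes the standard genetic code (other tables exist, for example for mitochondria) and uses a short invented CDS; unknown codons are reported as X.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (assumption: the standard table; organelles and some species differ)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence codon by codon; '*' marks a stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA letters
    protein = []
    for start in range(0, len(cds) - 2, 3):
        protein.append(CODON_TABLE.get(cds[start:start + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))  # invented mini-CDS: start codon, one codon, stop
```

Trailing bases that do not fill a complete codon are ignored.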
57. Searching the database for your gene of interest
First you have to determine for yourself
which information you want
- NA sequences vs. protein sequences
- If NA, genomic sequences, or RNA derived
- All possible sequences that exist, or curated ones
- Protein sequences of which quality
- ...
58. Entrez is a starting point for searches at NCBI
http://www.ncbi.nlm.nih.gov/sites/gquery
60. ENA has its text-search portal
http://www.ebi.ac.uk/ena/
61. Results from an ENA search are organised
following the ENA database structure
62. UniProt has a simple search box leading to a
sophisticated search results page
63. Complex searches can be achieved by using the
index codes in the database
e.g. “oc=Primates and de=complete and de=cds and de=MHC”
(oc and de are codes usable for searching)
Could answer: give me all coding sequences of MHC available in primates.
64. Meta-search tools can search different
sequence databases at once.
MRS
Open Source, developed by Maarten Hekkelman at Radboud U.
(Nijmegen, the Netherlands). Allows searching in different databases at
once, and also provides statistics on the databases.
Alternatives: ACNUC, SRS
65. Logical operators
Searching involves making combinations of conditions.
The difference between the logical operators AND, OR and NOT is explained
here by Venn diagrams.
Q1 AND Q2 : &
Q1 NOT Q2 : !
Q1 OR Q2 : |
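The same three combinations can be tried out with Python sets, which behave like the regions of a Venn diagram. The two result sets below are hypothetical hit lists (the identifiers are invented), standing in for the records returned by query Q1 and query Q2.

```python
# Hypothetical hits of two queries (identifiers invented for illustration)
q1 = {"seqA", "seqB", "seqC"}
q2 = {"seqB", "seqC", "seqD"}

both = q1 & q2      # Q1 AND Q2: records found by both queries
only_q1 = q1 - q2   # Q1 NOT Q2: records found by Q1 but not by Q2
either = q1 | q2    # Q1 OR Q2: records found by at least one query

print(sorted(both))
print(sorted(only_q1))
print(sorted(either))
```

AND narrows a search, OR widens it, and NOT removes unwanted hits.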
66. Hands-on!
Every module ends with an exercise
session.
We will now explore how data is stored in different
sequence databases. You get …. minutes for this
exercise.
Afterwards, we will summarize some of the difficulties
some of you might have experienced.
67. Summary
This course is organised in several modules
Module 1: Sequence databases
Three major nucleotide databanks host primary sequence data
These databases are filled with NA sequence information by scientists and consortia
The batch submissions originate mostly from sequencing centers
Each primary database stores its sequences and batch submissions in its own way...
Batch submissions are marked and/or stored differently than single submissions
The 'normal' submissions are a minority in primary sequence databases
Primary sequence dbs are synchronised and every sequence receives a unique identifier
One sequence entry contains three categories of different types of information
This sequence information can be written in different formats
Degree of annotation differs between entries
SRA contains batch submitted records of which experiment information is of most importance
How to get sequences into the db, and back out
Primary sequence data contains a lot of redundancy!
The primary sequences are the basis for analyses that generate derived sequence data
Protein databases come in two kinds
The most important protein db is UniProt and contains 'automatic' and manual entries
Refseq - The NCBI way to reduce redundancy in primary sequence data
RefSeq has its own identifiers, not to be mixed up with accession numbers
UniRef – UniProt redundancy reducing system for protein sequences
NCBI's Gene – summarizes gene information including sequence information from primary dbs
UniGene – summarizes transcriptomic information around genes
And a lot more derived databases with sequence information exist
Searching the database for your gene of interest
Entrez is a starting point for searches at NCBI
Visualising the db_xrefs in records at NCBI
ENA has its text-search portal
Results from an ENA search are organised following the ENA database structure
UniProt has a simple search box leading to a sophisticated search results page
Complex searches can be achieved by using the index codes in the database
Meta-search tools can search different sequence databases at once.
Hands-on!