3. What is bioinformatics?
Bioinformatics is an interdisciplinary field that
develops methods and software tools for
understanding biological data. As an
interdisciplinary field of science, bioinformatics
combines computer science, statistics,
mathematics, and engineering to study and
process biological data.
http://en.wikipedia.org/wiki/Bioinformatics
2015-03-23 3
4. A little bit of history
• 1951 – Sequencing peptide (Frederick Sanger)
• 1965 – Sequencing RNA (Robert Holley)
• 1970 – Term BIOINFORMATICS coined by
Paulien Hogeweg & Ben Hesper
• 1977 – Sequencing DNA (Frederick Sanger)
• 1990 – Human Genome Project started
(expected duration 15 years)
• 2003 – Human Genome Project completed
5. Why is bioinformatics so important?
• It’s all about money!
6. Cost of sequencing
Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125
7. Cost of sequencing & data analysis
Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125
9. Future of biological research
• With rapidly advancing automation, less human
effort will be needed for sample preparation
• As the amount of information increases, data
analysis will become more important
• The information output of experiments is
growing beyond human capability, so high-level
summaries and statistics are needed
16. Quality filtering and trimming
TAGCGCAATACTTTCTGTTAGCGCAAATCCTAGTAGTGCAT
AGTGGTATCAACGCAGAGTACGGG
17. Sequence search (BLAST)
• BLAST is one of the most commonly used
bioinformatics tools
• It finds short sub-sequences of your query
in the subject sequence
• It matches short words from the query against
the subject database, then uses heuristics to
verify and extend each match
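The word-matching idea can be illustrated with a minimal seed-and-extend sketch. This is not BLAST itself: it uses exact word hits and ungapped extension that stops at the first mismatch, with no scoring matrix or statistics. The function names and the default `word_size` are assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(subject, word_size):
    """Map every word (k-mer) of the subject to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - word_size + 1):
        index[subject[i:i + word_size]].append(i)
    return index


def extend_hit(query, subject, q_pos, s_pos, word_size):
    """Extend an exact word hit left and right while the characters match."""
    q_start, s_start = q_pos, s_pos
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = q_pos + word_size, s_pos + word_size
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return q_start, s_start, q_end - q_start


def seed_and_extend(query, subject, word_size=4):
    """Return (query_start, subject_start, length) matches, longest first."""
    index = build_word_index(subject, word_size)
    hits = set()
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in index.get(word, []):
            hits.add(extend_hit(query, subject, q_pos, s_pos, word_size))
    return sorted(hits, key=lambda hit: -hit[2])


matches = seed_and_extend("GCGCAATACT", "TAGCGCAATACTTTCTG")
```

Indexing the subject once keeps each word lookup cheap; the real program adds scored, gapped extension and significance estimates on top of this idea.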
20. Sequence/genome alignment
• Global alignment
– global optimization that "forces" the alignment
to span the entire sequences
(Needleman–Wunsch algorithm or Clustal style)
• Local alignment
– identify short regions of similarity within long
divergent sequences
(Smith–Waterman algorithm or BLAST style)
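The difference between the two styles shows up in a few lines of dynamic programming. The sketch below computes only the optimal score (no traceback), and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere: this is what makes it local.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best
```

For two divergent sequences that share only a short motif, `local_score` rewards the motif alone, while `global_score` also pays for every mismatch and gap needed to span both sequences end to end.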
22. Genome alignment
• Glocal alignment
• Uses a word matching method
• Creates a suffix tree for faster search
• Searches the suffix tree for exact word
matches, clusters them, and then uses local
alignment methods to extend each match
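A rough sketch of the exact-match and clustering steps follows. It swaps the suffix tree for a sorted suffix array, which is simpler to write and answers the same exact-match queries, and it clusters hits by diagonal (genome position minus query position). The function names and this clustering rule are assumptions for illustration, not the method of any particular tool.

```python
from bisect import bisect_left
from collections import defaultdict


def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(word, suffix_array, prefixes):
    """Binary-search the sorted suffix prefixes for every occurrence of word."""
    i = bisect_left(prefixes, word)
    positions = []
    while i < len(prefixes) and prefixes[i] == word:
        positions.append(suffix_array[i])
        i += 1
    return sorted(positions)


def cluster_by_diagonal(query, genome, word_size):
    """Group exact word hits that lie on the same diagonal of the alignment matrix."""
    suffix_array = build_suffix_array(genome)
    # Truncating sorted suffixes to word_size characters keeps them in sorted order.
    prefixes = [genome[i:i + word_size] for i in suffix_array]
    clusters = defaultdict(list)
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for g_pos in find_exact(word, suffix_array, prefixes):
            clusters[g_pos - q_pos].append((q_pos, g_pos))
    return dict(clusters)


clusters = cluster_by_diagonal("ACGTAC", "TTACGTACGGA", word_size=3)
best_diagonal = max(clusters, key=lambda d: len(clusters[d]))
```

The diagonal with the most hits marks the region where a local alignment method would then be applied to extend the match.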
25. Assembly
• Short read assembly is extremely difficult
and computationally intensive!
• For longer reads, Overlap-Layout-Consensus
(OLC) assemblers are used
• For shorter reads (in high numbers), De Bruijn
graph assemblers are better
Source: Commins, Toft & Fares (CC BY-SA 2.5)
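To give a feel for the De Bruijn approach, here is a toy sketch: reads are cut into k-mers, each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by walking an unambiguous path. The reads and the value of `k` are made up for the example. Real assemblers also handle sequencing errors, coverage, repeats and reverse complements, none of which is attempted here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; every k-mer found in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Follow edges from start for as long as each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop instead of looping forever on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


reads = ["ACGTTG", "GTTGCA", "TGCATC"]  # hypothetical overlapping short reads
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ACG")
```

Because identical k-mers from different reads collapse into the same edge, the graph size depends on the genome rather than on the number of reads, which is why this approach suits large numbers of short reads.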
34. PDB and structural information
• The Protein Data Bank (PDB) holds structural
information on proteins, nucleic acids and
complexes – over 100 000 entries!
• The 3D structure can be determined by:
– X-ray diffraction
– NMR
– Electron microscopy
– Simulations
35. PDB and structural information
HEADER TRANSCRIPTION 18-MAR-04 1VD4
TITLE SOLUTION STRUCTURE OF THE ZINC FINGER DOMAIN OF TFIIE ALPHA
COMPND 2 MOLECULE: TRANSCRIPTION INITIATION FACTOR IIE, ALPHA
COMPND 8 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE 10 EXPRESSION_SYSTEM_PLASMID: PET11D
KEYWDS ZINC FINGER, TRANSCRIPTION
EXPDTA SOLUTION NMR
NUMMDL 20
AUTHOR M.OKUDA,A.TANAKA,Y.ARAI,M.SATOH,H.OKAMURA,A.NAGADOI,
REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
REMARK 500
REMARK 500 M RES CSSEQI PSI PHI
REMARK 500 1 GLU A 118 -36.12 -163.20
REMARK 500 1 ARG A 119 -92.03 -138.92
REMARK 500 1 THR A 122 -70.74 -110.33
SITE 1 AC1 5 CYS A 129 CYS A 132 CYS A 154 CYS A 157
SITE 2 AC1 5 THR A 159
CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 1.000000 0.000000 0.000000 0.00000
SCALE3 0.000000 0.000000 1.000000 0.00000
MODEL 1
ATOM 1 N ARG A 113 1.980 -19.277 -19.127 1.00 0.00 N
ATOM 2 CA ARG A 113 1.202 -19.280 -17.853 1.00 0.00 C
ATOM 3 C ARG A 113 0.666 -17.875 -17.557 1.00 0.00 C
ATOM 4 O ARG A 113 0.625 -17.023 -18.421 1.00 0.00 O
ATOM 5 CB ARG A 113 2.199 -19.713 -16.778 1.00 0.00 C
ATOM 6 CG ARG A 113 2.435 -21.222 -16.875 1.00 0.00 C
ATOM 7 CD ARG A 113 3.604 -21.619 -15.971 1.00 0.00 C
ATOM 8 NE ARG A 113 2.986 -21.899 -14.645 1.00 0.00 N
ATOM 9 CZ ARG A 113 3.125 -23.073 -14.094 1.00 0.00 C
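The ATOM records above can be read with a few lines of code. The sketch below splits each line on whitespace, which works for the excerpt as shown here; the official PDB format is fixed-width, so a robust parser should slice by column instead. The `Atom` type and function names are assumptions made for the example.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float
    element: str


def parse_atom_line(line):
    """Parse one whitespace-separated ATOM record; return None for other lines."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != "ATOM":
        return None
    return Atom(int(fields[1]), fields[2], fields[3], fields[4], int(fields[5]),
                float(fields[6]), float(fields[7]), float(fields[8]), fields[11])


def distance(a, b):
    """Euclidean distance between two atoms, in the file's units (angstroms)."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# First three ATOM records copied from the excerpt above.
EXCERPT = """\
ATOM 1 N ARG A 113 1.980 -19.277 -19.127 1.00 0.00 N
ATOM 2 CA ARG A 113 1.202 -19.280 -17.853 1.00 0.00 C
ATOM 3 C ARG A 113 0.666 -17.875 -17.557 1.00 0.00 C
"""

atoms = [a for a in map(parse_atom_line, EXCERPT.splitlines()) if a is not None]
n_to_ca = distance(atoms[0], atoms[1])  # backbone N to C-alpha of ARG 113
```

Once coordinates are in a structured form, quantities such as bond lengths, contacts or centers of mass become simple arithmetic.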
38. Molecular networks
• Bioinformatics is needed to describe
interactions between proteins, DNA, drugs…
• When thousands of interactions are
analyzed, network science comes into play
• The set of all protein-protein interactions in
a single cell is called the interactome
• A single interaction can be studied in
vivo/in vitro, but more complex networks can
only be investigated in silico
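In silico, an interaction network is usually stored as a graph. The sketch below keeps an undirected network as an adjacency dictionary and ranks proteins by their number of partners (degree), one of the simplest network-science measures. The protein names P1 to P5 and the interactions between them are hypothetical.

```python
from collections import defaultdict


def build_network(interactions):
    """Store an undirected interaction network as protein -> set of partners."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def hubs(network, top=1):
    """Proteins with the most interaction partners (ties broken by name)."""
    return sorted(network, key=lambda p: (-len(network[p]), p))[:top]


# Hypothetical pairwise interactions, e.g. as exported from an interaction database.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5")]
network = build_network(interactions)
top_hub = hubs(network)
```

With thousands of interactions the same structure supports questions such as which proteins are hubs or which groups of proteins form densely connected modules.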
40. Metabolic pathways
• Bioinformatics is also useful for describing
series of biochemical reactions that often
happen in different cellular compartments
• Describing pathways required the design of
specialized (graph) databases
• Modeling the flow of metabolites through a
pathway is virtually impossible without
computers
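A pathway fits naturally into a directed graph: metabolites are nodes and each reaction is an edge labeled with its enzyme. The sketch below stores a few well-known steps around the start of glycolysis and uses breadth-first search to find the shortest chain of enzymes between two metabolites. The data layout and function name are assumptions for illustration, and real pathway databases hold far more detail (stoichiometry, compartments, cofactors).

```python
from collections import deque

# (substrate, product, enzyme) triples; a tiny hand-written excerpt, not a database export.
REACTIONS = [
    ("glucose", "glucose-6-phosphate", "hexokinase"),
    ("glucose-6-phosphate", "fructose-6-phosphate", "phosphoglucose isomerase"),
    ("fructose-6-phosphate", "fructose-1,6-bisphosphate", "phosphofructokinase"),
    ("glucose-6-phosphate", "6-phosphogluconolactone", "glucose-6-phosphate dehydrogenase"),
]


def find_route(reactions, source, target):
    """Breadth-first search: shortest list of enzymes leading from source to target."""
    successors = {}
    for substrate, product, enzyme in reactions:
        successors.setdefault(substrate, []).append((product, enzyme))
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        metabolite, route = queue.popleft()
        if metabolite == target:
            return route
        for product, enzyme in successors.get(metabolite, []):
            if product not in visited:
                visited.add(product)
                queue.append((product, route + [enzyme]))
    return None  # target cannot be reached from source


route = find_route(REACTIONS, "glucose", "fructose-1,6-bisphosphate")
```

Treating each reaction as one-directional is a simplification; modeling actual metabolite flow would also need reaction rates or flux constraints.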
43. Simulation of biological systems
• Simulation of cell-cell interactions
• Description of interactions within a population
• Interactions between species
• Food chains => food web
• Social relations
• Evolution of populations
• Modeling in pharmacology
46. Databases
• Different types of public resources are available:
– Sequence data: nucleic acid sequences, protein sequences, ESTs, genomes
– Metadata/ontologies: functional annotation, gene models, gene ontologies
– Structural data: protein structures, complex structures, RNA structures
– Variation data: SNPs, SSRs, indels
– Interactions
– Metabolic data
– Pathways
48. Databases
• How to use them?
– Browsing websites directly
– Downloading
– Using an API
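As an illustration of programmatic access, the sketch below builds the download address for a PDB entry and fetches it with the Python standard library. The URL pattern (`files.rcsb.org/download/<ID>.pdb`) is an assumption based on the RCSB file download service and may change; the function names are made up for the example, and the fetch itself is not executed here.

```python
from urllib.request import urlopen


def pdb_download_url(pdb_id):
    """Build the download URL for a four-character PDB identifier (assumed URL pattern)."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a four-character PDB identifier: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id}.pdb"


def fetch_pdb(pdb_id, timeout=30):
    """Download the entry as text; needs network access, so it is not called below."""
    with urlopen(pdb_download_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = pdb_download_url("1vd4")
```

Calling `fetch_pdb("1VD4")` on a machine with network access should return the full text of the entry; scripting the request this way is what makes it practical to process hundreds of entries.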
49. Text/data mining
• Obtaining information from several
scientific resources is becoming more
difficult as the volume of information grows
• The number of different resources/databases is
growing, and a simple search has to be
repeated for each of them
• Filtering relevant information is a big
intellectual/computational burden
50. Text mining
• Retrieval, analysis and formatting (parsing)
of information into searchable databases
• Recognition of patterns
• Recognition of natural language
• Extraction of semantic or grammatical
relationships
• Coreference: terms that refer to the same
object
51. Text mining example
• Query: Find promoters known to work in
E. coli with the σ70 holoenzyme (Eσ70), aka σD
• SPARQL query:
# Prefix IRIs must be enclosed in angle brackets
PREFIX sbol: <http://sbols.org/sbol.owl#>
PREFIX pr: <http://partsregistry.org/#>
SELECT DISTINCT ?name
WHERE {
  ?part a sbol:Part ;
        sbol:status ?st ;
        sbol:name ?name ;
        sbol:dnaSequence ?seq ;
        a pr:promoter ;
        a ?cl .
  # Keep only sigma-70 promoters that have not been deleted
  FILTER (?cl = pr:sigma70_ecoli_prokaryote_rnap
          && ?st != 'Deleted')
}
52. Open source software
• Software that anyone can use, modify, share
and distribute.
• Source code is available and can (should!) be
modified to fit the user's requirements
• Community-driven development
• Dynamic development and early releases
• Security and transparency
53. Open source software repositories
• CRAN (The Comprehensive R Archive Network)
• CodePlex
55. CAN I BE A BIOINFORMATICIAN, TOO?
56. How to become a bioinformatician?
• Get a computer with Linux
• Learn how to use the bash shell
and how to run programs from the
command line
• Learn to code in Python or Perl
• Try solving basic problems on
57. How to become a bioinformatician?
• Read blogs:
• Read fora for geeks:
• Get an account on:
58. Want to know more?
• Join my network on
http://nl.linkedin.com/in/andrzejstefanczech
• Come to Wageningen for an internship at
Genetwister Technologies B.V.
http://www.genetwister.nl/
• Slides from this lecture are also available on
SlideShare