Intro to field_of_bioinformatics
1. 09/05/13 K-INBRE Bioinformatics Core KSU
Introduction to the field of bioinformatics
Sept. 2013
Jennifer Shelton
K-INBRE Bioinformatics Core KSU
2. 09/05/13 K-INBRE Bioinformatics Core KSU
Outline
I. Basic concepts
i. Definition of bioinformatics
ii. Databases (flat-file and relational)
iii. Assembly (Overlap-layout-consensus)
II. Steps you can take on your own
3. 09/05/13 K-INBRE Bioinformatics Core KSU
Definition of bioinformatics
“Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”
-NIH Biomedical Information Science and Technology Initiative Consortium 2000
Acquire data
Store/archive data
Organize data
Analyze data
Visualize data
(Biological, medical, behavioral, or health)
5. 09/05/13 K-INBRE Bioinformatics Core KSU
Problem with volume
“We believe the field of bioinformatics for genetic analysis will be one of the biggest areas of disruptive innovation in life science tools over the next few years,”
-Isaac Ro, Goldman Sachs
Per year worldwide we can generate ~13,000,000,000,000,000 bp of data
Mark Smiciklas, Flickr.com/photos/intersectionconsulting
6. 09/05/13 K-INBRE Bioinformatics Core KSU
Problem with volume
“This unprecedented amount of sequencing information poses bottlenecks that vary, depending on application, at the level of data extraction, analysis, and interpretation”
“These challenges have become part and parcel of the biomedical research community where investigators have increasingly needed to incorporate bioinformatics and biostatistics into their armamentarium.”
-Opportunities and Challenges Associated with Clinical Diagnostic Genome Sequencing: A Report of the Association for Molecular Pathology. The Journal of Molecular Diagnostics, November 2012
Mark Smiciklas, Flickr.com/photos/intersectionconsulting
7. 09/05/13 K-INBRE Bioinformatics Core KSU
Problem with volume
“It sounds like an analog solution in a digital age,”
-Sifei He, head of cloud computing for BGI (referring to FedExing disks of data because internet connections are often too slow)
NY Times 2011 article: DNA Sequencing Caught in a Deluge of Data http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?pagewanted=all&_r=0
10. 09/05/13 K-INBRE Bioinformatics Core KSU
Flat-file databases
GenBank: http://www.ncbi.nlm.nih.gov/genbank/
‘Records’: all the data about one unique object
‘Fields’: the same kind of data about different objects
11. 09/05/13 K-INBRE Bioinformatics Core KSU
Flat-file databases
Any flat-file database, like GenBank, can be thought of as a single spreadsheet, called a ‘table’, of ‘fields’ and ‘records’.
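A minimal Python sketch of the single-table idea: each row is a record, each column is a field. The three rows reuse gene IDs, symbols and EC numbers from the KEGG gene listing that appears later in these slides. The column names (gene_id, symbol, ec_number) and the tab-separated layout are assumptions made for illustration and are not GenBank's actual file format.

```python
import csv
from io import StringIO

# Hypothetical tab-separated flat file: one table, one header row of field names.
flat_file = StringIO(
    "gene_id\tsymbol\tec_number\n"
    "1543\tCYP1A1\t1.14.14.1\n"
    "10\tNAT2\t2.3.1.5\n"
    "9\tNAT1\t2.3.1.5\n"
)

# Each row becomes one record (a dict keyed by field name).
records = list(csv.DictReader(flat_file, delimiter="\t"))

# A record: everything we know about one object.
nat2 = next(r for r in records if r["symbol"] == "NAT2")

# A field: the same kind of data across all objects.
symbols = [r["symbol"] for r in records]

print(nat2)
print(symbols)
```

Reading one row gives a record, and reading one column gives a field, which is all a flat file offers. Anything that needs several values per object (for example several EC numbers for one gene) has to be squeezed into a single cell.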
12. 09/05/13 K-INBRE Bioinformatics Core KSU
Relational databases
Have multiple tables, with some shared fields and some different fields
‘Fields’: the same kind of data about different objects
KEGG: http://www.genome.jp/kegg/pathway.html
13. 09/05/13 K-INBRE Bioinformatics Core KSU
Relational databases
Relational databases are like multiple tables linked by a shared field, which acts like a “key” between them.
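The shared-field idea can be sketched with Python's built-in sqlite3 module. The two table names (genes, gene_ec) and their columns are hypothetical; the gene IDs, symbols and EC numbers are taken from the KEGG listing in these slides. This is only an illustration of linking tables by a key, not a description of how KEGG itself is stored.

```python
import sqlite3

# In-memory database with two hypothetical tables that share the gene_id field.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genes   (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE gene_ec (gene_id INTEGER REFERENCES genes(gene_id),
                          ec_number TEXT);
    """
)
conn.executemany(
    "INSERT INTO genes VALUES (?, ?)",
    [(1543, "CYP1A1"), (1576, "CYP3A4"), (10, "NAT2")],
)
# One gene can have several EC numbers, so they live in their own table.
conn.executemany(
    "INSERT INTO gene_ec VALUES (?, ?)",
    [
        (1543, "1.14.14.1"),
        (1576, "1.14.13.67"),
        (1576, "1.14.13.97"),
        (1576, "1.14.13.32"),
        (10, "2.3.1.5"),
    ],
)

# The shared gene_id field is the key that joins the two tables.
rows = conn.execute(
    """
    SELECT g.symbol, e.ec_number
    FROM genes AS g JOIN gene_ec AS e ON g.gene_id = e.gene_id
    WHERE g.symbol = ?
    ORDER BY e.ec_number
    """,
    ("CYP3A4",),
).fetchall()
print(rows)
```

Splitting the data this way lets one gene carry any number of EC numbers without repeating the gene's other fields in every row.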
KEGG PATHWAY: hsa05204 (www.genome.jp/dbget-bin/www_bget?pathway+hsa05204, 9/25/12)
Organism: Homo sapiens (human) [GN:hsa]
Gene:
1543 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 (EC:1.14.14.1) [KO:K07408] [EC:1.14.14.1]
1576 CYP3A4; cytochrome P450, family 3, subfamily A, polypeptide 4 (EC:1.14.13.67 1.14.13.97 1.14.13.32) [KO:K07424] [EC:1.14.14.1]
1577 CYP3A5; cytochrome P450, family 3, subfamily A, polypeptide 5 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
1551 CYP3A7; cytochrome P450, family 3, subfamily A, polypeptide 7 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
64816 CYP3A43; cytochrome P450, family 3, subfamily A, polypeptide 43 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
5743 PTGS2; prostaglandin-endoperoxide synthase 2 (prostaglandin G/H synthase and cyclooxygenase) (EC:1.14.99.1) [KO:K11987] [EC:1.14.99.1]
10 NAT2; N-acetyltransferase 2 (arylamine N-acetyltransferase) (EC:2.3.1.5) [KO:K00622] [EC:2.3.1.5]
9 NAT1; N-acetyltransferase 1 (arylamine N-acetyltransferase) (EC:2.3.1.5) [KO:K00622] [EC:2.3.1.5]
1544 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 (EC:1.14.14.1) [KO:K07409] [EC:1.14.14.1]
6799 SULT1A2; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 2 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
6817 SULT1A1; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 1 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
6818 SULT1A3; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 3 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
445329 SULT1A4; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 4 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
1545 CYP1B1; cytochrome P450, family 1, subfamily B, polypeptide 1 (EC:1.14.14.1) [KO:K07410] [EC:1.14.14.1]
1558 CYP2C8; cytochrome P450, family 2, subfamily C, polypeptide 8 (EC:1.14.14.1) [KO:K07413] [EC:1.14.14.1]
1562 CYP2C18; cytochrome P450, family 2, subfamily C, polypeptide 18 (EC:1.14.14.1) [KO:K07413] [EC:1.14.14.1]
1557 CYP2C19; cytochrome P450, family 2, subfamily C, polypeptide 19 (EC:1.14.13.48 1.14.13.49 1.14.13.80) [KO:K07413] [EC:1.14.14.1]
1559 CYP2C9; cytochrome P450, family 2, subfamily C, polypeptide 9 (EC:1.14.13.48 1.14.13.49 1.14.13.80) [KO:K07413] [EC:1.14.14.1]
2052 EPHX1; epoxide hydrolase 1, microsomal (xenobiotic)
15. 09/05/13 K-INBRE Bioinformatics Core KSU
Assembly
Of the ~13,000,000,000,000,000 bp of sequence data we can generate each year, most does not cover the full length of a DNA or RNA molecule.
Instead, scientists get back multiple copies of their genome (or transcriptome), but all in short segments (between 50 bp and several kb).
Steps of Overlap-Layout-Consensus (OLC):
1) Let’s think of a genome like the text of a book. We get back multiple copies of the book.
16. 09/05/13 K-INBRE Bioinformatics Core KSU
OLC Assembly
1) Instead of being nicely bound, we get randomly shredded text, all mixed together, from our multiple copies
ice was beginning to get very tired of sitting by her tister on the bank, and of having nothing to do
Alice was beginning to get vory tired of sitting by her sister on the bank, and of having nothing to do: once
lice was beginning to get very tired of sitting by her sister on the bank, and of having nothing
17. 09/05/13 K-INBRE Bioinformatics Core KSU
OLC Assembly
2) We look for lines that overlap for more than some minimum number of letters (in these programs all overlaps are found, then a single “path” is found through this “graph” of overlaps)
ice was beginning to get very tired of sitting by her tister on the bank, and of having nothing to do
Alice was beginning to get vory tired of sitting by her sister on the bank, and of having nothing to do: once
lice was beginning to get very tired of sitting by her sister on the bank, and of having nothing
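The overlap and layout steps can be sketched in a few lines of Python. This is a simplified illustration and not how any particular assembler works: it only finds exact suffix-to-prefix overlaps, and it replaces the search for a path through the overlap graph with a greedy merge of the best-overlapping pair. The three error-free fragments, the function names, and the minimum overlap of 3 letters are all made up for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """All pairwise overlaps: {(i, j): length} means read i runs into read j."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    graph[(i, j)] = n
    return graph


def greedy_layout(reads, min_len=3):
    """Repeatedly merge the pair with the longest overlap (a greedy shortcut)."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for (i, j), n in overlap_graph(reads, min_len).items():
            if n > best_n:
                best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical error-free fragments of the same sentence.
fragments = [
    "Alice was beginning to get",
    "to get very tired of sitting",
    "of sitting by her sister",
]
graph = overlap_graph(fragments)
contigs = greedy_layout(fragments)
print(graph)
print(contigs)
```

Exact matching keeps the sketch short, but real reads contain errors (like “tister” and “vory” in the shredded text), so real overlap detection has to tolerate mismatches.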
19. 09/05/13 K-INBRE Bioinformatics Core KSU
OLC Assembly
3) We move column by column, counting the letters in each column, and make a note of the most common letter (take the consensus)
ice was beginning to get very tired of sitting by her tister on the bank, and of having nothing to do
Alice was beginning to get vory tired of sitting by her sister on the bank, and of having nothing to do: once
lice was beginning to get very tired of sitting by her sister on the bank, and of having nothing
Consensus: Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do
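The column-by-column vote can be sketched in Python using the three shredded lines from the slides. The starting offsets (2, 0 and 1) are supplied by hand here as an assumption; in an assembler they would come from the layout step. The function name and the choice to keep columns covered by only one read are also assumptions of this sketch.

```python
from collections import Counter


def consensus(placed_reads):
    """Majority letter in every column of a set of (offset, read) pairs."""
    width = max(offset + len(read) for offset, read in placed_reads)
    letters = []
    for col in range(width):
        # Count the letters of every read that covers this column.
        column = Counter(
            read[col - offset]
            for offset, read in placed_reads
            if offset <= col < offset + len(read)
        )
        if column:
            letters.append(column.most_common(1)[0][0])
    return "".join(letters)


# The shredded lines from the slides, placed at hand-picked offsets.
placed = [
    (2, "ice was beginning to get very tired of sitting by her tister "
        "on the bank, and of having nothing to do"),
    (0, "Alice was beginning to get vory tired of sitting by her sister "
        "on the bank, and of having nothing to do: once"),
    (1, "lice was beginning to get very tired of sitting by her sister "
        "on the bank, and of having nothing"),
]
result = consensus(placed)
print(result)
```

The errors “tister” and “vory” are each outvoted two to one. Because this sketch keeps columns covered by a single read, the trailing “: once” from the second line also survives; a stricter version could require a minimum coverage per column.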
24. 09/05/13 K-INBRE Bioinformatics Core KSU
Assemblies
Big bluestem, Sand bluestem (Bischof B.)
Bittersweet, Balsam (homenursery.com, gardeninginsomnia.com)
Red flour beetle (Day E.)
[Figure: relative reflectance of EWC vs. wavelength (nm), 400 to 800 nm; series: Sand bluestem (removed), Sand bluestem (intact), Big bluestem (removed), Big bluestem (intact)]
[Figure: Sand bluestem assembly length and number of contigs; cumulative length of sequences (Mb) and number of sequences by assembly k-mer value or name (k = 23 to 61, MIRA (454), MIRA cluster)]
[Figure: Sand bluestem N values; N75, N50 and N25 contig length (kb) by assembly k-mer value or name]

k-mer         N75 (kb)  N50 (kb)  N25 (kb)  Cumulative length of sequences (Mb)  Number of sequences x 10^5
27            1.219     2.028     3.126     142.633358                           1.28113
37            1.206     2.008     3.087     128.100083                           1.1091
47            1.195     1.977     3.051     113.176134                           0.93839
57            1.271     2.035     3.096     102.507455                           0.82755
merge         1.41      2.211     3.331     345.752982                           2.31102
CDH cluster   1.44      2.27      3.422     84.202533                            0.59174
MIRA cluster  1.804     2.69      3.941     105.920843                           0.50279

[Figure: Balsam N values; N75, N50 and N25 contig length (kb) by assembly k-mer value or name (k = 27, 37, 47, 57, merge, CDH cluster, MIRA cluster)]
[Figure: Balsam assembly length and number of contigs; cumulative length of sequences (Mb) and number of sequences x 10^5 by assembly k-mer value or name]

k-mer         N75 (kb)  N50 (kb)  N25 (kb)  Cumulative length of sequences (Mb)  Number of sequences x 10^5
27            1.213     2.11      3.221     175.505163                           1.61952
37            1.176     2.026     3.068     154.222168                           1.36947
47            1.168     1.948     2.932     129.331497                           1.07545
57            1.218     1.974     2.95      111.672465                           0.90385
merge         1.404     2.23      3.299     418.762352                           2.77833
CDH cluster   1.399     2.274     3.339     96.411479                            0.70852
MIRA cluster  1.825     2.676     3.856     123.666263                           0.59598

[Figure: Bittersweet assembly length and number of contigs; cumulative length of sequences (Mb) and number of sequences x 10^5 by assembly k-mer value or name]
[Figure: Bittersweet N values; N75, N50 and N25 contig length (kb) by assembly k-mer value or name (k = 27, 37, 47, 57, merge, CDH cluster, MIRA cluster)]
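The N25, N50 and N75 values reported for these assemblies can be computed from a list of contig lengths. The sketch below assumes the commonly used definition (NX is the length of the shortest contig in the smallest set of longest contigs that together hold at least X% of the assembled bases); the slides do not state which definition was used, and the contig lengths in the example are made up.

```python
def n_value(contig_lengths, fraction):
    """Contig length at which the running total of the longest contigs
    first reaches `fraction` of all assembled bases (0.5 gives N50)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Made-up contig lengths (in kb) for illustration only.
lengths = [2, 2, 2, 3, 3, 4, 8, 8]
n25 = n_value(lengths, 0.25)
n50 = n_value(lengths, 0.50)
n75 = n_value(lengths, 0.75)
print(n25, n50, n75)
```

Under this definition N75 is never larger than N50, which is never larger than N25, matching the ordering of the three columns in the tables above.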
26. 09/05/13 K-INBRE Bioinformatics Core KSU
What can you do to get prepared?
If you are interested in studying bioinformatics, here is an outline of increasingly complex levels of skills you might work towards:
•Layer 1 – Using web to analyze biological data
•Layer 2 – Ability to install and run new programs
•Layer 3 – Writing own scripts for analysis in Perl, Python or R
•Layer 4 – High level coding in C/C++/Java for implementing existing algorithms or modifying existing codes for new functionality
•Layer 5 – Thinking mathematically, developing own algorithms and implementing in C/C++/Java
-Manoj Samanta http://www.homolog.us/blogs/2011/07/22/a-beginners-guide-to-bioinformatics-part-i/
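As a rough idea of what a Layer 3 script can look like, here is a small Python sketch that reads FASTA-formatted text and reports the length and GC fraction of each sequence. The function names and the two example sequences are invented for illustration; a real analysis would more likely read a file and use an established library.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Made-up example input; a real script would open a FASTA file instead.
example = StringIO(">seq1 made-up example\nATGC\nGGCC\n>seq2\nATAT\n")
fasta_records = list(read_fasta(example))
for name, seq in fasta_records:
    print(name, len(seq), gc_fraction(seq))
```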
27. 09/05/13 K-INBRE Bioinformatics Core KSU
K-INBRE resources
Over the fall semester the Bioinformatics Core and Virginia Rider from Pittsburg State University will be hosting an undergraduate bioinformatics club.
Our first topic will be command-line BLAST. Students will get an account on Beocat (Kansas’ largest compute cluster).
http://bioinformaticsk-state-undergrad.blogspot.com
28. 09/05/13 K-INBRE Bioinformatics Core KSU
K-INBRE resources
K-INBRE hosts a journal club, Wednesdays at noon via Polycom, to discuss current bioinformatics tools.
http://bioinformaticsk-state.blogspot.com/
30. 09/05/13 K-INBRE Bioinformatics Core KSU
K-INBRE resources
K-INBRE and i5K have begun a GitHub script-sharing organization to archive and share scripts.
https://github.com/i5K-KINBRE-script-share
GitHub organization: i5K-KINBRE-script-share
Category of ‘omics’ tool: RNA-Seq annotation and comparison; genome annotation and comparison; genome and transcriptome assembly; read cleaning and format conversion
Lab or research group: KSU bioinfo lab; Olson lab
List and description of scripts: read me
31. 09/05/13 K-INBRE Bioinformatics Core KSU
K-INBRE resources
-Git has very well-developed version control built in http://git-scm.com/video/what-is-version-control
-Easy to search
-More advantages are reviewed in this quick introduction http://git-scm.com/video/quick-wins
-Provides continuity within labs (as students and postdocs rotate out)
-Increases collaboration and sharing of workflows within our community
-It is also a good way to distribute the code you describe in a publication.
-Git is widely used by beginners as well as by developers of technology and software in the omics community, including:
https://github.com/broadinstitute (The Broad Institute)
https://github.com/lh3 (Li H., developer of BWA etc.)
https://github.com/dzerbino (Daniel Zerbino, developer of Oases and Velvet)
https://github.com/PacificBiosciences