Intro to field_of_bioinformatics

09/05/13 K-INBRE Bioinformatics Core KSU
Bioinformatics
Introduction to the field of bioinformatics
Sept, 2013
Jennifer Shelton
K-INBRE Bioinformatics Core KSU

Outline
I. Basic concepts
i. Definition of bioinformatics
ii. Databases (flat-file and relational)
iii. Assembly (Overlap-layout-consensus)
II. Steps you can take on your own

Definition of bioinformatics
“Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”
-NIH Biomedical Information Science and Technology Initiative Consortium 2000
[Diagram: biological, medical, behavioral, or health data is acquired, stored/archived, organized, analyzed, and visualized]

Problem with volume
“We believe the field of bioinformatics for genetic analysis will be one of the biggest areas of disruptive innovation in life science tools over the next few years,”
-Isaac Ro, Goldman Sachs
Per year worldwide we can generate ~13,000,000,000,000,000 bp of data
Mark Smiciklas, Flickr.com/photos/intersectionconsulting
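
To get a feel for the size of ~13,000,000,000,000,000 bp, here is a rough back-of-the-envelope sketch. The human genome size (about 3.1 billion bp) and the 2-bits-per-base packing are assumptions added for illustration; they are not from the slides, and real sequencing files (with quality scores and read names) are considerably larger.

```python
# Rough scale of the yearly sequencing output quoted on the slide.
BASES_PER_YEAR = 13_000_000_000_000_000  # ~13 quadrillion bp, from the slide

# Assumptions (not from the slides):
HUMAN_GENOME_BP = 3_100_000_000  # approximate haploid human genome size
BITS_PER_BASE = 2                # A, C, G, T packed tightly, no quality scores

# How many human-genome-sized sequences would this amount to?
genome_equivalents = BASES_PER_YEAR // HUMAN_GENOME_BP

# How much disk would the bare bases need at 2 bits each?
packed_bytes = BASES_PER_YEAR * BITS_PER_BASE // 8
packed_petabytes = packed_bytes / 1e15

print(f"{genome_equivalents:,} human-genome equivalents per year")
print(f"{packed_petabytes:.2f} PB at {BITS_PER_BASE} bits per base")
```

Even under these optimistic assumptions the total is in the petabyte range, which is why storage and transfer become bottlenecks.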

"This unprecedented amount of
sequencing information poses bottlenecks that vary, depending on application, at the level of data extraction, analysis, and interpretation"
"These challenges have become part and parcel of the biomedical research community where investigators have increasingly needed to incorporate bioinformatics and biostatistics into their armamentarium."
-Opportunities and Challenges Associated with Clinical Diagnostic Genome Sequencing: A Report of the Association for Molecular Pathology. The Journal of Molecular Diagnostics, November 2012

“It sounds like an analog solution in a digital age,”
-Sifei He, head of cloud computing for BGI (referring to FedExing disks of data because internet connections are often too slow)
NY Times 2011 article: DNA Sequencing Caught in a Deluge of Data
http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?pagewanted=all&_r=0

Examples of bioinformatics tools

Flat-file databases
GenBank: http://www.ncbi.nlm.nih.gov/genbank/
‘records’: data about one unique object
‘fields’: the same kind of data about different objects

Flat-file databases
Any flat-file database, like GenBank, can be thought of as a single spreadsheet called a ‘table’ of ‘fields’ and ‘records’.
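
The ‘table’ idea can be sketched in a few lines of Python. The records below are hypothetical placeholders invented for illustration (they are not real GenBank entries), and the field names are assumptions as well. Each dictionary is one record; each key is one field.

```python
import csv
import io

# Hypothetical records for illustration only; not real GenBank entries.
FIELDS = ["accession", "organism", "length_bp"]
RECORDS = [
    {"accession": "EX000001", "organism": "Example species A", "length_bp": 1200},
    {"accession": "EX000002", "organism": "Example species B", "length_bp": 850},
    {"accession": "EX000003", "organism": "Example species A", "length_bp": 430},
]


def to_flat_file(records, fields):
    """Write the records as one tab-separated table: a header row, then one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def column(records, field):
    """Read one field (column) across every record (row)."""
    return [record[field] for record in records]


flat = to_flat_file(RECORDS, FIELDS)
print(flat)
print(column(RECORDS, "length_bp"))
```

Everything lives in a single table, so answering a question means scanning the rows of that one table.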

Relational databases
Relational databases have multiple tables, with some shared fields and some different ones.
‘fields’: the same kind of data about different objects
KEGG: http://www.genome.jp/kegg/pathway.html

Relational databases
Relational databases are like multiple tables that are linked by a shared field, which acts like a “key” between them.
KEGG PATHWAY: hsa05204 (www.genome.jp/dbget-bin/www_bget?pathway+hsa05204)
Organism: Homo sapiens (human) [GN:hsa]
Gene:
1543 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 (EC:1.14.14.1) [KO:K07408] [EC:1.14.14.1]
1576 CYP3A4; cytochrome P450, family 3, subfamily A, polypeptide 4 (EC:1.14.13.67 1.14.13.97 1.14.13.32) [KO:K07424] [EC:1.14.14.1]
(EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
(EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
64816 CYP3A43; cytochrome P450, family 3, subfamily A, polypeptide 43 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
5743 PTGS2; prostaglandin-endoperoxide synthase 2 (prostaglandin G/H synthase and cyclooxygenase) (EC:1.14.99.1) [KO:K11987] [EC:1.14.99.1]
10 NAT2; N-acetyltransferase 2 (arylamine N-acetyltransferase) (EC:2.3.1.5) [KO:K00622] [EC:2.3.1.5]
9 NAT1; N-acetyltransferase 1 (arylamine N-acetyltransferase) (EC:2.3.1.5) [KO:K00622] [EC:2.3.1.5]
(EC:1.14.14.1) [KO:K07409] [EC:1.14.14.1]
6799 SULT1A2; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 2 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
1545 CYP1B1; cytochrome P450, family 1, subfamily B, polypeptide 1 (EC:1.14.14.1) [KO:K07410] [EC:1.14.14.1]
1558 CYP2C8; cytochrome P450, family 2, subfamily C, polypeptide 8 (EC:1.14.14.1) [KO:K07413] [EC:1.14.14.1]
1562 CYP2C18; cytochrome P450, family 2, subfamily C, polypeptide 18 (EC:1.14.14.1) [KO:K07413] [EC:1.14.14.1]
1557 CYP2C19; cytochrome P450, family 2, subfamily C, polypeptide 19 (EC:1.14.13.48 1.14.13.49 1.14.13.80) [KO:K07413] [EC:1.14.14.1]
1559 CYP2C9; cytochrome P450, family 2, subfamily C, polypeptide 9 (EC:1.14.13.48 1.14.13.49 1.14.13.80) [KO:K07413] [EC:1.14.14.1]
2052 EPHX1; epoxide hydrolase 1, microsomal (xenobiotic)
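
A shared field acting as a “key” can be sketched with Python's built-in sqlite3 module. The two-table layout here (`gene` and `orthology`, linked by the `ko` field) is a hypothetical schema made up for illustration and is not how KEGG itself is organized; the gene IDs, symbols, KO numbers, and EC numbers are copied from the hsa05204 listing on the slide.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: two tables that share the `ko` field.
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, ko TEXT);
CREATE TABLE orthology (ko TEXT PRIMARY KEY, ec TEXT);
""")

# Values taken from the KEGG listing on the slide.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
    (1543, "CYP1A1", "K07408"),
    (1545, "CYP1B1", "K07410"),
    (10, "NAT2", "K00622"),
    (9, "NAT1", "K00622"),
])
conn.executemany("INSERT INTO orthology VALUES (?, ?)", [
    ("K07408", "1.14.14.1"),
    ("K07410", "1.14.14.1"),
    ("K00622", "2.3.1.5"),
])

# The shared `ko` field is the key that links the two tables.
rows = conn.execute(
    """
    SELECT gene.symbol, orthology.ec
    FROM gene JOIN orthology ON gene.ko = orthology.ko
    WHERE orthology.ec = ?
    ORDER BY gene.symbol
    """,
    ("2.3.1.5",),
).fetchall()
print(rows)
```

The EC number is stored once per KO entry rather than once per gene, and the join recovers it for every gene that shares that key.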

Assembly
Of the ~13,000,000,000,000,000 bp of sequence data we can generate each year, most reads do not span the full length of the DNA or RNA molecule.
Instead, scientists get back multiple copies of their genome (or transcriptome), but all in short segments (between 50 bp and several kb).
Steps of Overlap-Layout-Consensus (OLC):
1) Let’s think of a genome like the text of a book. We get back multiple copies of the book.

OLC Assembly
1) Instead of being nicely bound, we get randomly shredded text all
mixed together from our multiple copies
ice was beginning to get very tired of
sitting by her tister on the bank, and of
having nothing to do
Alice was
beginning to get vory tired of sitting by her sister on
the bank, and of having nothing to do: once
lice was beginning to get
very tired of sitting by her sister on the bank, and
of having nothing

OLC Assembly
2) We look for lines that overlap for more than some minimum number of letters (in these programs all overlaps are found, then a single “path” is found through this “graph” of overlaps)
[Figure: the shredded lines stacked by their overlaps, spanning “Alice was” to “of having nothing”]
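
The overlap and layout steps can be sketched as follows. This is a deliberately simplified illustration, not how any particular assembler works: it assumes error-free fragments, exact suffix-to-prefix matches, and a single unbranched path. The three fragments are hypothetical pieces cut from the same sentence, and the function names are invented for this sketch.

```python
def overlap(left, right, min_len):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when the best exact overlap is shorter than `min_len`.
    """
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def overlap_graph(reads, min_len):
    """Find all pairwise overlaps: {(i, j): overlap length} for read i followed by read j."""
    graph = {}
    for i, left in enumerate(reads):
        for j, right in enumerate(reads):
            if i != j:
                size = overlap(left, right, min_len)
                if size:
                    graph[(i, j)] = size
    return graph


def layout(reads, graph):
    """Follow one path through the graph, starting at a read nothing overlaps into."""
    has_incoming = {j for (_, j) in graph}
    current = next(i for i in range(len(reads)) if i not in has_incoming)
    visited = {current}
    text = reads[current]
    while True:
        options = [(size, j) for (i, j), size in graph.items()
                   if i == current and j not in visited]
        if not options:
            return text
        size, current = max(options)  # take the longest overlap
        visited.add(current)
        text += reads[current][size:]  # append only the non-overlapping tail


# Hypothetical error-free fragments, deliberately out of order.
reads = [
    "by her sister on the bank, and of having nothing to do",
    "Alice was beginning to get very",
    "to get very tired of sitting by her sister",
]
graph = overlap_graph(reads, min_len=5)
assembled = layout(reads, graph)
print(graph)
print(assembled)
```

Real reads contain errors, so assemblers allow inexact overlaps and then resolve the disagreements in the consensus step.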

OLC Assembly
3) We move column by column, counting the letters in each column and making a note of the most common letter (taking the consensus)
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do
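
The column-by-column vote can be sketched like this. The three reads are the shredded fragments from the earlier slide, typos (“vory”, “tister”) included; joining their wrapped lines into three reads and the starting offsets are assumptions about how they line up, and the layout step is taken as already done.

```python
from collections import Counter


def consensus(placed_reads):
    """Take the most common letter in each column of the laid-out reads.

    `placed_reads` is a list of (offset, text) pairs, where `offset` is the
    column at which the read starts.
    """
    width = max(offset + len(text) for offset, text in placed_reads)
    letters = []
    for col in range(width):
        votes = Counter(
            text[col - offset]
            for offset, text in placed_reads
            if offset <= col < offset + len(text)
        )
        if votes:  # skip columns that no read covers
            letters.append(votes.most_common(1)[0][0])
    return "".join(letters)


# Offsets are an assumption: each read is shifted so matching letters share a column.
placed = [
    (2, "ice was beginning to get very tired of sitting by her tister on the bank, and of having nothing to do"),
    (0, "Alice was beginning to get vory tired of sitting by her sister on the bank, and of having nothing to do: once"),
    (1, "lice was beginning to get very tired of sitting by her sister on the bank, and of having nothing"),
]
result = consensus(placed)
print(result)
```

Each typo appears in only one of the three reads, so it is outvoted in its column. Columns covered by a single read (such as the trailing “: once”) are kept as they are, since there is nothing to vote against.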

Assemblies
[Figure: relative reflectance of EWC vs. wavelength (nm), 400 to 800 nm; panels: Big bluestem (removed), Big bluestem (intact), Sand bluestem (removed), Sand bluestem (intact)]
[Images: Big bluestem, Sand bluestem, Bittersweet, Balsam, Red flour beetle. Credits: Bischof B.; Day E.; homenursery.com; gardeninginsomnia.com]
[Charts: “Sand bluestem assembly length and number of contigs”, “Balsam assembly length and number of contigs”, “Bittersweet assembly length and number of contigs”: cumulative length of sequences (Mb) and number of sequences (x 10^5) by assembly k-mer value or name]
[Charts: “Sand bluestem N values”, “Balsam N values”, “Bittersweet N values”: contig length (kb) at N75, N50, and N25 by assembly k-mer value or name]

Table 1 (species not labeled)
k-mer          N75 (kb)   N50 (kb)   N25 (kb)   Cumulative length of sequences (Mb)   Number of sequences x 10^5
27             1.219      2.028      3.126      142.633358                            1.28113
37             1.206      2.008      3.087      128.100083                            1.1091
47             1.195      1.977      3.051      113.176134                            0.93839
57             1.271      2.035      3.096      102.507455                            0.82755
merge          1.41       2.211      3.331      345.752982                            2.31102
CDH cluster    1.44       2.27       3.422      84.202533                             0.59174
MIRA cluster   1.804      2.69       3.941      105.920843                            0.50279

Table 2 (species not labeled)
k-mer          N75 (kb)   N50 (kb)   N25 (kb)   Cumulative length of sequences (Mb)   Number of sequences x 10^5
27             1.213      2.11       3.221      175.505163                            1.61952
37             1.176      2.026      3.068      154.222168                            1.36947
47             1.168      1.948      2.932      129.331497                            1.07545
57             1.218      1.974      2.95       111.672465                            0.90385
merge          1.404      2.23       3.299      418.762352                            2.77833
CDH cluster    1.399      2.274      3.339      96.411479                             0.70852
MIRA cluster   1.825      2.676      3.856      123.666263                            0.59598
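
The slides report N75, N50, and N25 without defining them. Under the usual definition, N50 is the contig length such that contigs of that length or longer contain at least half of the total assembled bases; N25 and N75 use 25% and 75% instead. A minimal sketch follows; the contig lengths are made-up numbers for illustration, not data from these assemblies, and the function name is invented.

```python
def n_value(contig_lengths, fraction=0.5):
    """Return the Nxx statistic: walk contigs from longest to shortest and
    report the length at which the running total reaches `fraction` of all bases."""
    ordered = sorted(contig_lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0  # only reached for an empty list


# Hypothetical contig lengths in kb (24 kb in total).
lengths_kb = [8, 6, 4, 3, 2, 1]
n25 = n_value(lengths_kb, 0.25)
n50 = n_value(lengths_kb, 0.50)
n75 = n_value(lengths_kb, 0.75)
print(n25, n50, n75)
```

Because the walk starts from the longest contig, N25 is always at least as large as N50, which is at least as large as N75, matching the ordering in the tables above.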

What can you do to get prepared?
If you are interested in studying bioinformatics, here is an outline of increasingly complex levels of skills you might work towards:
•Layer 1 – Using web to analyze biological data
•Layer 2 – Ability to install and run new programs
•Layer 3 – Writing own scripts for analysis in Perl, Python or R
•Layer 4 – High level coding in C/C++/Java for implementing existing algorithms or modifying existing codes for new functionality
•Layer 5 – Thinking mathematically, developing own algorithms and implementing in C/C++/Java
-Manoj Samanta http://www.homolog.us/blogs/2011/07/22/a-beginners-guide-to-bioinformatics-part-i/

K-INBRE resources
Over the fall semester the Bioinformatics Core and Virginia Rider from Pittsburg State University will be hosting an undergraduate bioinformatics club.
Our first topic will be command-line BLAST. Students will get an account on Beocat (Kansas’ largest compute cluster).
http://bioinformaticsk-state-undergrad.blogspot.com

K-INBRE resources
K-INBRE hosts a journal club, Wednesday at noon, via PolyCom
to discuss current bioinformatics tools.
http://bioinformaticsk-state.blogspot.com/

K-INBRE resources
Bradley Olson and K-INBRE – Perl
Justin Blumenstiel et al. – Python
http://bioinformaticskstateperl.blogspot.com/

K-INBRE resources
K-INBRE and i5K have begun a GitHub script-sharing organization to archive and share scripts.
https://github.com/i5K-KINBRE-script-share
[Diagram: levels of the i5K-KINBRE-script-share organization: GitHub organization; category of ‘omics’ tool (RNA-Seq annotation and comparison; genome annotation and comparison; genome and transcriptome assembly; read cleaning and format conversion); lab or research group (KSU bioinfo lab, Olson lab); read me with a list and description of scripts]

K-INBRE resources
-Git has very well developed version control built in: http://git-scm.com/video/what-is-version-control
-Easy to search
-More advantages are reviewed in this quick introduction: http://git-scm.com/video/quick-wins
-Provides continuity within labs (as students and postdocs rotate out)
-Increases collaboration and sharing of workflows within our community
-It is also a good way to distribute the code you describe in a publication.
-Git is widely used by beginners as well as by developers of technology and software in the omics community, including:
https://github.com/broadinstitute (The Broad Institute)
https://github.com/lh3 (Li H., developer of BWA etc.)
https://github.com/dzerbino (Daniel Zerbino, developer of Oases and Velvet)
https://github.com/PacificBiosciences

Questions?
Contact information: sheltonj@ksu.edu
K-INBRE Bioinformatics Core:
http://www.kumc.edu/kinbre/bioinformatics.html
http://bioinformatics.k-state.edu/
