38. Quality is the key
Need high fidelity between predicted and observed values
50 bytes per base
20 bytes per base
2 bytes per base
3 bits per base
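The per-base figures above translate directly into per-genome storage. A quick sketch in Python, assuming a 3-gigabase haploid human genome at 1x coverage; the mapping of figures to representations (images, traces, FASTQ, packed calls) is my assumption, not stated on the slide:

```python
# Per-genome storage at each per-base cost from the slide.
# Genome size (3 Gb) and the format labels are assumptions for illustration.
GENOME_BASES = 3e9

representations = {
    "images (50 B/base)":        50 * 8,  # bits per base
    "traces (20 B/base)":        20 * 8,
    "FASTQ (2 B/base)":           2 * 8,
    "packed calls (3 bits/base)": 3,
}

for name, bits in representations.items():
    gigabytes = GENOME_BASES * bits / 8 / 1e9
    print(f"{name:>28}: {gigabytes:8.2f} GB")
```

The spread is roughly two orders of magnitude between images and packed base calls, which is the whole storage argument in one loop.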
<ddooling@wustl.edu>
48. The goal of this project is to provide a system
for storing and retrieving huge amounts of
data, distributed among a large number of
heterogeneous server nodes, under a single
virtual filesystem tree with a variety of
standard access methods.
49. Write-only databases
Search limited to sequence and
values of specific XML entities
submitted as metadata
54. The Cathedral and the Bazaar
Linux overturned much of what I thought I
knew. I had been preaching the Unix gospel of
small tools, rapid prototyping and evolutionary
programming for years. But I also believed
there was a certain critical complexity above
which a more centralized, a priori approach was
required. I believed that the most important
software (operating systems and really large
tools like the Emacs programming editor)
needed to be built like cathedrals, carefully
crafted by individual wizards or small bands of
mages working in splendid isolation, with no
beta to be released before its time.
61. The Human Reference
[Figure: panels (a)-(c), repeat domains in the human reference genome.]
D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in
large genomes. Genome Biology 2006, 7:R7
What are the challenges that the large genome centers are currently facing that the typical researcher will be facing soon?
Do not store images
Do not store SRF
Keep FASTQ
This acceleration breaks everything
3.4*125/75*35 ≈ 198.33
We need to stop having to deal with images
It should be transparent to the end user
LHC http://atlasexperiment.org/
(90*2+90/125*50)*35 = 7560
Uncompressed
For a 75 bp read, you need 200 bytes; 25% of that is headers
Save 12.5% by simply not replicating the sequence header
8*90/12*35 = 2100
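The arithmetic scattered through these notes checks out. A quick sketch, with the expressions copied verbatim (their units are not stated on the slides), plus the FASTQ header-overhead claim:

```python
# Back-of-the-envelope checks of the figures in these notes.
# The expressions are taken as given; units are not stated on the slides.
print(3.4 * 125 / 75 * 35)            # ~198.33
print((90 * 2 + 90 / 125 * 50) * 35)  # 7560.0
print(8 * 90 / 12 * 35)               # 2100.0

# FASTQ overhead for a 75 bp read: 200 bytes total, 25% of it headers.
# The header appears twice (on the '@' line and again after '+').
read_bytes, header_frac = 200, 0.25
one_header = read_bytes * header_frac / 2
print(one_header / read_bytes)        # 0.125 -> save 12.5% by not repeating it
```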
Cost of software
The chain is only as strong as its weakest link.
Images: Assembly line backing up? Keystone cops piling up? Stooges?
Transition: a situation not unlike that faced by PC manufacturers over the past decade
This analogy works on another level as well...
Intel convinced everyone that the speed of the computer was equal to the clock speed of the processor
Many people believed this
Even when using a 56k modem
Even when the AMD Opteron came out
Even when Intel went to multi-core and lower clock speeds
A cautionary tale for those joining the Gb race
Which wraps up the scale up...
... and leads us into quality
Make the best small engine in the world
Made high-quality cars for years
Recognized after years of consistent performance
Now enjoy premium cost and high resale value
Everyone I know has a Honda Odyssey
Money from the T-bird allowed them to design, develop, and introduce the...
It’s gotten better
Google image search second or third result
Draw your own conclusions
This distrust of base calls and quality values has reinforced the cult of traces
This does not scale for human resources, disk space, etc.
This leads to a very bad situation for those of us responsible for the computing, storage, and network infrastructure
Quality is at the core of all other issues, storage, compute, throughput, etc.
If it’s a bad base, call it a bad base
Don’t forget the GHz race
Reducing data to base calls and quality values does reduce its value
Especially for data not natively in “base space”
Is there a richness in this data that is lost?
But you gain by not having to maintain custom toolchains for each native data type
2 bits/base is absolute minimum
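A minimal sketch of what 2 bits/base means in practice, assuming a pure-ACGT sequence; real encodings also need an escape for N and ambiguity codes, which is why 3 bits/base appears elsewhere in these notes:

```python
# Pack DNA bases at 2 bits each: four bases per byte.
# Assumes the sequence contains only A, C, G, T (no N, no ambiguity codes).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq: str) -> bytes:
    out = bytearray((len(seq) + 3) // 4)
    for i, b in enumerate(seq):
        out[i // 4] |= CODE[b] << (2 * (i % 4))
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    return "".join(BASE[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))

seq = "GATTACA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(seq), "bases ->", len(packed), "bytes")  # 7 bases -> 2 bytes
```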
Grid
No one ever feels lucky
They have learned their lesson by creating an incredible amount of XML to submit
Study, Sample, Experiment, Run
He may know a lot about software, but he does not know anything about building cathedrals
Currently, revisions are tightly controlled by central repositories, NCBI, UCSC, EBI
Push and pull around diffs
Balance curation with rapid advances
Debian web of trust
How far will FASTA get you?
C. elegans - part of genome repeat structure
http://genomebiology.com/2006/7/1/R7
Can you use the current de Bruijn graph assembly engines for alignment?
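To make that closing question concrete, here is a toy de Bruijn graph builder: nodes are (k-1)-mers, edges come from k-mers observed in the reads. This is only the core data structure an assembly engine is built on, not an assembler or aligner; real engines add error correction, bubble popping, and much more.

```python
# Minimal de Bruijn graph from reads: an adjacency list keyed on
# (k-1)-mers, with one edge per k-mer occurrence.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

g = de_bruijn(["ACGTACG"], 4)
print(dict(g))
```

Walking a query sequence k-mer by k-mer through such a graph is one intuition for using an assembly engine as an aligner, which is the question the notes raise.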