Before we dive in, let me ask a couple of questions:
Biologists?
Spark experts?
There are always at least three different constituencies in the room:
* biologists
* programmers
* someone thinking about how to build a business around this
Gonna tell you a lot of lies today.
Won't satisfy everyone. Where I skip over the truth, maybe there will be at least a breadcrumb of truth left over.
This will not be a very technical talk.
Scared/pissed off some bio people in the past.
Bioinformatics is a field with a long history, thirty or more years as a separate discipline.
At the same time, the fundamental technology is changing.
So if I talk about "problems of bioinformatics" today, it's OK, because…
WE COME IN PEACE!
Bioinformatics software development has been *remarkably* effective, for decades.
If there are problems to be solved, these are the result of new technologies, new ambitions of scale.
What even is genomics?
Who here has heard the terms "chromosome" and "gene" before, and could explain the difference?
So before we dive into the main part of the talk, I'm going to spend a few minutes discussing some of the basic biological concepts.
Fundamentally, we're interested in studying individuals (and populations of individuals)
[ADVANCE]
But each individual is actually a population: of cells
[ADVANCE]
But each of those cells has, ideally, an identical genome.
The genome is a collection of 23 linear molecules. These are called "polymers": they're built (like Legos) out of a small number of repeated interlocking parts, the A, T, G, and C you've probably heard about.
The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
Without losing much, assume that our genomes are contained on just a single chromosome.
Now, not only do all the cells in your body have identical genomes…
[ADVANCE]
But individual humans have genomes that are very similar to each other.
So similar that I can define "the same" chromosome between individuals… and that means…
[ADVANCE]
That we can define a "base" or a "reference" chromosome.
Now that there is a reference that all of us adhere to…
[ADVANCE]
We can define a concept of "location" across chromosomes.
This is possibly the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system.
This also means that we can talk about differences between individuals in terms of diffs to a common reference genome.
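To make the "diff" idea concrete, here's a minimal sketch (hypothetical names and types, not any real library's schema):

```scala
// Once everyone shares the reference's coordinate system, an individual's
// genome can be stored as a list of diffs against the reference.
case class Variant(
  contig: String,    // which reference chromosome, e.g. "chr1"
  position: Long,    // offset into that chromosome
  reference: String, // the base(s) the reference has at this position
  alternate: String  // the base(s) this individual has instead
)

// "This person has a G where the reference has an A, at position
// 1,234,567 on chromosome 1."
val snp = Variant("chr1", 1234567L, "A", "G")
```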
But where does this reference genome come from?
Here is Bill Clinton (and Craig Venter and Francis Collins), announcing in June of 2000 the "rough draft" of the Human Genome: this is the Human Genome Project.
Took >10 years and $2 billion
What did this actually do?
An ASCII text file with a linear sequence of 3 billion ACGTs
This is the reference. Now go cure cancer.
If this looks uninterpretable, it is!
Anyone recognize this?
Want to make an analogy.
Difficult to understand. How do I make it more comprehensible?
Mapmakers work to add ANNOTATIONS to the map.
Annotations are keyed by geo coordinates: points, lines, and polygons in 2D space.
And often, it's only the annotations that are interesting, so mapmakers focus on *annotation* of the maps themselves.
The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes.
This is what we want to do for the genome.
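To sketch the analogy (hypothetical types): where geospatial libraries work with points, lines, and polygons in 2D, the genomic equivalent is intervals on a 1D coordinate system, and the core "geometric" operation is interval overlap.

```scala
// An annotation's "shape" on the genome is just an interval.
case class Region(contig: String, start: Long, end: Long) {
  // Half-open interval overlap: the 1D analogue of polygon intersection.
  def overlaps(other: Region): Boolean =
    contig == other.contig && start < other.end && other.start < end
}
```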
What does the annotated map of the genome look like?
Chromosome on top. Highlighted red portion is what weâre zoomed in on.
See the scale: total of about 600,000 bases (ACGTs) arranged from left to right.
Multiple annotation âtracksâ are overlaid on the genome sequence, marking functional elements, positions of observed human differences, similarity to other animals.
In part it's the product of numerous additional large biology annotation projects (e.g., the HapMap project, 1000 Genomes, ENCODE).
Lots of bioinformatics is computing these elements, or evaluating models on top of them.
How are these annotations actually generated? Shift gears and talk about the technology.
DNA SEQUENCING
If satellites provide images of the world for cartography, sequencers are the microscopes that give you "images" of the genome.
Over the past decade, massive EXPONENTIAL increase in throughput (much faster than Moore's law).
Get sample
Extract DNA (possibly other manipulations)
Dump into sequencer
Spits out text file (actually looks just like that)
But how to get from the text file to an annotation track that reconstructs a genome or shows position of certain functional elements?
[ADVANCE]
Bioinformatics is the computational process to reconstruct the genomic information. But…
[ADVANCE]
Often considered simply a black box.
What does it actually look like inside?
Pipelines, of course.
Example pipeline: raw sequencing data => a single individual's "diff" from the reference.
How are these typically structured?
Each step is typically written as a standalone program, passing files from stage to stage.
These are written as part of a globally distributed research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem. This has important implications for scalability.
What does one of these files look like?
* Text is highly inefficient
  * Compresses poorly
  * Values must be parsed
* Text is semi-structured
  * Flexible schemas make parsing difficult
  * Difficult to make assumptions about the data structure
* Text poorly separates the roles of delimiters and data
  * Requires escaping of control characters
  * (ASCII actually includes RS 0x1E and FS 0x1F, but they're never used)
Imposes a severe constraint: a global sort invariant. Many implementations depend on this, even though it's neither necessary nor conducive to distributed computing.
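A toy illustration of the parsing tax (a simplified record, not the real SAM field layout):

```scala
// Every numeric field in a tab-delimited text record has to be re-parsed
// from characters in every tool, in every pass over the data.
val line   = "read42\tchr1\t1234567\tACGTACGT"
val fields = line.split('\t')
val pos    = fields(2).toLong // paid per record; a binary format pays once
```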
Bioinformaticians LOVE hand-coded file formats.
But these formats only store a few fundamental data types.
Strong assumptions are baked into the formats, with inconsistent implementations across multiple languages.
They don't allow different storage backends.
OK, we've discussed the data and files that get passed around. What about the computation itself?
Let's take one of the transformations in the pipeline. Basically a more complex version of a DISTINCT operation.
Actual code from the standard Picard implementation of MarkDuplicates.
Two things to look at here:
* the overall algorithm/method
* the actual code implementation
Start by building some data structures from the input files.
Then iterate over the file and rewrite it as necessary.
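Stripped of all the file handling, the kernel is roughly "keep the best copy per mapping position." A plain-Scala sketch of that idea (simplified; the real MarkDuplicates also handles read pairs, orientation, and clipping):

```scala
// Hypothetical read record, for illustration only.
case class Read(name: String, contig: String, start: Long, score: Int)

// Group reads that map to the same place, keep the highest-scoring one,
// and flag the rest as duplicates.
def markDuplicates(reads: Seq[Read]): Seq[(Read, Boolean)] =
  reads
    .groupBy(r => (r.contig, r.start))
    .values
    .flatMap { group =>
      val best = group.maxBy(_.score)
      group.map(r => (r, r != best))
    }
    .toSeq
```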
But what if we jump into one of these functions? You'll find a dependence on…
[ADVANCE]
An input option related to Unix file handle limits?
WTF?
Why should this METHOD need to know anything about the platform it's running on? LEAKY ABSTRACTIONS!
Most bioinformatics tools make strong assumptions about their environments, and also about the structure of the data (e.g., the global sort), when it shouldn't be necessary.
OK, but that's not all…
[ADVANCE]
We've looked at the data and a bit of code for one of these tools. But this runs the pipeline on a single individual.
But of course, it's never one pipeline…
[ADVANCE]
It's a pipeline per person!
But since each pipeline runs (essentially) serially, scaling it up is easy…
[ADVANCE]
Scale out!
Typically managed with a pretty low-level job scheduler.
MANUAL split and merge
MANUAL resource request
BABYSIT for failures/errors
CUSTOM intermediate ser/de
But this basically works and the parallelism is pretty simple. This architecture has kept up with the pace of sequencing for some time now.
Pipelines. Managed by job schedulers. Passing files around.
SO WHY AM I EVEN UP HERE TALKING? Two reasons…
SCALE!
New levels of ambition for large biology projects.
100k genomes at Genomics England, in collaboration with the National Health Service.
Raw data for a single individual can be in the hundreds of GB
But even before we hit that huge scale (which is soon)…
For the latest algorithms, we don't want to analyze each sample separately. We want to use ALL THE DATA we generate.
Well, these pipelines often include lots of aggregation, so perhaps we can just…
[ADVANCE]
Do the easy thing! Not ideal, especially as the amount of data goes up (data transfer) and the number of files increases (we saw the file handle limits). We may start hitting the cracks.
But even worse…
[ADVANCE]
God help you if you want to jointly use all the data in an earlier part of the pipeline.
2 Problems:
Large scale
Using all data simultaneously
How do we solve these problems?
Things like the global sort order are overly restrictive, and lead to algorithms relying on it when it's not necessary.
A lot of the problems go away with a tool like Spark.
Example of an algo. Bioinformatics loves evaluating probabilistic models on the genome annotations.
We can easily extract parallelism at different parts of our pipelines.
With an expressive high-level language, we can describe the computation concisely.
Use higher-level distributed computing primitives and let the system figure out all the platform issues for you: storage, job scheduling, fault tolerance, shuffles, serde.
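For instance, a per-position aggregate like read depth collapses to a couple of Spark primitives. A minimal sketch, assuming a hypothetical record type:

```scala
import org.apache.spark.rdd.RDD

case class AlignedRead(contig: String, start: Long, end: Long)

// Read depth at every covered position, as a single shuffle. Partitioning,
// scheduling, serialization, and fault tolerance are Spark's problem now.
def coverage(reads: RDD[AlignedRead]): RDD[((String, Long), Int)] =
  reads
    .flatMap(r => (r.start until r.end).map(pos => ((r.contig, pos), 1)))
    .reduceByKey(_ + _)
```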
Layered abstractions.
Use multiple storage engines with different characteristics. Multiple execution engines.
Avro ties it all together.
Application code/algos should only touch the top of the abstraction layer.
Cheap scalable STORAGE at bottom
Resource management middle
EXECUTION engines that can run your code on the cluster and provide parallelism
Consistent SERIALIZATION framework
Scientist should NOT WORRY about lower levels (coordination, file formats, storage details, fault tolerance)
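The flavor of that top layer, as a Scala sketch (hypothetical fields; ADAM's actual record types are generated from Avro schemas, which is what lets every engine see the same data):

```scala
// One schema definition shared by every stage of the pipeline, and
// serializable to a columnar store like Parquet instead of a text file.
case class AlignmentRecord(
  readName: String,
  contig: String,
  start: Long,
  mappingQuality: Int,
  sequence: String,
  qualityScores: String
)
```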
We've implemented this vision with Spark, starting from the AMPLab (same people that gave you Spark), in a project called
ADAM
The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…
In addition to some of the standard pipeline transformations, we implemented the core spatial join operations (analogous to a geospatial library).
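A minimal sketch of the 1D "spatial join" idea (not ADAM's actual API), reusing the Region class from the map analogy:

```scala
import org.apache.spark.rdd.RDD

// Pair each left record with every right record whose interval overlaps it.
// The cartesian product keeps the sketch short; a real implementation
// partitions by genomic bins so only nearby regions are ever compared.
def regionJoin[A, B](left: RDD[(Region, A)],
                     right: RDD[(Region, B)]): RDD[(A, B)] =
  left.cartesian(right)
    .filter { case ((lr, _), (rr, _)) => lr.overlaps(rr) }
    .map { case ((_, a), (_, b)) => (a, b) }
```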
Another computation for a statistical aggregate on genome variant data. Details not important.
Spark data flow:
Distributed data load
High level joins/spatial computations that are parallelized as necessary.
But the really nice thing is that, because our data is stored using the Avro data model…
[ADVANCE]
You can execute the exact same computation using, for example, SQL!
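For example, sketched against Spark's SQL interface (the path and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("genomics-sql").getOrCreate()

// The same Avro-modeled records, stored as Parquet, queried relationally:
// no custom parser, no hand-coded file format.
val reads = spark.read.parquet("/data/alignments.parquet")
reads.createOrReplaceTempView("reads")
spark.sql("SELECT contig, COUNT(*) AS depth FROM reads GROUP BY contig").show()
```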
Pick the best tool for the job.
Single-node performance improvements.
Free scalability: fixed price, significant wall-clock improvements
See most recent SIGMOD.
Controversial, and I disagree with many of them.
#8 is similar to assuming a primitive lowest common denominator.
Especially for the last "myth": achieving the ambition that people are proposing will require moving beyond "anything is OK" to making some important technical decisions.
Not to be outdone, Craig Venter proposes 1 million genomes at Human Longevity Inc.
Cloudera is hiring.
Including the data science team.