Presentation from Strata-Hadoop 2015 (http://strataconf.com/big-data-conference-ny-2015/public/schedule/speaker/197575) -- a brief introduction to genomics followed by an overview of approaches to bioinformatics coding using Spark. Pretty high-level.
22. A Tale of Three File Formats
BAM Files: Do You Read Me?
Compressed text files & custom index formats
User-defined attributes
Multi-record structure
23. “Not wishing to be outdone by Amazon, Sanger Institute develops drone delivery system for BAM files.”
28. Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model.
• Improvement over existing formats¹
• 20-22% for BAM
• ~95% for VCF
¹ compression % quoted from 1K Genomes
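To make the slide's point concrete, here is a minimal sketch of *why* a columnar format like Parquet, generated from an Avro-style record model, compresses genomic records well. All names here are illustrative, not ADAM's actual schema: the idea is just that pivoting row-oriented records into columns groups similar values together, which is what enables run-length and dictionary encoding.

```python
from dataclasses import dataclass

# Hypothetical record model, in the spirit of an Avro-defined schema
# for aligned reads; field names are illustrative only.
@dataclass
class AlignedRead:
    contig: str      # reference chromosome name
    start: int       # 0-based alignment position
    sequence: str    # read bases

def to_columns(reads):
    """Pivot row-oriented records into a column-oriented layout.

    Columnar storage groups similar values together (e.g. a long run
    of the same contig name), which is what lets a format like Parquet
    apply run-length and dictionary encoding per column.
    """
    return {
        "contig": [r.contig for r in reads],
        "start": [r.start for r in reads],
        "sequence": [r.sequence for r in reads],
    }

reads = [
    AlignedRead("chr1", 100, "ACGT"),
    AlignedRead("chr1", 104, "GGCA"),
    AlignedRead("chr2", 12, "TTAG"),
]
cols = to_columns(reads)
# The contig column is a run of repeated values -- trivially compressible.
print(cols["contig"])  # ['chr1', 'chr1', 'chr2']
```

In a real BAM file these records are stored row by row as compressed binary text, so the repetition across records is much harder for a general-purpose compressor to exploit.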
29. Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats
31. The Terrible Trouble with Existing Pipelines
Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
32. “I think you know what the problem is, just as well as I do.”
A single piece of a filtering stage for a somatic variant caller
“11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order
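The complaint on this slide is that the *method* ("an 11-base-pair window centered on a candidate mutation") should be expressible independently of any file format or sort order. A minimal sketch of the method itself, against an in-memory reference (toy data, illustrative names):

```python
def window_around(reference: str, pos: int, size: int = 11) -> str:
    """Return the `size`-base window centered on 0-based position `pos`,
    clipped at the ends of the reference sequence."""
    half = size // 2
    lo = max(0, pos - half)
    hi = min(len(reference), pos + half + 1)
    return reference[lo:hi]

ref = "ACGTACGTACGTACGTACGT"  # toy 20-base reference
print(window_around(ref, 10))  # 11 bases centered on position 10
print(window_around(ref, 1))   # clipped at the left edge
```

The production code being criticized expresses this same two-line idea in terms of a specific file layout and coordinate-sorted traversal, which is exactly the entanglement of method and implementation the talk returns to later.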
33. “Myths of Bioinformatics Software”
1. Somebody will build on your code.
2. You should have assembled a team to build your software.
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely.
6. Your “stable URL” can exist forever.
7. You should make your software “idiot proof”.
8. You used the right programming language for the task.
Lior Pachter
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
We Can Make Our Own Myths
I’m nervous, so I’ll be speaking fast.
Before we dive in, let me ask a couple of questions:
biologists?
Spark experts?
This entire presentation is a lie.
There are always at least three different constituencies in the room:
* biologists
* programmers
* someone thinking about how to build a business around this
I am going to try to split the difference, but I won’t be able to satisfy everyone. In all the places where I have to skip over the truth, maybe there will be at least a breadcrumb back to the truth.
This isn’t a technical talk.
Let’s talk about the title –
Next generations? I didn’t realize that there was a *first* generation!
Bioinformatics is a field with a long history, thirty or more years as a separate discipline.
At the same time, the fundamental technology is changing.
So if I talk about ‘problems’ today, it’s OK
[animation]
I come in peace! Bioinformatics software development has been *remarkably* effective, for decades.
If there are problems to be solved, these are the result of new technologies, new conceptions of scale.
So that’s “next generation,” but what about…
Genomics?
What even is genomics?
Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference?
So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
Fundamentally, we’re interested in studying individuals (and populations of individuals)
Each individual is *itself* a population: of cells
But each of those cells has, ideally, an identical genome.
The genome is a collection of 23 molecules, called chromosomes. These are ‘polymers’: they’re built (like Legos) out of a small number of repeated interlocking parts – these are the A, T, G, and C you’ve probably heard about.
The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
Now, not only do all the cells in your body have identical genomes…
[ANIMATE]
But individual humans have genomes that are very similar to each other.
So similar that I can define “the same” chromosome between individuals… and that means
[ANIMATE]
That we can define a ‘base’ or a ‘reference’ chromosome
[ANIMATE]
And a concept of ‘location’ across chromosomes. This is maybe the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This means that we can talk about differences between individuals in terms of diffs to a common reference genome.
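That "diffs against a common reference" idea is concrete enough to sketch in a few lines. This is a toy model with made-up data, not any real variant-calling format, but it shows why a shared linear coordinate system lets an individual genome be described as a short list of (chromosome, position, ref, alt) differences:

```python
# A variant as a diff against the reference: (chromosome, 0-based position,
# reference base, alternate base). Data and layout are illustrative.
reference = {"chr1": "ACGTACGT"}

def apply_variants(ref, variants):
    """Apply single-base substitutions to a reference genome."""
    seqs = {chrom: list(seq) for chrom, seq in ref.items()}
    for chrom, pos, ref_base, alt_base in variants:
        # Sanity check: the diff must agree with the reference it patches.
        assert seqs[chrom][pos] == ref_base, "variant does not match reference"
        seqs[chrom][pos] = alt_base
    return {chrom: "".join(seq) for chrom, seq in seqs.items()}

# One individual differs from the reference at chr1:3 (T -> G).
individual = apply_variants(reference, [("chr1", 3, "T", "G")])
print(individual["chr1"])  # "ACGGACGT"
```

Real formats like VCF encode essentially this: positions on named chromosomes plus reference and alternate alleles, rather than whole sequences.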
But where does this reference genome come from?
Here is Bill Clinton (with Craig Venter and Francis Collins), announcing in June of 2000 the “rough draft” of the Human Genome – this is the Human Genome Project.
1570: Theatrum Orbis Terrarum
“Theater of the world”
First modern atlas.
A direct byproduct of the first 100 years of PRINTING, and a tool for describing and exploring the world around us.
Its direct descendants are still with us, today!
Google maps!
But what does the genomic version of this look like?
Mapmakers today focus on *annotation* of the maps themselves.
The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes.
This is a Manhattan plot of Alzheimer’s-related genes and sequence markers.
Now let’s shift gears, and talk about how this was performed – through sequencers.
Sequencers are microscopes that read the genome.
If there’s one graph you should remember, in order to understand the last (and the next) ten years of bioinformatics and genomics, it’s this one
The Human Genome Project was thousands of researchers, billions of dollars, spent over a decade, all to sequence on-the-order-of half a dozen individuals.
Today, we’re close to the “thousand dollar genome” – and already we’re seeing prototype sequencers with the form factor of a USB stick.
So sequencers will drive everything before it – but sequencers are only ever half the story.
Bioinformatics is a computational reversal of the sequencing process.
[ANIMATE]
But to most
So… what’s in the box?
It’s a pipeline! (Makes sense, since I’m also name-checking Spark, right?)
It’s never *one* pipeline, we do this once for every person
Let me talk a little bit about the structure of one of these pipelines
Each step is typically written as a standalone program – passing files from stage to stage – often using something like unix pipes
These are written as part of a globally-distributed research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem
But of course, it’s never one pipeline
[ANIMATE]
It’s a pipeline per person
[ANIMATE]
But since each pipeline runs (essentially) serially, scaling up is easy: scale out!
[ANIMATE]
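The structure just described (standalone stages chained like unix pipes, one pipeline per person, scale-out by running pipelines independently) can be sketched in a few lines. The stage names are stand-ins, not real bioinformatics tools:

```python
# Toy per-sample pipeline: each stage is a pure function, chained the way
# standalone programs are chained with unix pipes. Stages are stand-ins.
def basecall(sample):
    return sample["raw"].upper()              # stand-in for base calling

def align(reads):
    return sorted(reads)                      # stand-in for alignment

def call_variants(aligned):
    return [b for b in aligned if b == "G"]   # stand-in for variant calling

def pipeline(sample):
    return call_variants(align(basecall(sample)))

samples = [{"raw": "gattaca"}, {"raw": "ggtacg"}]

# One pipeline per person. Each run is independent of every other run,
# so scaling up really is just scaling out: map over samples (e.g. with
# multiprocessing.Pool or one cluster job per sample).
results = [pipeline(s) for s in samples]
```

The independence of the runs is the whole point: as long as no stage needs data from another person's pipeline, adding people just means adding machines.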
That was the data side, but let’s open up the computation as well. Take one of those boxes, that I drew earlier. Here’s alignment, but it could be…
[ANIMATE]
any bioinformatics tool. I assert that there are *two* things going on inside any bioinformatics tool –
[ANIMATE]
There is the method, and there is the implementation of that method. I think this is an important distinction to make…
But even that is a lie, because there is a third thing…
[ANIMATE]
“Platform.” That’s why I’ve included this code snippet up above.
So what’s the problem? Faster sequencers means we sequence more people, but we have tools that work and a natural path to parallelism! Why does there need to be a “next generation?”
The answer, of course, is that when you have all that data, you want to *USE* all that data.
When you want to *use* all the data, now your entire system will start to show cracks.
This is an example, variant calling.
But [ANIMATE]
God help you if you want to combine statistical information at an earlier phase of the process.
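Here is the crack, in miniature. Per-sample pipelines leave the data keyed by *sample*; joint analysis needs it keyed by *genomic position*, with evidence from every sample at each locus. Re-keying the data that way is exactly a shuffle, which the one-pipeline-per-person design has no place for. A toy sketch with made-up calls:

```python
from collections import defaultdict

# Output of independent per-sample pipelines: data keyed by sample.
# (chromosome, 0-based position, observed base) -- toy data.
per_sample_calls = {
    "sample1": [("chr1", 100, "A"), ("chr1", 200, "G")],
    "sample2": [("chr1", 100, "T"), ("chr2", 50, "C")],
}

# Joint analysis wants the same data keyed by locus instead:
# gather every sample's evidence at each (chromosome, position).
by_locus = defaultdict(list)
for sample, calls in sorted(per_sample_calls.items()):
    for chrom, pos, base in calls:
        by_locus[(chrom, pos)].append((sample, base))

# Each locus now carries evidence from every sample that covers it.
print(by_locus[("chr1", 100)])  # [('sample1', 'A'), ('sample2', 'T')]
```

With files on a shared filesystem this re-keying means a painful merge of N sorted per-sample files; in a dataflow system it is a single group-by.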
But this is by no means a unique problem. And what is one solution? You might have guessed it from the title of my talk…
There’s more parallelism that we can extract from our pipelines.
Spark.
The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…