Presentation from Strata-Hadoop 2015 (http://strataconf.com/big-data-conference-ny-2015/public/schedule/speaker/197575) -- a brief introduction to genomics followed by an overview of approaches to bioinformatics coding using Spark. Pretty high-level.
22. A Tale of Three File Formats
BAM Files: Do You Read Me?
Compressed text files & custom index formats
User-defined attributes
Multi-record structure
23. “Not wishing to be outdone by Amazon, Sanger Institute develops drone delivery system for BAM files.”
28. Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model.
• Improvement over existing formats¹
• 20-22% for BAM
• ~95% for VCF
¹ compression % quoted from 1K Genomes
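To make the slide's point concrete, here is a minimal sketch of *why* a columnar format like Parquet, generated from an Avro-style record model, compresses genomic records well. All names here are illustrative, not ADAM's actual schema: the idea is just that pivoting row-oriented records into columns groups similar values together, which is what enables run-length and dictionary encoding.

```python
from dataclasses import dataclass

# Hypothetical record model, in the spirit of an Avro-defined schema
# for aligned reads; field names are illustrative only.
@dataclass
class AlignedRead:
    contig: str      # reference chromosome name
    start: int       # 0-based alignment position
    sequence: str    # read bases

def to_columns(reads):
    """Pivot row-oriented records into a column-oriented layout.

    Columnar storage groups similar values together (e.g. a long run
    of the same contig name), which is what lets a format like Parquet
    apply run-length and dictionary encoding per column.
    """
    return {
        "contig": [r.contig for r in reads],
        "start": [r.start for r in reads],
        "sequence": [r.sequence for r in reads],
    }

reads = [
    AlignedRead("chr1", 100, "ACGT"),
    AlignedRead("chr1", 104, "GGCA"),
    AlignedRead("chr2", 12, "TTAG"),
]
cols = to_columns(reads)
# The contig column is a run of repeated values -- trivially compressible.
print(cols["contig"])  # ['chr1', 'chr1', 'chr2']
```

In a real BAM file these records are stored row by row as compressed binary text, so the repetition across records is much harder for a general-purpose compressor to exploit.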
29. Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats
31. The Terrible Trouble with Existing Pipelines
Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
32. “I think you know what the problem is, just as well as I do.”
A single piece of a filtering stage for a somatic variant caller
“11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order
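The complaint on this slide is that the *method* ("an 11-base-pair window centered on a candidate mutation") should be expressible independently of any file format or sort order. A minimal sketch of the method itself, against an in-memory reference (toy data, illustrative names):

```python
def window_around(reference: str, pos: int, size: int = 11) -> str:
    """Return the `size`-base window centered on 0-based position `pos`,
    clipped at the ends of the reference sequence."""
    half = size // 2
    lo = max(0, pos - half)
    hi = min(len(reference), pos + half + 1)
    return reference[lo:hi]

ref = "ACGTACGTACGTACGTACGT"  # toy 20-base reference
print(window_around(ref, 10))  # 11 bases centered on position 10
print(window_around(ref, 1))   # clipped at the left edge
```

The production code being criticized expresses this same two-line idea in terms of a specific file layout and coordinate-sorted traversal, which is exactly the entanglement of method and implementation the talk returns to later.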
33. “Myths of Bioinformatics Software”
1. Somebody will build on your code.
2. You should have assembled a team to build your software.
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely.
6. Your “stable URL” can exist forever.
7. You should make your software “idiot proof”.
8. You used the right programming language for the task.
Lior Pachter
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
We Can Make Our Own Myths
I’m nervous, so I’ll be speaking fast.
Before we dive in, let me ask a couple of questions:
biologists?
Spark experts?
This entire presentation is a lie.
There are always at least three different constituencies in the room:
* biologists
* programmers
* someone thinking about how to build a business around this
I am going to try to split the difference, but I won’t be able to satisfy everyone. In all the places where I have to skip over the truth, maybe there will be at least a breadcrumb back to the truth.
This isn’t a technical talk.
Let’s talk about the title –
Next generations? I didn’t realize that there was a *first* generation!
Bioinformatics is a field with a long history, thirty or more years as a separate discipline.
At the same time, the fundamental technology is changing.
So if I talk about ‘problems’ today, it’s OK
[animation]
I come in peace! Bioinformatics software development has been *remarkably* effective, for decades.
If there are problems to be solved, these are the result of new technologies, new conceptions of scale.
So that’s “next generation,” but what about…
Genomics?
What even is genomics?
Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference?
So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
Fundamentally, we’re interested in studying individuals (and populations of individuals)
Each individual is *itself* a population: of cells
But each of those cells has, ideally, an identical genome.
The genome is a collection of 23 molecules, called chromosomes. These are ‘polymers’: they’re built (like Legos) out of a small number of repeated interlocking parts – these are the A, T, G, and C you’ve probably heard about.
The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
Now, not only do all the cells in your body have identical genomes…
[ANIMATE]
But individual humans have genomes that are very similar to each other.
So similar that I can define “the same” chromosome between individuals… and that means
[ANIMATE]
That we can define a ‘base’ or a ‘reference’ chromosome
[ANIMATE]
And a concept of ‘location’ across chromosomes. This is maybe the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This means that we can talk about differences between individuals in terms of diffs to a common reference genome.
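That "diffs against a common reference" idea is concrete enough to sketch in a few lines. This is a toy model with made-up data, not any real variant-calling format, but it shows why a shared linear coordinate system lets an individual genome be described as a short list of (chromosome, position, ref, alt) differences:

```python
# A variant as a diff against the reference: (chromosome, 0-based position,
# reference base, alternate base). Data and layout are illustrative.
reference = {"chr1": "ACGTACGT"}

def apply_variants(ref, variants):
    """Apply single-base substitutions to a reference genome."""
    seqs = {chrom: list(seq) for chrom, seq in ref.items()}
    for chrom, pos, ref_base, alt_base in variants:
        # Sanity check: the diff must agree with the reference it patches.
        assert seqs[chrom][pos] == ref_base, "variant does not match reference"
        seqs[chrom][pos] = alt_base
    return {chrom: "".join(seq) for chrom, seq in seqs.items()}

# One individual differs from the reference at chr1:3 (T -> G).
individual = apply_variants(reference, [("chr1", 3, "T", "G")])
print(individual["chr1"])  # "ACGGACGT"
```

Real formats like VCF encode essentially this: positions on named chromosomes plus reference and alternate alleles, rather than whole sequences.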
But where does this reference genome come from?
Here is Bill Clinton (with Craig Venter and Francis Collins), announcing in June of 2000 the “rough draft” of the Human Genome – this is the Human Genome Project.
1570: Theatrum Orbis Terrarum
“Theater of the world”
First modern atlas.
A direct byproduct of the first 100 years of PRINTING, and a tool for describing and exploring the world around us.
Its direct descendants are still with us, today!
Google maps!
But what does the genomic version of this look like?
Mapmakers today focus on *annotation* of the maps themselves.
The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes.
This is a Manhattan plot of Alzheimer’s-related genes and sequence markers.
Now let’s shift gears, and talk about how this was performed – through sequencers.
Sequencers are microscopes that read the genome.
If there’s one graph you should remember, in order to understand the last (and the next) ten years of bioinformatics and genomics, it’s this one
The Human Genome Project was thousands of researchers, billions of dollars, spent over a decade, all to sequence on-the-order-of half a dozen individuals.
Today, we’re close to the “thousand dollar genome” – and already we’re seeing prototype sequencers with the form factor of a USB stick.
So sequencers will drive everything before it – but sequencers are only ever half the story.
Bioinformatics is a computational reversal of the sequencing process.
[ANIMATE]
But to most
So… what’s in the box?
It’s a pipeline! (Makes sense, since I’m also name-checking Spark, right?)
It’s never *one* pipeline, we do this once for every person
Let me talk a little bit about the structure of one of these pipelines
Each step is typically written as a standalone program – passing files from stage to stage – often using something like unix pipes
These are written as part of a globally-distributed research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem
But of course, it’s never one pipeline
[ANIMATE]
It’s a pipeline per person
[ANIMATE]
But since each pipeline runs (essentially) serially, scaling up is easy: scale out!
[ANIMATE]
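The structure just described (standalone stages chained like unix pipes, one pipeline per person, scale-out by running pipelines independently) can be sketched in a few lines. The stage names are stand-ins, not real bioinformatics tools:

```python
# Toy per-sample pipeline: each stage is a pure function, chained the way
# standalone programs are chained with unix pipes. Stages are stand-ins.
def basecall(sample):
    return sample["raw"].upper()              # stand-in for base calling

def align(reads):
    return sorted(reads)                      # stand-in for alignment

def call_variants(aligned):
    return [b for b in aligned if b == "G"]   # stand-in for variant calling

def pipeline(sample):
    return call_variants(align(basecall(sample)))

samples = [{"raw": "gattaca"}, {"raw": "ggtacg"}]

# One pipeline per person. Each run is independent of every other run,
# so scaling up really is just scaling out: map over samples (e.g. with
# multiprocessing.Pool or one cluster job per sample).
results = [pipeline(s) for s in samples]
```

The independence of the runs is the whole point: as long as no stage needs data from another person's pipeline, adding people just means adding machines.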
That was the data side, but let’s open up the computation as well. Take one of those boxes, that I drew earlier. Here’s alignment, but it could be…
[ANIMATE]
any bioinformatics tool. I assert that there are *two* things going on inside any bioinformatics tool –
[ANIMATE]
There is the method, and there is the implementation of that method. I think this is an important distinction to make…
But even that is a lie, because there is a third thing…
[ANIMATE]
“Platform.” That’s why I’ve included this code snippet up above.
So what’s the problem? Faster sequencers means we sequence more people, but we have tools that work and a natural path to parallelism! Why does there need to be a “next generation?”
The answer, of course, is that when you have all that data, you want to *USE* all that data.
When you want to *use* all the data, now your entire system will start to show cracks.
This is an example, variant calling.
But [ANIMATE]
God help you if you want to combine statistical information at an earlier phase of the process.
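Here is the crack, in miniature. Per-sample pipelines leave the data keyed by *sample*; joint analysis needs it keyed by *genomic position*, with evidence from every sample at each locus. Re-keying the data that way is exactly a shuffle, which the one-pipeline-per-person design has no place for. A toy sketch with made-up calls:

```python
from collections import defaultdict

# Output of independent per-sample pipelines: data keyed by sample.
# (chromosome, 0-based position, observed base) -- toy data.
per_sample_calls = {
    "sample1": [("chr1", 100, "A"), ("chr1", 200, "G")],
    "sample2": [("chr1", 100, "T"), ("chr2", 50, "C")],
}

# Joint analysis wants the same data keyed by locus instead:
# gather every sample's evidence at each (chromosome, position).
by_locus = defaultdict(list)
for sample, calls in sorted(per_sample_calls.items()):
    for chrom, pos, base in calls:
        by_locus[(chrom, pos)].append((sample, base))

# Each locus now carries evidence from every sample that covers it.
print(by_locus[("chr1", 100)])  # [('sample1', 'A'), ('sample2', 'T')]
```

With files on a shared filesystem this re-keying means a painful merge of N sorted per-sample files; in a dataflow system it is a single group-by.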
But this is by no means a unique problem. And what is one solution? You might have guessed it from the title of my talk…
There’s more parallelism that we can extract from our pipelines.
Spark.
The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…