Scaling Genetic Data Analysis with Hail and Apache Spark
Jon Bloom and Tim Poterba,
Bio-curious Mathware Engineers

Initiative in Scalable Analytics
at the
Broad Institute of MIT and Harvard
…in the heart of biomedical research and technology:
Bay
Bay State
http://visual6502.org/images/pages/Intel_8086_die_shots.html
Computers are complex
• Hardware and software
• Many levels of abstraction
• Execution at all levels
Cluster
Machine
Microprocessor
Gate / Register
Transistor
Physics
Application
Spark
Scala
Bytecode
Assembly
Machine Code
Biology is ridiculously complex
• Wetware and software
• Many levels of abstraction
• Execution at all levels
• We designed computers. Biology evolved us!
Population
Organism
System
Organ
Tissue
Cell
Organelle
Molecular Biology
Biochemistry
Physics
Behavior / Phenotype
…
…
…
…
…
…
Protein
RNA
DNA
Biology is ridiculously complex
Khan Academy
PCSK9
https://www.youtube.com/watch?v=U7Bh41YnfB0&list=PLlMMtlgw6qNjROoMNTBQjAcdx53kV50cS
Why understand biology?
• Science!

• Technology!

• Immortality!
Why understand biology?
• Science!

• Technology!

• To decipher the wetware and software stack of human biology in order to prevent, diagnose, and treat disease.
Genetics
• 7.5 billion people
• 23 chromosome pairs
• 3.2 billion base pairs
• 1.5% of the genome codes for about 20k genes
• 98.5% is non-coding
gnomad.broadinstitute.org
https://www.snpedia.com/index.php/Heritability
Most diseases are heritable
[Figure: diseases placed along a heritability scale with ends labeled "More" and "Less". Labels: Type-1 diabetes, Type-2 diabetes, Stroke, Alcoholism, Alzheimer’s, Autism, Bipolar, Breast cancer, Coronary artery, Depression, Epilepsy, Heart, Migraine, Obesity, Schizophrenia, Ovarian cancer, Colon cancer, Lung cancer, Acne, Celiac, Crohn's, Parkinson's, Prostate cancer, Thyroid cancer, Eczema, Anorexia, Asthma, Hypertension, Cystic Fibrosis, Tay-Sachs, Huntington’s]
Placebo
Evolocumab
Statistical genetics
[Figure: genomes of samples S1 … SN, labeled "Low LDL" and "High LDL", with the genes PCSK9 and SHH marked]
Stephens, et al., Big Data: Astronomical or Genomical? (2015)
Broad Institute data
• The Broad sequences 1 genome every 10 minutes.
• The Broad generates 17 TB of new genomes per day.
• The Broad manages 45 PB of scientific data.
Genome
Transcriptome
Epigenome
Proteome
Metabolome
Microbiome
Cancer
Single-cell RNA
Science
Runtime
Implementation
Computational Experiments
Science
Runtime
Implementation
Failure to Scale
• Powerful and flexible
• Domain-relevant, easy to use
• Fast and scalable
• Open-source and open-dev!
Drop the latency of computational experiments to rev the engine of biomedical science!
Genotype
{
“call”: “A/T”,
“reads”: [10, 8],
“quality”: 43,
“p”: [43, 0, 52]
}
ID-indexed table
{
“LDL”: 75.123,
“ancestry”: “SAS”,
“cohort”: “1KG”
}
Locus-indexed table
{
“gene”: “SHH”,
“pred_impact”: “high”,
“pop_frequency”: 0.102
}
Genomic Locus
{
“chromosome”: 1,
“position”: 16123092,
“reference”: “A”,
“alternate”: “T”
}
Individual ID
“NA12878”
What is the average genotype quality for common loci in each cohort?
Spark Core
SQL, MLlib
• scalability
• high-level programming APIs
• linear algebra, MLlib
• Scala, Python, R
DataFrame 2
{
“chromosome”: 1,
“position”: 16123092,
“reference_allele”: “A”,
“alternate_allele”: “T”,
“gene”: “SHH”,
“pred_impact”: “high”,
“pop_frequency”: 0.102
}
DataFrame 1
{
“ID”: “NA12878”,
“LDL”: 75.123,
“ancestry”: “SAS”,
“cohort”: “1KG”
}
DataFrame 3
{
“ID”: “NA12878”,
“chromosome”: 1,
“position”: 16123092,
“reference_allele”: “A”,
“alternate_allele”: “T”,
“call”: “A/T”,
“reads”: [10, 8],
“quality”: 43,
“p”: [43, 0, 52]
}
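SELECT df1.cohort, avg(df3.quality) AS mean_quality
FROM df3
JOIN df1 ON df3.ID = df1.ID
JOIN df2 ON df3.chromosome = df2.chromosome
  AND df3.position = df2.position
  AND df3.reference_allele = df2.reference_allele
  AND df3.alternate_allele = df2.alternate_allele
WHERE df2.pop_frequency > 0.10
GROUP BY df1.cohort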
What is the average genotype quality for common loci in each cohort?
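The query above can be sketched in plain Python to make the two joins explicit. This is a minimal illustration on toy in-memory records, not Spark or Hail code: the second sample, the second locus, and the helper names `locus_key` and `mean_quality_by_cohort` are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Toy rows shaped like DataFrame 1 (samples), DataFrame 2 (loci), DataFrame 3 (genotypes).
# Only NA12878 and the first locus come from the slides; the rest is hypothetical.
samples = [
    {"ID": "NA12878", "LDL": 75.123, "ancestry": "SAS", "cohort": "1KG"},
    {"ID": "S2", "LDL": 110.0, "ancestry": "EUR", "cohort": "OTHER"},
]
loci = [
    {"chromosome": 1, "position": 16123092, "reference_allele": "A",
     "alternate_allele": "T", "gene": "SHH", "pop_frequency": 0.102},
    {"chromosome": 1, "position": 20000000, "reference_allele": "C",
     "alternate_allele": "G", "gene": "SHH", "pop_frequency": 0.01},
]
genotypes = [
    {"ID": "NA12878", "chromosome": 1, "position": 16123092,
     "reference_allele": "A", "alternate_allele": "T", "quality": 43},
    {"ID": "S2", "chromosome": 1, "position": 16123092,
     "reference_allele": "A", "alternate_allele": "T", "quality": 30},
    {"ID": "NA12878", "chromosome": 1, "position": 20000000,
     "reference_allele": "C", "alternate_allele": "G", "quality": 99},
]


def locus_key(row):
    """The four-field join key {chr, pos, ref, alt}."""
    return (row["chromosome"], row["position"],
            row["reference_allele"], row["alternate_allele"])


def mean_quality_by_cohort(genotypes, samples, loci, min_frequency=0.10):
    """Inner-join genotypes to samples and loci, keep common loci, average quality per cohort."""
    cohort_of = {s["ID"]: s["cohort"] for s in samples}
    frequency_of = {locus_key(l): l["pop_frequency"] for l in loci}
    qualities = defaultdict(list)
    for g in genotypes:
        cohort = cohort_of.get(g["ID"])
        frequency = frequency_of.get(locus_key(g))
        if cohort is None or frequency is None:
            continue  # inner join: drop genotypes with no matching sample or locus
        if frequency > min_frequency:
            qualities[cohort].append(g["quality"])
    return {cohort: mean(values) for cohort, values in qualities.items()}


result = mean_quality_by_cohort(genotypes, samples, loci)
```

Note that the flattened genotype rows repeat the four locus fields for every individual, which is the redundancy the three-DataFrame layout carries.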
Spark Core
SQL, MLlib
• scalability
• high-level programming APIs
• linear algebra, MLlib
• Scala, Python, R
Hail
• genomic data ETL
• high-level APIs for multi-dimensional data query
• stats and ML methods
• Scala, Python
Genomic Locus
{
“chromosome”: 1,
“position”: 16123092,
“reference”: “A”,
“alternate”: “T”
}
RDD
Partitions
RDD[(Locus, (Table, Array[Genotype]))]
Individual ID
“NA12878”
Array[Table]
filterCols
filterRows
filterCells
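A minimal sketch of this layout and the three filters, in plain Python rather than on an RDD. The class `VariantMatrix`, its field names, and the predicate signatures are assumptions for illustration and are not Hail's API; each row mirrors one `(Locus, (Table, Array[Genotype]))` element, and the column side mirrors the Individual IDs with their `Array[Table]`.

```python
from dataclasses import dataclass


@dataclass
class VariantMatrix:
    """Hypothetical in-memory stand-in for the locus-by-individual matrix."""
    col_ids: list     # one Individual ID per column
    col_tables: list  # one annotation dict per column (ID-indexed table)
    rows: list        # (locus, row_table, [genotype or None per column])

    def filter_rows(self, keep):
        """Keep loci for which keep(locus, row_table) is true."""
        rows = [r for r in self.rows if keep(r[0], r[1])]
        return VariantMatrix(self.col_ids, self.col_tables, rows)

    def filter_cols(self, keep):
        """Keep individuals for which keep(sample_id, col_table) is true."""
        idx = [j for j, (sid, table) in enumerate(zip(self.col_ids, self.col_tables))
               if keep(sid, table)]
        rows = [(locus, table, [gs[j] for j in idx]) for locus, table, gs in self.rows]
        return VariantMatrix([self.col_ids[j] for j in idx],
                             [self.col_tables[j] for j in idx], rows)

    def filter_cells(self, keep):
        """Set a genotype to missing (None) unless keep(locus, sample_id, genotype) is true."""
        rows = [(locus, table,
                 [g if g is not None and keep(locus, sid, g) else None
                  for sid, g in zip(self.col_ids, gs)])
                for locus, table, gs in self.rows]
        return VariantMatrix(self.col_ids, self.col_tables, rows)


# Toy data: the first locus and NA12878 follow the slides, the rest is made up.
locus_a = (1, 16123092, "A", "T")
locus_b = (1, 20000000, "C", "G")
matrix = VariantMatrix(
    col_ids=["NA12878", "S2"],
    col_tables=[{"cohort": "1KG"}, {"cohort": "OTHER"}],
    rows=[
        (locus_a, {"pop_frequency": 0.102}, [{"call": "A/T", "quality": 43},
                                             {"call": "A/A", "quality": 20}]),
        (locus_b, {"pop_frequency": 0.01}, [None, {"call": "C/G", "quality": 50}]),
    ],
)

common = matrix.filter_rows(lambda locus, table: table["pop_frequency"] > 0.10)
one_cohort = matrix.filter_cols(lambda sid, table: table["cohort"] == "1KG")
confident = matrix.filter_cells(lambda locus, sid, g: g["quality"] >= 40)
```

Row filters touch whole elements, column filters re-index every genotype array, and cell filters leave the shape unchanged; that asymmetry follows directly from storing the matrix row by row.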
map
f
reduceByRow
reduceByCol
rdd.map rdd.reduce
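A minimal sketch of map and the two reductions on a row-major matrix, assuming a list of rows stands in for the RDD. The function names and signatures are hypothetical. One plausible reading of the slide's pairing with rdd.map and rdd.reduce is that a row reduction only needs the data inside one element, while a column reduction has to merge partial results across elements.

```python
from functools import reduce


def map_cells(matrix, f):
    """Apply f to every cell, keeping the matrix shape."""
    return [[f(cell) for cell in row] for row in matrix]


def reduce_by_row(matrix, f, op, zero):
    """One result per row: each row is reduced on its own, like a map over rows."""
    return [reduce(op, (f(cell) for cell in row), zero) for row in matrix]


def reduce_by_col(matrix, f, op, zero):
    """One result per column: per-row contributions are merged into a running vector."""
    n_cols = len(matrix[0]) if matrix else 0
    acc = [zero] * n_cols
    for row in matrix:
        acc = [op(a, f(cell)) for a, cell in zip(acc, row)]
    return acc


# Toy matrix of alternate-allele counts: rows are loci, columns are individuals.
alt_counts = [
    [0, 1, 2],
    [1, 1, 0],
]
add = lambda a, b: a + b

carriers = map_cells(alt_counts, lambda n: n > 0)
alt_per_locus = reduce_by_row(alt_counts, lambda n: n, add, 0)
alt_per_individual = reduce_by_col(alt_counts, lambda n: n, add, 0)
```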
join: the troublemaker
• Need to join by genomic locus all the time
• Machine learning models for effect of mutation
• Clinical annotation databases
• Combine information across genetic datasets
Typical join
[Figure: join diagram with data movement labeled "Network"]
Partitioned join
[Figure: join diagram with data movement labeled "Network" and "Local"]
OrderedRDD
• Generalizes Spark RangePartitioner
• Shuffle-free joins
• Preserves partitioner on disk
OrderedRDD range join
[Figure: four-step animation of a range join; the numeric labels are not recoverable from the extracted text]
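The idea of joining two key-ordered, range-partitioned datasets without a shuffle can be sketched as follows. This is an illustration under stated assumptions, not Hail's OrderedRDD implementation: partitions are plain sorted lists of `(key, value)`, keys are unique within each dataset, and the names `merge_join` and `ordered_join` are made up.

```python
def merge_join(left, right):
    """Inner join of two lists of (key, value), both sorted by key with unique keys."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            out.append((left_key, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out


def ordered_join(left_parts, right_parts):
    """For each left partition, read only the right partitions whose key range overlaps it."""
    results = []
    for left in left_parts:
        if not left:
            continue
        lo, hi = left[0][0], left[-1][0]
        # Right partitions are globally ordered, so the concatenation stays sorted.
        candidates = [row for part in right_parts
                      if part and part[0][0] <= hi and part[-1][0] >= lo
                      for row in part]
        results.extend(merge_join(left, candidates))
    return results


# Keys are (chromosome, position); values are placeholders.
left_parts = [
    [((1, 100), "a"), ((1, 200), "b")],
    [((1, 300), "c"), ((2, 50), "d")],
]
right_parts = [
    [((1, 100), "x")],
    [((1, 200), "y"), ((1, 300), "z")],
    [((2, 60), "w")],
]
joined = ordered_join(left_parts, right_parts)
```

Because both sides are sorted by the same key, each output partition depends on a small, contiguous range of input partitions instead of on all of them.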
OrderedRDD predicate pushdown
[Figure: ordered partitions with a target interval marked]
• Interval filter is O(partitions kept)
• 20 GB dataset: 100 ms on a laptop
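A minimal sketch of why an interval filter only needs to touch the partitions it keeps, assuming each partition is a non-empty sorted list of keys and that its first and last key are known up front. The function names are hypothetical; the binary search uses Python's standard `bisect` module.

```python
from bisect import bisect_left, bisect_right


def partitions_for_interval(starts, ends, lo, hi):
    """Indices of partitions that may hold keys in [lo, hi].

    starts[i] and ends[i] are the smallest and largest key of partition i;
    partitions are globally ordered, so both lists are sorted.
    """
    first = bisect_left(ends, lo)    # first partition whose largest key is >= lo
    last = bisect_right(starts, hi)  # one past the last partition whose smallest key is <= hi
    return range(first, last)


def filter_interval(partitions, lo, hi):
    """Return the keys in [lo, hi] and the number of partitions that were read."""
    starts = [part[0] for part in partitions]
    ends = [part[-1] for part in partitions]
    kept = partitions_for_interval(starts, ends, lo, hi)
    keys = [key for i in kept for key in partitions[i] if lo <= key <= hi]
    return keys, len(kept)


partitions = [[1, 2, 3], [5, 8], [10, 12], [15, 20]]
keys, partitions_read = filter_interval(partitions, 6, 11)
```

Locating the range costs two binary searches over the partition bounds; all remaining work is proportional to the partitions kept, and every other partition is never read.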
Where we are today
• Stable interface: 0.1 release

• Transformed iterative QC

• Rebuilt most standard genetics tools as methods in Hail
Hail Science
• L. Francioli, MacArthur Lab, Analysis of whole-genome sequencing from 15,139 individuals
• A. Ganna et al., Ultra-rare disruptive and damaging mutations influence educational attainment in the general population, Nature Neuroscience
• A. Ganna et al., The impact of ultra-rare variants on human diseases and traits
• A. Ganna et al., The impact of rare variants on schizophrenia: whole genome sequencing of 10,000 individuals from the WGSPD consortia
• M. Kurki, Palotie Lab, Alzheimer’s Disease Rare Variant Association Study in Finnish Founder Population
• M. Kurki, Palotie Lab, Genetic Architecture of Idiopathic Intellectual Disability in a Northern Finnish founder population cohort
• M. Kurki, P. Gormley, Palotie Lab, Genetic Architecture of Familial Migraine in a Family collection of 9000 Individuals in 2000 Families
• K. Karczewski, MacArthur Lab, The Human Knockout Project: analyzing loss-of-function variants across 126,216 individuals
• X. Li et al., Developing and optimizing a whole genome and whole exome sequencing quality control pipeline with 652 Genotype-Tissue Expression donors
• M. A. Rivas et al., Insights into the genetic epidemiology of Crohn's and rare diseases in the Ashkenazi Jewish population
• K. Satterstrom, iPSYCH-Broad Consortium, Rare variants conferring risk for autism identified by whole exome sequencing of dried bloodspots
• C. Seed et al., Neale Lab, Hail: An Open-Source Framework for Scalable Genetic Data Analysis
• G. Tiao, Pan-Cancer Analysis of Whole Genomes, Analysis of rare variation in 2,818 whole-genome germline samples from cancer patients
• S. Maryam Zekavat, P. Natarajan, Kathiresan Lab, An analysis of deep, whole-genome sequences and plasma lipids in ~16,000 multi-ethnic samples
• S. Maryam Zekavat, Kathiresan Lab, An analysis of deep, whole-genome sequences and coronary artery disease in ~7,000 multi-ethnic samples
• S. Maryam Zekavat, Kathiresan Lab, Analyzing the full spectrum of genomic variation with Lp(a) Cholesterol: Novel insights from deep, whole genome sequence data in 5,192 Europeans and African Americans from Estonia and from the Jackson Heart Study
Hail Engagement
Academia
Aarhus University
Garvan Institute
Johns Hopkins
Mayo Clinic
University of Michigan
National Cancer Institute
Oxford
Sanger
Stanford
UCSF
Wash U Genome Institute
…
Industry
Databricks
Illumina
Cloudera
- Children’s of Atlanta
- Rush University MC
- Dignity Hospital
- Quest Labs
- Labcorp
Digital China Health
Cray
Metistream
…
New code
• 10-100x performance improvements in the next year
• compile DSL to JVM bytecode
• query optimizer
• explore alternatives to Parquet

• Build new models and algorithms for big genetic data
New applications
• Single-cell RNA profiles

• Deep phenotypes

• Best to analyze it all together!
Distributed Linear Algebra

Matrix Algebra
Numerical Methods
Data Query

MapReduce / SQL on
Tensorial Structured Data
Bayesian Inference
Graphical Models

MCMC
Variational Inference

Deep Learning
NN, CNN, RNN, GAN

Representational Learning

Reinforcement Learning
We need …
We need help!
Contributors
Szabolcs Berecz
Alex Bloemendal
John Compitello
Laurent Francioli
Konrad Karczewski
Jack Kosmicki
Mitja Kurki
Mark Pinese
Ben Weisburd
Tom White
Alex Zamoshchin
@shusson
Patrick Schultz
Duncan Palmer
Broad Institute
Ben Neale
Mark Daly
Daniel MacArthur
Too many other scientific
collaborators to list…
Hail Team
Cotton Seed
Jackie Goldstein
Dan King
John Compitello
www.hail.is
Demo at Cray booth:
Tue 12:20 - 1:20
Wed 3:20 - 4:20
www.broadinstitute.org/mia
hail@broadinstitute.org

Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 

Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba

  • 1. Scaling Genetic Data Analysis with Hail and Apache Spark. Jon Bloom and Tim Poterba, Bio-curious Mathware Engineers, Initiative in Scalable Analytics at the Broad Institute of MIT and Harvard
  • 2. …in the heart of biomedical research and technology: Bay
  • 3. Bay State …in the heart of biomedical research and technology:
  • 5. Computers are complex • Hardware and software • Many levels of abstraction • Execution at all levels Cluster Machine Microprocessor Gate / Register Transistor Physics Application Spark Scala Bytecode Assembly Machine Code
  • 6. Biology is ridiculously complex • Wetware and software • Many levels of abstraction • Execution at all levels Population Organism System Organ Tissue Cell Organelle Molecular Biology Biochemistry Physics Behavior / Phenotype … … … … … … Protein RNA DNA
  • 7. Biology is ridiculously complex • Wetware and software • Many levels of abstraction • Execution at all levels • We designed computers.
 Biology evolved us! Population Organism System Organ Tissue Cell Organelle Molecular Biology Biochemistry Physics Behavior / Phenotype … … … … … … Protein RNA DNA
  • 8. Biology is ridiculously complex Khan Academy Behavior / Phenotype … … … … … … Protein RNA DNA
  • 9. Biology is ridiculously complex Behavior / Phenotype … … … … … … Protein RNA DNA PCSK9
  • 11. Why understand biology? • Science!
 • Technology! https://www.youtube.com/watch?v=U7Bh41YnfB0&list=PLlMMtlgw6qNjROoMNTBQjAcdx53kV50cS
  • 12. Why understand biology? • Science!
 • Technology!
 • Immortality!
  • 13. Why understand biology? • Science!
 • Technology!
 • To decipher the wetware and software stack of human biology
 in order to prevent, diagnose, and treat disease.
  • 14. • 7.5 billion people • 23 chromosomes • 3.2 billion bases • 1.5% codes for 20k genes • 98.5% is non-coding Genetics
  • 15. Genetics gnomad.broadinstitute.org • 7.5 billion people • 23 chromosomes • 3.2 billion bases • 1.5% codes for 20k genes • 98.5% is non-coding
  • 16. https://www.snpedia.com/index.php/Heritability Most diseases are heritable (axis labels: More, Less): Type-1 diabetes, Type-2 diabetes, Stroke, Alcoholism, Alzheimer’s, Autism, Bipolar, Breast cancer, Coronary artery, Depression, Epilepsy, Heart, Migraine, Obesity, Schizophrenia, Ovarian cancer, Colon cancer, Lung cancer, Acne, Celiac, Crohn's, Parkinson's, Prostate cancer, Thyroid cancer, Eczema, Anorexia, Asthma, Hypertension, Cystic Fibrosis, Tay-Sachs, Huntington’s
  • 20. Stephens, et al., Big Data: Astronomical or Genomical? (2015)
  • 21. Broad Institute data • The Broad sequences 1 genome every 10 minutes. • The Broad generates 17 TB of new genomes per day. • The Broad manages 45 PB of scientific data. Genome Transcriptome Epigenome Proteome Metabolome
  • 22. Broad Institute data • The Broad sequences 1 genome every 10 minutes. • The Broad generates 17 TB of new genomes per day. • The Broad manages 45 PB of scientific data. Genome Transcriptome Epigenome Proteome Metabolome Microbiome Cancer Single-cell RNA
  • 25. • Powerful and flexible • Domain-relevant, easy to use • Fast and scalable Science Runtime Implementation
  • 26. • Powerful and flexible • Domain-relevant, easy to use • Fast and scalable • Open-source and open-dev! Science Runtime Implementation Open
  • 27. • Powerful and flexible • Domain-relevant, easy to use • Fast and scalable • Open-source and open-dev! Science Runtime Implementation Drop the latency of computational experiments
 to rev the engine of biomedical science!
  • 29. Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878”
  • 30. Genotype { “call”: “A/T”, “reads”: [10, 8], “quality”: 43, “p”: [43, 0, 52] } Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878”
  • 31. Genotype { “call”: “A/T”, “reads”: [10, 8], “quality”: 43, “p”: [43, 0, 52] } ID-indexed table { “LDL”: 75.123, “ancestry”: “SAS”, “cohort”: “1KG” } Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878”
  • 32. Genotype { “call”: “A/T”, “reads”: [10, 8], “quality”: 43, “p”: [43, 0, 52] } ID-indexed table { “LDL”: 75.123, “ancestry”: “SAS”, “cohort”: “1KG” } Locus-indexed table { “gene”: “SHH”, “pred_impact”: “high”, “pop_frequency”: 0.102 } Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878”
  • 33. Genotype { “call”: “A/T”, “reads”: [10, 8], “quality”: 43, “p”: [43, 0, 52] } ID-indexed table { “LDL”: 75.123, “ancestry”: “SAS”, “cohort”: “1KG” } Locus-indexed table { “gene”: “PCSK9”, “pred_impact”: “high”, “pop_frequency”: 0.102 } Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878”
  • 34. Genotype { “call”: “A/T”, “reads”: [10, 8], “quality”: 43, “p”: [43, 0, 52] } ID-indexed table { “LDL”: 75.123, “ancestry”: “SAS”, “cohort”: “1KG” } Locus-indexed table { “gene”: “SHH”, “pred_impact”: “high”, “pop_frequency”: 0.102 } Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878” What is the average genotype quality for common loci in each cohort?
  • 35. Genotype { “call”: “A/T”, “reads”: [10, 8], “quality”: 43, “p”: [43, 0, 52] } ID-indexed table { “LDL”: 75.123, “ancestry”: “SAS”, “cohort”: “1KG” } Locus-indexed table { “gene”: “SHH”, “pred_impact”: “high”, “pop_frequency”: 0.102 } Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878” What is the average quality for common loci for each cohort?
  • 36. Spark Core SQL MLlib • scalability • high-level programming APIs • linear algebra, MLlib • Scala, Python, R
  • 37. DataFrame 2 { “chromosome”: 1, “position”: 16123092, “reference_allele”: “A”, “alternate_allele”: “T”, “gene”: “SHH”, “pred_impact”: “high”, “pop_frequency”: 0.102 } DataFrame 1 { “ID”: “NA12878”, “LDL”: 75.123, “ancestry”: “SAS”, “cohort”: “1KG” } DataFrame 3 { “ID”: “NA12878”, “chromosome”: 1, “position”: 16123092, “reference_allele”: “A”, “alternate_allele”: “T”, “call”: “A/T”, “reads”: [10, 8], “quality”: 43, “p”: [43, 0, 52] } = + +
  • 38. df3 JOIN df1 ON ID JOIN df2 ON {chr, pos, ref, alt} WHERE pop_frequency > 0.10 GROUP BY cohort AGGREGATE mean(quality) What is the average genotype quality for common loci in each cohort? DataFrame 2 { “chromosome”: 1, “position”: 16123092, “reference_allele”: “A”, “alternate_allele”: “T”, “gene”: “SHH”, “pred_impact”: “high”, “pop_frequency”: 0.102 } DataFrame 1 { “ID”: “NA12878”, “LDL”: 75.123, “ancestry”: “SAS”, “cohort”: “1KG” } DataFrame 3 { “ID”: “NA12878”, “chromosome”: 1, “position”: 16123092, “reference_allele”: “A”, “alternate_allele”: “T”, “call”: “A/T”, “reads”: [10, 8], “quality”: 43, “p”: [43, 0, 52] }
  • 39. df3 JOIN df1 ON ID JOIN df2 ON {chr, pos, ref, alt} WHERE pop_frequency > 0.10 GROUP BY cohort AGGREGATE mean(quality) What is the average genotype quality for common loci in each cohort? DataFrame 2 { “chromosome”: 1, “position”: 16123092, “reference_allele”: “A”, “alternate_allele”: “T”, “gene”: “SHH”, “pred_impact”: “high”, “pop_frequency”: 0.102 } DataFrame 1 { “ID”: “NA12878”, “LDL”: 75.123, “ancestry”: “SAS”, “cohort”: “1KG” } DataFrame 3 { “ID”: “NA12878”, “chromosome”: 1, “position”: 16123092, “reference_allele”: “A”, “alternate_allele”: “T”, “call”: “A/T”, “reads”: [10, 8], “quality”: 43, “p”: [43, 0, 52] }
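The query on this slide can be sketched in plain Python to make the join, filter, group, and aggregate steps concrete. This is a minimal illustration, not Spark or Hail code: the second individual (`S2`), the second locus, and the function name `mean_quality_by_cohort` are hypothetical additions so that the filter and grouping have something to do.

```python
from collections import defaultdict

# Toy stand-ins for the three DataFrames on the slide.
# "S2" and the second locus are made-up values for illustration only.
samples = {  # DataFrame 1, keyed by individual ID
    "NA12878": {"LDL": 75.123, "ancestry": "SAS", "cohort": "1KG"},
    "S2": {"LDL": 90.0, "ancestry": "EUR", "cohort": "other"},
}
loci = {  # DataFrame 2, keyed by (chromosome, position, reference, alternate)
    (1, 16123092, "A", "T"): {"pop_frequency": 0.102},
    (1, 16200000, "G", "C"): {"pop_frequency": 0.010},
}
genotypes = [  # DataFrame 3, one record per (individual, locus) pair
    {"ID": "NA12878", "locus": (1, 16123092, "A", "T"), "quality": 43},
    {"ID": "S2", "locus": (1, 16123092, "A", "T"), "quality": 30},
    {"ID": "NA12878", "locus": (1, 16200000, "G", "C"), "quality": 99},
    {"ID": "S2", "locus": (1, 16200000, "G", "C"), "quality": 20},
]


def mean_quality_by_cohort(genotypes, samples, loci, min_frequency=0.10):
    """Average genotype quality per cohort over loci with pop_frequency > min_frequency."""
    totals = defaultdict(lambda: [0.0, 0])  # cohort -> [sum of quality, count]
    for g in genotypes:
        locus_info = loci[g["locus"]]            # JOIN df2 ON {chr, pos, ref, alt}
        if locus_info["pop_frequency"] <= min_frequency:
            continue                             # WHERE pop_frequency > 0.10
        cohort = samples[g["ID"]]["cohort"]      # JOIN df1 ON ID
        totals[cohort][0] += g["quality"]        # GROUP BY cohort
        totals[cohort][1] += 1
    return {cohort: s / n for cohort, (s, n) in totals.items()}  # mean(quality)


result = mean_quality_by_cohort(genotypes, samples, loci)
print(result)
```

Note that the flattened genotype table repeats the full locus key and the individual ID in every record, which is the redundancy the three-DataFrame layout on the slide carries.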
  • 40. Spark Core SQL MLlib Hail • scalability • high-level programming APIs • linear algebra, MLlib • Scala, Python, R • genomic data ETL • high-level APIs for multi-dimensional data query • stats and ML methods • Scala, Python
  • 41. RDD Partitions Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878” Array[Table]
  • 42. Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } RDD Partitions RDD[(Locus, (Table, Array[Genotype]))] Individual ID “NA12878” Array[Table]
  • 43. filterCols filterRows filterCells Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878”
  • 44. map f Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878”
  • 45. reduceByRow reduceByCol Genomic Locus { “chromosome”: 1, “position”: 16123092, “reference”: “A”, “alternate”: “T” } Individual ID “NA12878”
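The operations named on slides 43 to 45 can be illustrated with a small in-memory matrix whose rows are loci, whose columns are individuals, and whose cells are genotypes. This is a rough sketch under stated assumptions, not Hail's API: the class `ToyMatrix` and its snake_case method names are hypothetical, and treating a filtered-out cell as missing (`None`) is an assumption rather than something the slides state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyMatrix:
    rows: List[dict]                    # one annotation record per locus
    cols: List[dict]                    # one annotation record per individual
    cells: List[List[Optional[dict]]]   # cells[i][j] is the genotype at row i, column j

    def filter_rows(self, keep: Callable[[dict], bool]) -> "ToyMatrix":
        idx = [i for i, r in enumerate(self.rows) if keep(r)]
        return ToyMatrix([self.rows[i] for i in idx], self.cols,
                         [self.cells[i] for i in idx])

    def filter_cols(self, keep: Callable[[dict], bool]) -> "ToyMatrix":
        idx = [j for j, c in enumerate(self.cols) if keep(c)]
        return ToyMatrix(self.rows, [self.cols[j] for j in idx],
                         [[row[j] for j in idx] for row in self.cells])

    def filter_cells(self, keep: Callable[[dict], bool]) -> "ToyMatrix":
        # Assumption: a cell that fails the predicate becomes missing; shape is unchanged.
        return ToyMatrix(self.rows, self.cols,
                         [[g if g is not None and keep(g) else None for g in row]
                          for row in self.cells])

    def map_cells(self, f: Callable[[dict], dict]) -> "ToyMatrix":
        return ToyMatrix(self.rows, self.cols,
                         [[None if g is None else f(g) for g in row]
                          for row in self.cells])

    def reduce_by_row(self, f, zero):
        # One aggregate per locus, folding over that row's non-missing cells.
        return [self._fold(row, f, zero) for row in self.cells]

    def reduce_by_col(self, f, zero):
        # One aggregate per individual, folding over that column's non-missing cells.
        return [self._fold(col, f, zero) for col in zip(*self.cells)]

    @staticmethod
    def _fold(cells, f, zero):
        acc = zero
        for g in cells:
            if g is not None:
                acc = f(acc, g)
        return acc


# Two loci by two individuals; all values are illustrative.
m = ToyMatrix(
    rows=[{"pop_frequency": 0.102}, {"pop_frequency": 0.010}],
    cols=[{"id": "NA12878", "cohort": "1KG"}, {"id": "S2", "cohort": "other"}],
    cells=[[{"quality": 43}, {"quality": 30}],
           [{"quality": 99}, {"quality": 20}]],
)

common = m.filter_rows(lambda r: r["pop_frequency"] > 0.10)
confident = m.filter_cells(lambda g: g["quality"] >= 30)
calls_per_locus = confident.reduce_by_row(lambda acc, g: acc + 1, 0)
calls_per_sample = confident.reduce_by_col(lambda acc, g: acc + 1, 0)
quality_sum_per_sample = m.reduce_by_col(lambda acc, g: acc + g["quality"], 0)
```

Keeping row and column annotations beside the cell matrix means a filter on locus or individual metadata never has to touch, or join against, the genotype records themselves.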
  • 48. join: the troublemaker • Need to join by genomic locus all the time • Machine learning models for effect of mutation • Clinical annotation databases • Combine information across genetic datasets
  • 51. OrderedRDD • Generalizes Spark RangePartitioner • Shuffle-free joins • Preserves partitioner on disk
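One way to picture a shuffle-free join: if both datasets are sorted by locus and split at the same key boundaries, then partition p on one side can only match partition p on the other, so each pair can be merge-joined locally without moving data. The sketch below assumes exactly that (shared boundaries, unique keys within each side); the function names `merge_join` and `ordered_join` are hypothetical and this is not Hail's OrderedRDD implementation.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs, each sorted by key with unique keys."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key == right_key:
            out.append((left_key, (left_value, right_value)))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1
        else:
            j += 1
    return out


def ordered_join(left_partitions, right_partitions):
    """Join partition by partition.

    Assumption: both sides were range-partitioned with the same boundaries,
    so matching keys always live in partitions with the same index.
    """
    return [merge_join(l, r) for l, r in zip(left_partitions, right_partitions)]


# Keys are (chromosome, position); values are placeholder strings.
genotype_parts = [
    [((1, 100), "gt_a"), ((1, 200), "gt_b")],
    [((2, 50), "gt_c")],
]
annotation_parts = [
    [((1, 200), "annot_x")],
    [((2, 50), "annot_y"), ((2, 70), "annot_z")],
]

joined = ordered_join(genotype_parts, annotation_parts)
```

Each local merge is linear in the size of the two partitions, and the output stays sorted with the same boundaries, so it can feed another join of the same kind.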
  • 56. OrderedRDD predicate pushdown [Figure: ordered partitions 1, 2, 3, 4, 5, … with a target interval] • Interval filter is O(partitions kept) • 20G dataset: 100ms on a laptop
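A sketch of why an interval filter over range-partitioned data only needs to read the partitions that overlap the interval: with the first and last key of each partition known, a binary search finds the overlapping run and every other partition is skipped. The function names are hypothetical, integer keys stand in for loci, and the code assumes non-empty, sorted, non-overlapping partitions. A real system would presumably keep the partition bounds as stored metadata rather than recompute them as done here.

```python
import bisect


def partitions_overlapping(first_keys, last_keys, start, end):
    """Indices of partitions whose [first, last] key range overlaps [start, end]."""
    lo = bisect.bisect_left(last_keys, start)   # first partition whose last key >= start
    hi = bisect.bisect_right(first_keys, end)   # one past the last partition whose first key <= end
    return range(lo, hi)


def filter_interval(partitions, start, end):
    """Return (keys inside [start, end], indices of the partitions actually read)."""
    first_keys = [p[0] for p in partitions]
    last_keys = [p[-1] for p in partitions]
    kept = partitions_overlapping(first_keys, last_keys, start, end)
    hits = [k for i in kept for k in partitions[i] if start <= k <= end]
    return hits, list(kept)


# Five sorted partitions of integer keys.
partitions = [[1, 4, 9], [12, 15], [20, 22, 30], [41, 50], [63, 70]]
hits, read = filter_interval(partitions, 14, 25)
print(hits, read)
```

Only the partitions in `read` are scanned, so the work grows with the number of partitions kept rather than with the size of the whole dataset.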
  • 57. Where we are today • Stable interface: 0.1 release
 • Transformed iterative QC
 • Rebuilt most standard genetics tools as methods in Hail Science Runtime Implementation
  • 58. Hail Science • L. Francioli, MacArthur lab, Analysis of whole-genome sequencing from 15,139 individuals • A. Ganna et al., Ultra-rare disruptive and damaging mutations influence educational attainment in the general population, Nature Neuroscience • A. Ganna et al., The impact of ultra-rare variants on human diseases and traits • A. Ganna et al., The impact of rare variants on schizophrenia: whole genome sequencing of 10,000 individuals from the WGSPD consortia • M. Kurki, Palotie Lab, Alzheimer’s Disease Rare Variant Association Study in Finnish Founder Population • M. Kurki, Palotie Lab, Genetic Architecture of Idiopathic Intellectual Disability in a Northern Finnish founder population cohort • M. Kurki, P. Gormley, Palotie Lab, Genetic Architecture of Familial Migraine in a Family collection of 9000 Individuals in 2000 Families • K. Karczewski, MacArthur Lab, The Human Knockout Project: analyzing loss-of-function variants across 126,216 individuals
  • 59. Hail Science • X. Li et al., Developing and optimizing a whole genome and whole exome sequencing quality control pipeline with 652 Genotype-Tissue Expression donors • M. A. Rivas et al., Insights into the genetic epidemiology of Crohn's and rare diseases in the Ashkenazi Jewish population • K. Satterstrom, iPSYCH-Broad Consortium, Rare variants conferring risk for autism identified by whole exome sequencing of dried bloodspots • C. Seed et al., Neale Lab, Hail: An Open-Source Framework for Scalable Genetic Data Analysis • G. Tiao, Pan-Cancer Analysis of Whole Genomes, Analysis of rare variation in 2,818 whole-genome germline samples from cancer patients • S. Maryam Zekavat, P. Natarajan, Kathiresan Lab. An analysis of deep, whole-genome sequences and plasma lipids in ~16,000 multi-ethnic samples. • S. Maryam Zekavat, Kathiresan Lab. An analysis of deep, whole-genome sequences and coronary artery disease in ~7,000 multi-ethnic samples. • S. Maryam Zekavat, Kathiresan Lab. Analyzing the full spectrum of genomic variation with Lp(a) Cholesterol: Novel insights from deep, whole genome sequence data in 5,192 Europeans and African Americans from Estonia and from the Jackson Heart Study
  • 60. Hail Engagement
 Academia
 Aarhus University
 Garvan Institute
 Johns Hopkins
 Mayo Clinic
 University of Michigan
 National Cancer Institute
 Oxford
 Sanger
 Stanford
 UCSF
 Wash U Genome Institute
 …
 Industry
 Databricks
 Illumina
 Cloudera
 - Children’s of Atlanta
 - Rush University MC
 - Dignity Hospital
 - Quest Labs
 - Labcorp
 Digital China Health
 Cray
 Metistream
 …
  • 61. New code • 10-100x performance improvements in the next year • compile DSL to JVM bytecode • query optimizer • explore alternatives to Parquet
 • Build new models and algorithms for big genetic data
  • 62. New applications • Single-cell RNA profiles
 • Deep phenotypes
 • Best to analyze it all together!
  • 63. We need …
 Distributed Linear Algebra
 Matrix Algebra, Numerical Methods
 Data Query
 MapReduce / SQL on Tensorial Structured Data
 Bayesian Inference
 Graphical Models, MCMC, Variational Inference
 Deep Learning
 NN, CNN, RNN, GAN
 Representational Learning
 Reinforcement Learning
  • 65. Contributors Szabolcs Berecz Alex Bloemendal John Compitello Laurent Francioli Konrad Karczewski Jack Kosmicki Mitja Kurki Mark Pinese Ben Weisburd Tom White Alex Zamoshchin @shusson Patrick Schultz Duncan Palmer Broad Institute Ben Neale Mark Daly Daniel MacArthur Too many other scientific collaborators to list… Hail Team Cotton Seed Jackie Goldstein Dan King John Compitello www.hail.is Demo at Cray booth: Tue 12:20 - 1:20, Wed 3:20 - 4:20 www.broadinstitute.org/mia hail@broadinstitute.org