With the size of genomic data doubling every seven months, existing tools in the genomic space, designed for the gigabyte scale, tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while remaining flexible for ad hoc analysis, Databricks and the Regeneron Genetics Center have partnered to launch an open-source project, Glow.
The project includes optimized DataFrame readers for loading genomic data formats, as well as Spark SQL functions to perform statistical tests and quality-control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual's genomic sequence differs from a reference human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals' genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or to flat files.
3. Agenda
• Genomics overview
– Big data problem
– Real-world applications
– Pain points at biobank scale
• Glow
– Datasources
– Built-in functions
– Extensibility
5. Genomics is a big data problem
• 40,000 petabytes of sequencing data per year by 2025
• Cost per genome down from $2.7B to under $1,000
https://www.genome.gov/27541954/dna-sequencing-costs-data/
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
7. The power of big genomic data
Accelerate Target Discovery
• Goal: identify a biological target (e.g. a protein) that can be modulated with a drug
• Approach: large-scale regressions to correlate DNA variants with the trait
• Result: clinical trials with genomic evidence are 2x more likely to be approved by the FDA
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547496/
(Figure: orthosteric inhibition)
10. The power of big genomic data
• Accelerate Target Discovery
• Reduce Costs via Precision Prevention
• Improve Survival with Optimized Treatment
12. Genomic analysis on big data is hard!
• Existing tools are often difficult to:
– Scale
– Learn
– Integrate
for i in chr2L chr2R chr3L chr3R chr4 chrX; do
  GenomeAnalysisTK -R ${ref_seq} \
    -T SelectVariants \
    -V my_flies.vcf \
    -L $i \
    -o my_flies.${i}.vcf
done

for i in {1..16}; do
  vcftools --vcf VCF_FILE --chr $i \
    --recode --recode-INFO-all --out VCF_$i
done

bgzip -c myvcf.vcf > myvcf.vcf.gz
tabix -p vcf myvcf.vcf.gz
tabix myvcf.vcf.gz chr1 > chr1.vcf

java -jar SnpSift.jar split file.vcf
(Screenshot: the PLINK flag reference — dozens of command-line options spanning data management, basic statistics, linkage disequilibrium, distance matrices, identity-by-descent, population stratification, association analysis, family-based association, report postprocessing, epistasis, allelic scoring, and R plugins)
1. Converting one file format to another file format.
2. Converting one file format to another file format.
3. Converting one file format to another file format.
“Give a statistical geneticist an awk line, feed him for a day; teach a statistical geneticist how to awk, feed him for a lifetime...”
31. Glow
• Open-source toolkit for large-scale genomic analysis
• Built on Spark for biobank scale
• Query and use built-in commands with familiar languages using Spark SQL
• Compatible with existing genomic tools and formats, as well as big data and ML tools
64. rdd.pipe()
• Input and output RDDs have a single text column
– Input: set the header as pipe context
– Output: mixed header and text data
• Convert between genomic file formats
– Changing specs
70. glow.transform('pipe')
• For each partition:
– The input formatter writes to the command's stdin
– The output formatter reads from the command's stdout
– If running the command triggers an exception, the error is propagated to the driver
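As a rough, stdlib-only sketch of this per-partition contract (the `pipe_partition` helper and the inline upper-casing "tool" are hypothetical stand-ins, not Glow's actual implementation):

```python
import subprocess
import sys

def pipe_partition(rows, cmd):
    """Pipe one partition's rows through an external command: write the rows
    to the command's stdin, read transformed rows from its stdout, and raise
    on failure (Spark would propagate that error to the driver)."""
    text = "\n".join(rows) + "\n"
    proc = subprocess.run(cmd, input=text, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError("pipe command failed: " + proc.stderr.strip())
    return proc.stdout.splitlines()

# A tiny stand-in "tool" that upper-cases its input, in place of a real
# bioinformatics command such as a VCF annotator.
cmd = [sys.executable, "-c",
       "import sys; [sys.stdout.write(l.upper()) for l in sys.stdin]"]
out = pipe_partition(["chr22\t35292447", "chr22\t35292456"], cmd)
```

The real transformer formats rows to and from the tool's text protocol; the key point is that each partition becomes one stdin/stdout conversation with the command.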
72. glow.transform('pipe')
• CSV output formatter
– Write the schema as the first element in the iterator
– Write the remaining rows to the iterator

CHR  POS       BETA   SE     p.value
22   35292447  1.206  3.285  0.714
22   35292456  1.358  2.534  0.592

StructType(
  Seq("CHR", "POS", "BETA", "SE", "p.value")
    .map(StructField(_, StringType)))

InternalRow("22", "35292447", "1.206", "3.285", "0.714")
InternalRow("22", "35292456", "1.358", "2.534", "0.592")
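The header-first convention above can be sketched in plain Python (a hypothetical helper, tab-delimited like the table; the real formatter emits Spark `InternalRow`s rather than lists):

```python
import csv
import io

def csv_to_schema_and_rows(text, delimiter="\t"):
    """Sketch of a CSV output formatter: the first record becomes the schema
    (every field typed as a string), the remaining records become data rows."""
    records = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    schema = [(name, "string") for name in records[0]]
    return schema, records[1:]

text = ("CHR\tPOS\tBETA\tSE\tp.value\n"
        "22\t35292447\t1.206\t3.285\t0.714\n"
        "22\t35292456\t1.358\t2.534\t0.592\n")
schema, rows = csv_to_schema_and_rows(text)
```

Typing every field as a string mirrors the slide's `StructField(_, StringType)`: the formatter defers type casting to later Spark SQL expressions.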
73. glow.transform('pipe')
• Input and output DataFrames
– Input: infer the header from the schema
– Output: infer the schema from the header
• Convert genomic data under the hood
– Spark Row ↔ Java object ↔ text
75. GWAS
• Load variants
• Perform quality control
• Control for ancestry
• Run regression against trait
• Log Manhattan plot
Load variants:
spark.read.format("vcf")
  .load("genotypes.vcf")
Perform quality control:
variant_df.selectExpr("*",
    "expand_struct(call_summary_stats(genotypes))",
    "expand_struct(hardy_weinberg(genotypes))")
  .where((col("alleleFrequencies").getItem(0) >= allele_freq_cutoff) &
         (col("alleleFrequencies").getItem(0) <= (1.0 - allele_freq_cutoff)) &
         (col("pValueHwe") >= hwe_cutoff))
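For intuition, here is roughly what a `hardy_weinberg`-style check computes, as a stdlib-only chi-squared sketch (the helper name is hypothetical and Glow's actual test statistic may differ; the 1-degree-of-freedom p-value comes from `erfc`):

```python
import math

def hardy_weinberg_pvalue(n_AA, n_Aa, n_aa):
    """Chi-squared (1 df) test of Hardy-Weinberg equilibrium from genotype
    counts; p-value via the complementary error function (no SciPy needed)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # reference allele frequency
    expected = [p * p * n, 2 * p * (1 - p) * n, (1 - p) * (1 - p) * n]
    observed = [n_AA, n_Aa, n_aa]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For X ~ chi2(1 df): P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

p_balanced = hardy_weinberg_pvalue(25, 50, 25)   # exactly in equilibrium
p_skewed = hardy_weinberg_pvalue(50, 0, 50)      # no heterozygotes at all
```

Variants with a very small HWE p-value (like the all-homozygote case above) are usually genotyping artifacts, which is why the QC filter keeps only rows with `pValueHwe` above a cutoff.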
Write QC results to Delta:
qc_df.write
  .format("delta")
  .save(delta_path)
Control for ancestry:
matrix.computeSVD(num_pcs)
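`computeSVD` here is Spark MLlib's distributed SVD on a row matrix; conceptually it extracts the top principal components of the genotype matrix, which are then used as ancestry covariates. A minimal stand-in for the top component (power iteration on the centered Gram matrix, stdlib only; the helper is hypothetical):

```python
import math

def top_principal_component(rows, iters=100):
    """Power iteration on the centered Gram matrix A^T A: a tiny stand-in
    for the top right-singular vector that computeSVD would return."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    a = [[r[j] - means[j] for j in range(d)] for r in rows]
    # C = A^T A, a d x d covariance-like matrix
    c = [[sum(a[i][p] * a[i][q] for i in range(n)) for q in range(d)]
         for p in range(d)]
    v = [1.0] + [0.0] * (d - 1)              # arbitrary non-zero start
    for _ in range(iters):
        w = [sum(c[p][q] * v[q] for q in range(d)) for p in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Samples that vary mostly along the (1, 1) direction, so the top
# principal component should have (near-)equal loadings.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (1.0, 0.0), (0.0, 1.0)]
pc1 = top_principal_component(data)
```

In the GWAS, each sample's projections onto the first few such components become the `covariates` passed to the regression in the next step.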
Run regression against trait:
genotypes.crossJoin(phenotypeAndCovariates)
  .selectExpr(
    "expand_struct("
    "linear_regression_gwas("
    "genotype_states(genotypes), "
    "phenotype_values, covariates))")
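`linear_regression_gwas` fits an ordinary least squares model per variant and returns the effect size (beta), its standard error, and a p-value. A stdlib-only single-variant sketch (no covariates, and a normal approximation to the t-test, so it is illustrative rather than exact):

```python
import math

def ols_gwas(genotypes, phenotypes):
    """Single-variant ordinary least squares: returns (beta, standard error,
    two-sided p-value via a normal approximation to the t-test)."""
    n = len(genotypes)
    gm = sum(genotypes) / n
    ym = sum(phenotypes) / n
    sxx = sum((g - gm) ** 2 for g in genotypes)
    sxy = sum((g - gm) * (y - ym) for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx
    resid = [y - ym - beta * (g - gm) for g, y in zip(genotypes, phenotypes)]
    sigma2 = sum(r * r for r in resid) / (n - 2)   # residual variance
    se = math.sqrt(sigma2 / sxx)
    z = beta / se if se > 0 else float("inf")
    pval = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return beta, se, pval

# Genotype dosages (0/1/2 alternate alleles) vs. a trait that tracks them.
g = [0, 1, 2, 1, 0, 2, 1, 2]
y = [0.1, 1.9, 4.2, 2.1, -0.2, 3.8, 2.0, 4.1]
beta, se, pval = ols_gwas(g, y)
```

Run once per variant across millions of variants, these (beta, se, p-value) triples are exactly the rows that feed the Manhattan plot below.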
Create the Manhattan plot (with R's qqman):
gwas_results_rdf <- as.data.frame(gwas_results)
install.packages("qqman", repos="http://cran.us.r-project.org")
library(qqman)
png('/databricks/driver/manhattan.png')
manhattan(gwas_results_rdf)
(Example Manhattan plot: http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250)
Log the Manhattan plot with MLflow:
mlflow.log_artifact(
'/databricks/driver/manhattan.png')