1. The document discusses using PySpark and Pandas UDFs to perform machine learning at scale on genomic data, illustrated by a genomics use case called GloWGR.
2. Three key problems are identified with existing tools: genomic data are growing too quickly; bioinformaticians are unfamiliar with Scala; and ML algorithms are difficult to write in Spark SQL. The proposed solutions are to use Spark, provide a Python client, and write algorithms in Python linked to Spark.
3. GloWGR is presented as a novel whole genome regression and association study algorithm built with PySpark. It uses Pandas UDFs to parallelize the REGENIE method and perform tasks like dimensionality reduction and ridge regression at scale.
2. Agenda
● Discuss using PySpark (especially Pandas UDFs) to perform machine learning at unprecedented scale
● Learn about an application to a genomics use case (GloWGR)
5. Design decisions
1. Problem: Genomic data are growing too quickly for existing tools
Solution: Use big data tools (Spark)
2. Problem: Bioinformaticians are not familiar with the native languages used by big data tools (Scala)
Solution: Provide clients for high-level languages (Python)
3. Problem: Performant, maintainable machine learning algorithms are difficult to write natively in big data tools (Spark SQL expressions)
Solution: Write algorithms in high-level languages and link them to big data tools (PySpark)
8. Biobank datasets are growing in scale
Genomic data are growing at an exponential pace:
• Next-generation sequencing: genotyping arrays (1 Mb) → whole exome sequences (39 Mb) → whole genome sequences (3,200 Mb)
• 1,000s of samples → 100,000s of samples
• 10s of traits → 1,000s of traits
10. Differentiation from single-node libraries
Glow:
▪ Flexible: Glow is built natively on Spark, a general-purpose big data engine, enabling aggregation and mining of genetic variants on an industrial scale
▪ Low-overhead: Spark minimizes serialization cost with libraries like Kryo and Arrow
Single-node tools:
▪ Inflexible: Each tool requires custom parallelization logic, per language and algorithm
▪ High-overhead: Moving text between arbitrary processes hurts performance
11. Problem 2: Bioinformaticians are not familiar with the native languages used by big data tools, such as Scala
13. Data engineers and scientists are Python-oriented
● More than 60% of notebook commands in Databricks are written in Python
● Fewer than 20% of commands are written in Scala
18. Spark SQL expressions
• Built to process data row by row; difficult to maintain state
• Minimal support for machine learning
• Overhead from converting rows to ML-compatible shapes (e.g., matrices)
• Few linear algebra libraries exist in Scala, and those that do are limited in functionality
19. Solution 3: Write algorithms in high-level languages and link them to big data tools
20. Python improves the developer experience
• Pandas: user-defined functions (UDFs)
• Apache Arrow: transfer data between JVM and Python processes
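A minimal sketch of a scalar Pandas UDF in the Spark 3 style; the column name and the standardization logic are illustrative. Arrow moves each batch between the JVM and the Python worker, so the function operates on whole Pandas Series rather than individual rows.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def standardize(values: pd.Series) -> pd.Series:
    # Operates on a whole Arrow batch at once, not row by row
    return (values - values.mean()) / values.std()

# Assuming df is a Spark DataFrame with a numeric "dosage" column:
# df.select(standardize("dosage")).show()
```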
21. Feature in Spark 3.0: mapInPandas
Local algorithm development in Pandas; plug-and-play with Spark with minimal overhead
[Diagram: locally, f maps a Pandas DataFrame X to Y (f(X) → Y); with mapInPandas, the same f is applied to an iterator of DataFrame batches, Iter(X) → Iter(Y)]
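A minimal runnable sketch of mapInPandas under these assumptions (toy input, illustrative column names): f is written and tested against plain Pandas DataFrames, then applied unchanged to an iterator of Arrow batches on the cluster.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "x")  # toy input

def f(batches):  # Iter(X) -> Iter(Y)
    for pdf in batches:           # each pdf is a pandas.DataFrame
        pdf["y"] = pdf["x"] ** 2  # develop and test this logic locally
        yield pdf

df.mapInPandas(f, schema="x long, y long").show()
```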
24. Genome Wide Association Studies (GWAS)
Detect associations between genetic variations and traits of interest across a population
• Common genetic variants confer a small amount of risk
• Rare genetic variants confer a large amount of risk
25. Whole Genome Regression (WGR)
Account for polygenic effects, population structure, and relatedness
• Reduce false positives
• Reduce false negatives
26. Mission: Industrialize genomics by integrating bioinformatics into data science
Core principles:
• Build on Apache Spark
• Flexibly and natively support genomics tools and file formats
• Provide single-line functions for common genomics workloads
• Build an open-source community
27. Glow v1.0.0
● Datasources: Read/write common genomic file formats (e.g., VCF, BGEN, Plink, GFF3) into/from Spark DataFrames
● SQL expressions: Simple variant handling operations can be called from Python, SQL, Scala, or R
● Transformers: Complex genomic transformations can be called from Python or Scala
● GloWGR: Novel WGR/GWAS algorithm built with PySpark
https://projectglow.io/
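A short sketch of the datasource workflow as documented at projectglow.io; the file path is illustrative and an existing SparkSession named spark is assumed.

```python
import glow

# Register Glow's functions and datasources with the session
spark = glow.register(spark)

# Read a VCF directly into a Spark DataFrame
df = spark.read.format("vcf").load("/data/genotypes.vcf.gz")
df.printSchema()
```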
28. GloWGR: WGR and GWAS
● Detect which genotypes are associated with each phenotype using a generalized linear model
● Glow parallelizes the REGENIE method via Spark as GloWGR
● Built from the ground up using Pandas UDFs
29. GloWGR: Learning at huge dimensions
• WGR Reduction: ~5,000 multi-variate linear ridge regressions (one for each block and parameter)
• WGR Regression: ~5,000 multi-variate linear or logistic ridge regressions with cross validation
• GWAS Regression Tests: Millions of single-variate linear or logistic regressions
[Diagram: matrices of dimensions 500K x 1M, 500K x 100, and 500K x 50]
32. Stage 2: Dimensionality reduction
RidgeReduction.fit
● Pandas UDF: Construct X and Y matrices for each block and calculate XᵀX and XᵀY
● Pandas UDF: Reduce with element-wise sum over sample blocks
● Pandas UDF: Assemble the matrices XᵀX and XᵀY for a particular sample block and calculate B = (XᵀX + αI)⁻¹XᵀY
RidgeReduction.transform
● Pandas UDF: Calculate XB for each block (sketched below)
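A schematic sketch, not GloWGR's actual implementation: the per-block ridge solve B = (XᵀX + αI)⁻¹XᵀY expressed as a grouped Pandas UDF, condensing the three UDF steps above into one. Column names, the schema, and the single α value are illustrative.

```python
import numpy as np
import pandas as pd

ALPHA = 0.1  # illustrative regularization strength

def fit_block(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds one block: feature columns x0..xk plus a trait column y
    x_cols = [c for c in pdf.columns if c.startswith("x")]
    X = pdf[x_cols].to_numpy()
    Y = pdf[["y"]].to_numpy()
    XtX = X.T @ X
    XtY = X.T @ Y
    # Closed-form ridge solution: B = (XᵀX + αI)⁻¹ XᵀY
    B = np.linalg.solve(XtX + ALPHA * np.eye(XtX.shape[0]), XtY)
    return pd.DataFrame({"feature": x_cols, "beta": B.ravel()})

# Assuming blocked_df is a Spark DataFrame with a block_id column:
# blocked_df.groupBy("block_id").applyInPandas(
#     fit_block, schema="feature string, beta double")
```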
33. Stage 3: Estimate phenotypic predictors
RidgeRegression.fit
● Pandas UDF: Construct X and Y matrices for each block and calculate XᵀX and XᵀY
● Pandas UDF: Reduce with element-wise sum over sample blocks
● Pandas UDF: Assemble the matrices XᵀX and XᵀY for a particular sample block and calculate B = (XᵀX + αI)⁻¹XᵀY
● Perform cross validation and pick the model with the best α
RidgeRegression.transform_loco
● Pandas UDF: Calculate XB for each block in a leave-one-chromosome-out (LOCO) fashion (sketched below)
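A schematic sketch of the LOCO idea in plain Pandas, assuming a hypothetical DataFrame of per-block predictions with sample_id, contig, and prediction columns: each chromosome's phenotype estimate is the sum of predictions from every other chromosome's blocks.

```python
import pandas as pd

def loco_predictions(block_preds: pd.DataFrame) -> pd.DataFrame:
    # Total prediction per sample across all blocks
    total = block_preds.groupby("sample_id")["prediction"].sum()
    frames = []
    for contig, chrom in block_preds.groupby("contig"):
        held_out = chrom.groupby("sample_id")["prediction"].sum()
        # Phenotype estimate excluding all blocks on this chromosome
        loco = total - held_out.reindex(total.index, fill_value=0.0)
        frames.append(loco.rename("y_hat_loco").reset_index().assign(contig=contig))
    return pd.concat(frames, ignore_index=True)
```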
34. GWAS
Y ~ Gβg + Cβc + ϵ
Y − Ŷ ~ Gβg + Cβc + ϵ
Use the phenotype estimate Ŷ output by WGR to account for polygenic effects during regression
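A worked numeric sketch of the offset idea, using synthetic NumPy data: fit the residual Y − Ŷ against one variant's dosages G plus covariates C. All values here are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 1000                               # samples
C = rng.normal(size=(S, 3))            # covariates
G = rng.binomial(2, 0.3, size=S)       # genotype dosages for one variant
Y = 0.2 * G + C @ np.array([0.5, -0.3, 0.1]) + rng.normal(size=S)
Y_hat = rng.normal(scale=0.1, size=S)  # stand-in for the WGR phenotype estimate

# Regress Y - Ŷ on [G, C, intercept]; beta[0] estimates the variant effect βg
X = np.column_stack([G, C, np.ones(S)])
beta, *_ = np.linalg.lstsq(X, Y - Y_hat, rcond=None)
print("estimated beta_g:", beta[0])
```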
35. GWAS with Spark SQL expressions
[Diagram: Data (S samples, C covariates, V variants, T traits) → T null models (S samples, C covariates, 1 trait) → V x T fitted models (S samples, C covariates, 1 variant, 1 trait; Y ~ Gβg + Cβc) → Results (V variants, T traits)]
38. GWAS with Spark SQL expressions
Pros
• Portable to all Spark clients
Cons
• Requires writing your own Spark SQL expressions
• User-unfriendly linear algebra libraries in Scala (i.e., Breeze)
• Limited to 2 dimensions
• Unnatural expressions of mathematical operations
• Customized, expensive data transfers: Spark DataFrames ↔ MLlib matrices ↔ Breeze matrices
• Input and output must be Spark DataFrames
39. GWAS with PySpark
[Diagram: Phenotype matrix (S samples x T traits), covariate matrix (S samples x C covariates), and genotype matrix (S samples x V variants) → T null models (S samples, C covariates, 1 trait) → T x (number of partitions) batched fitted models (Y ~ Gβg + Cβc over O(V) variants and O(T) traits per batch) → Results (V variants, T traits)]
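One way the batched model might look, sketched as a mapInPandas-style generator; the column names, the covariate projection, and the closure over Y_resid and Q are assumptions for illustration, not Glow's actual code. Each Arrow batch of variants is regressed against all traits at once in NumPy.

```python
import numpy as np
import pandas as pd

def run_batch(batches, Y_resid, Q):
    # Y_resid: residualized phenotypes (S x T); Q: orthonormal basis of the
    # covariate space (e.g., from np.linalg.qr), used to project covariates
    # out of each genotype vector. Bind both via closure before handing the
    # function to mapInPandas.
    for pdf in batches:
        G = np.stack(pdf["genotypes"].to_numpy())  # variants x samples
        G = G - (G @ Q) @ Q.T                      # remove covariate effects
        # Per-variant, per-trait effect sizes: beta = (g·y) / (g·g)
        betas = (G @ Y_resid) / (G * G).sum(axis=1, keepdims=True)
        yield pd.DataFrame({
            "variant_id": pdf["variant_id"],
            "betas": list(betas),                  # one effect size per trait
        })
```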
40. GWAS with PySpark
Pros
• User-friendly Python libraries (i.e., Pandas)
• Easy to express mathematical notation
• Unlimited dimensions
• Batched, optimized transfers between Pandas and Spark DataFrames
• Input and output can be Pandas or Spark DataFrames
Cons
• Accessible only from Python
43. Differentiation from other parallelized libraries
Glow:
▪ Lightweight: Glow is a thin layer built to be compatible with the latest major Spark releases, as well as other open-source libraries (e.g., Delta)
▪ Flexible: Glow includes a set of core algorithms, and is easily extended to ad-hoc use cases using existing tools
Other parallelized libraries:
▪ Heavyweight: Many libraries build on custom logic that makes it difficult to update to new technologies
▪ Inflexible: Many libraries expose custom interfaces that make it difficult to extend beyond the built-in algorithms