Extending Machine Learning Algorithms with PySpark
Karen Feng, Kiavash Kianfar
Databricks
Agenda
● Discuss using PySpark (especially Pandas UDFs) to perform machine learning at unprecedented scale
● Learn about an application for a genomics use case (GloWGR)
Design decisions
1. Problem: Genomic data are growing too quickly for existing tools
Solution: Use big data tools (Spark)
2. Problem: Bioinformaticians are not familiar with the native languages used by big data tools (Scala)
Solution: Provide clients for high-level languages (Python)
3. Problem: Performant, maintainable machine learning algorithms are difficult to write natively in big data tools (Spark SQL expressions)
Solution: Write algorithms in high-level languages and link them to big data tools (PySpark)
Problem 1: Genomic data are growing too fast for existing tools
Genomic data are growing at an exponential pace
Biobank datasets are growing in scale:
• Next-generation sequencing
• Genotyping arrays (1 Mb)
• Whole exome sequence (39 Mb)
• Whole genome sequence (3,200 Mb)
• 1,000s of samples → 100,000s of samples
• 10s of traits → 1,000s of traits
Solution 1: Use general-purpose big data tools, specifically Spark
Differentiation from single-node libraries
Glow (built on Spark):
▪ Flexible: Glow is built natively on Spark, a general-purpose big data engine, enabling aggregation and mining of genetic variants on an industrial scale
▪ Low-overhead: Spark minimizes serialization cost with libraries like Kryo and Arrow
Single-node tools:
▪ Inflexible: each tool requires custom parallelization logic, per language and algorithm
▪ High-overhead: moving text between arbitrary processes hurts performance
Problem 2: Bioinformaticians are not familiar with the native languages used by big data tools, such as Scala
Spark is predominantly written in Scala
Data engineers and scientists are Python-oriented
● More than 60% of notebook commands in Databricks are written in Python
● Fewer than 20% of commands are written in Scala
Bioinformaticians are even more Python-oriented
Solution 2: Provide clients for high-level languages, such as Python
Python improves the user experience
• Py4J: achieve near-feature parity with Scala APIs
• PySpark Project Zen
• PySpark type hints (a sketch follows below)
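As a hedged illustration (not from the slides), a scalar Pandas UDF written in the type-hinted style that Project Zen brought to PySpark might look like this; the column name and transformation are hypothetical:

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Illustrative only: a scalar Pandas UDF declared with Python type hints,
# the style introduced alongside PySpark's Project Zen work in Spark 3.0.
@pandas_udf("double")
def standardize(values: pd.Series) -> pd.Series:
    # Operates on a whole batch (pandas Series) at once; data moves between
    # the JVM and the Python worker via Apache Arrow.
    return (values - values.mean()) / values.std()

# Hypothetical usage:
# df = df.withColumn("value_std", standardize("value"))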
Problem 3: Performant, maintainable machine learning algorithms are difficult to write natively in big data tools
Spark SQL expressions
• Built to process data row by row
• Difficult to maintain state
• Minimal support for machine learning
• Overhead from converting rows to ML-compatible shapes (e.g., matrices)
• Few linear algebra libraries exist in Scala, and they offer limited functionality
Solution 3: Write algorithms in high-level languages and link them to big data tools
Python improves the developer experience
• Pandas: user-defined functions (UDFs)
• Apache Arrow: transfers data between JVM and Python processes
Feature in Spark 3.0: mapInPandas
• Local algorithm development in Pandas: write f(X) → Y against a single pandas DataFrame
• Plug-and-play with Spark with minimal overhead: the same f is applied to an iterator of pandas DataFrames, Iter(X) → Iter(Y), one batch at a time per partition (sketch below)
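A minimal sketch of this pattern, assuming a toy DataFrame and a hypothetical function f; this is not GloWGR code:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

# f is developed and tested locally against a single pandas DataFrame.
def f(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(y=pdf["x"] * 2.0)

# mapInPandas hands each partition to Python as an iterator of pandas
# DataFrames (Arrow batches) and expects an iterator back.
def f_batches(batches):
    for pdf in batches:
        yield f(pdf)

result = df.mapInPandas(f_batches, schema="x long, y double")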
Deep Dive: Genomics Use Case
Single nucleotide polymorphisms (SNPs)
Genome-Wide Association Studies (GWAS)
Detect associations between genetic variations and traits of interest across a population
• Common genetic variations confer a small amount of risk
• Rare genetic variations confer a large amount of risk
Whole Genome Regression (WGR)
Account for polygenic effects, population structure, and relatedness
• Reduce false positives
• Reduce false negatives
Mission: Industrialize genomics by integrating bioinformatics into data science
Core principles:
• Build on Apache Spark
• Flexibly and natively support genomics tools and file formats
• Provide single-line functions for common genomics workloads
• Build an open-source community
Glow v1.0.0
● Datasources: read/write common genomic file formats (e.g., VCF, BGEN, Plink, GFF3) into/from Spark DataFrames
● SQL expressions: simple variant handling operations can be called from Python, SQL, Scala, or R
● Transformers: complex genomic transformations can be called from Python or Scala
● GloWGR: novel WGR/GWAS algorithm built with PySpark
https://projectglow.io/
GloWGR: WGR and GWAS
● Detect which genotypes are associated with each phenotype using a Generalized Linear Model
● Glow parallelizes the REGENIE method via Spark as GloWGR
● Built from the ground up using Pandas UDFs
GloWGR: Learning at huge dimensions
• WGR Reduction: ~5,000 multivariate linear ridge regressions (one for each block and parameter)
• WGR Regression: ~5,000 multivariate linear or logistic ridge regressions with cross validation
• GWAS Regression Tests: millions of univariate linear or logistic regressions
(Figure: matrices on the order of 500K x 1M, 500K x 100, and 500K x 50)
Data preparation
Transformation and SQL functions on the genomic variant DataFrame (usage sketch below):
● split_multiallelics
● genotype_states
● mean_substitute
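A hedged sketch of how these might be strung together with Glow's Python bindings; the input path is hypothetical and exact APIs may differ across Glow versions:

import glow
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
# Register Glow's SQL functions and transformers with the active session.
glow.register(spark)

# Hypothetical input path; Glow ships a VCF datasource.
df = spark.read.format("vcf").load("genotypes.vcf.gz")

# split_multiallelics is a Glow transformer; genotype_states and
# mean_substitute are Glow SQL functions (all named on the slide).
split = glow.transform("split_multiallelics", df)
prepared = split.withColumn(
    "values", expr("mean_substitute(genotype_states(genotypes))"))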
Stage 1: Genotype matrix blocking
Stage 2: Dimensionality reduction
RidgeReduction.fit
● Pandas UDF: construct the X and Y matrices for each block and calculate XᵀX and XᵀY (sketch below)
● Pandas UDF: reduce with an element-wise sum over sample blocks
● Pandas UDF: assemble the summed XᵀX and XᵀY for a particular sample block and calculate B = (XᵀX + αI)⁻¹XᵀY
RidgeReduction.transform
● Pandas UDF: calculate XB for each block
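A hedged sketch of the first step (not GloWGR's actual implementation): computing the per-block Gramians XᵀX and XᵀY with a grouped Pandas UDF. The column names, grouping keys, and stand-in phenotype matrix are assumptions:

import numpy as np
import pandas as pd

# Stand-in phenotype matrix Y (samples x traits); in practice it is small
# enough to capture in the UDF's closure or broadcast to the workers.
n_samples, n_traits = 100, 3
Y = np.random.default_rng(0).normal(size=(n_samples, n_traits))

def gramians(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each row of the blocked genotype DataFrame is assumed to hold one
    # variant's per-sample values in a "values" array column.
    X = np.column_stack(pdf["values"].to_list())   # samples x variants
    xtx = X.T @ X                                   # variants x variants
    xty = X.T @ Y                                   # variants x traits
    return pd.DataFrame({"xtx": [xtx.ravel().tolist()],
                         "xty": [xty.ravel().tolist()]})

# Applied per block with a grouped Pandas UDF, e.g.:
# reduced = blocked.groupBy("header_block", "sample_block").applyInPandas(
#     gramians, schema="xtx array<double>, xty array<double>")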
Stage 3: Estimate phenotypic predictors
RidgeRegression.fit
● Pandas UDF: construct the X and Y matrices for each block and calculate XᵀX and XᵀY
● Pandas UDF: reduce with an element-wise sum over sample blocks
● Pandas UDF: assemble the summed XᵀX and XᵀY for a particular sample block and calculate B = (XᵀX + αI)⁻¹XᵀY
● Perform cross validation and pick the model with the best α (a worked example of the solve follows)
RidgeRegression.transform_loco
● Pandas UDF: calculate XB for each block in a leave-one-chromosome-out (LOCO) fashion
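A worked illustration of the closed-form solve B = (XᵀX + αI)⁻¹XᵀY across candidate regularization strengths; the arrays and alpha grid below are stand-ins:

import numpy as np

rng = np.random.default_rng(1)
n_variants, n_traits = 50, 3

# Stand-ins for the Gramians summed across sample blocks.
xtx = rng.normal(size=(n_variants, n_variants))
xtx = xtx @ xtx.T                                  # symmetric positive semi-definite
xty = rng.normal(size=(n_variants, n_traits))

alphas = [0.01, 0.1, 1.0, 10.0]
betas = {alpha: np.linalg.solve(xtx + alpha * np.eye(n_variants), xty)
         for alpha in alphas}
# Cross validation (not shown) scores each alpha on held-out sample blocks
# and keeps the best-performing model.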
GWAS
Standard model: Y ~ Gβg + Cβc + ε
WGR-adjusted model: Y − Ŷ ~ Gβg + Cβc + ε
Use the phenotype estimate Ŷ output by WGR to account for polygenic effects during regression; a small worked example follows.
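A hedged NumPy sketch of the adjusted association test: regress the residualized phenotype Y − Ŷ on each variant's genotypes plus covariates, one univariate test per variant. All arrays are illustrative stand-ins:

import numpy as np

rng = np.random.default_rng(2)
n_samples, n_variants, n_covariates = 500, 100, 4

G = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)  # genotypes
C = rng.normal(size=(n_samples, n_covariates))                      # covariates
y = rng.normal(size=n_samples)                                      # phenotype
y_hat = rng.normal(scale=0.1, size=n_samples)                       # WGR estimate

residual = y - y_hat
betas = np.empty(n_variants)
for v in range(n_variants):
    # Design matrix: the variant, the covariates, and an intercept.
    X = np.column_stack([G[:, v], C, np.ones(n_samples)])
    coef, *_ = np.linalg.lstsq(X, residual, rcond=None)
    betas[v] = coef[0]        # effect size estimate for this variant
# In GloWGR these per-variant tests are vectorized and distributed with
# Pandas UDFs rather than looped on one machine.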
GWAS with Spark SQL expressions
• Data: S samples, C covariates, V variants, T traits
• Null model (Cβc only): S samples, C covariates, 1 trait; fit T times
• Fitted model (Gβg + Cβc): S samples, C covariates, 1 variant, 1 trait; fit V x T times
• Results: V variants x T traits
GWAS with Spark SQL expressions
Pros
• Portable to all Spark clients
Cons
• Requires writing your own Spark SQL expressions
• User-unfriendly linear algebra libraries in Scala (i.e., Breeze)
  • Limited to 2 dimensions
  • Unnatural expressions of mathematical operations
• Customized, expensive data transfers
  • Spark DataFrames ↔ MLlib matrices ↔ Breeze matrices
• Input and output must be Spark DataFrames
GWAS with PySpark
• Phenotype matrix: S samples x T traits
• Covariate matrix: S samples x C covariates
• Null model (Cβc): S samples, C covariates, 1 trait; fit T times
• Genotype matrix: S samples x V variants
• Fitted model (Gβg + Cβc): S samples, C covariates, O(V) variants, O(T) traits; fit T x (number of partitions) times
• Results: V variants x T traits
GWAS with PySpark
Pros
• User-friendly Python libraries (e.g., Pandas)
• Easy to express mathematical notation
• Unlimited dimensions
• Batched, optimized transfers between Pandas and Spark DataFrames
• Input and output can be Pandas or Spark DataFrames
Cons
• Accessible only from Python
GWAS: Spark SQL vs. PySpark
• Spark SQL: I/O via Spark DataFrames; linear algebra via Spark ML/MLlib and Breeze; accessible from Scala, Python, and R
• PySpark: I/O via Spark or Pandas DataFrames; linear algebra via Pandas, NumPy, einsum, and more; accessible from Python only
Differentiation from other parallelized libraries
Glow:
▪ Lightweight: Glow is a thin layer built to be compatible with the latest major Spark releases, as well as other open-source libraries (e.g., Delta)
▪ Flexible: Glow includes a set of core algorithms and is easily extended to ad-hoc use cases using existing tools
Other parallelized libraries:
▪ Heavyweight: many libraries build on custom logic that makes it difficult to update to new technologies
▪ Inflexible: many libraries expose custom interfaces that make it difficult to extend beyond the built-in algorithms
Future work: gene burden tests
Big takeaways
1. Listen to your users
2. Use the latest off-the-shelf tools
3. If all else fails, pivot early
Feedback
Your feedback is important to us. Don’t forget to rate and review the sessions.