SlideShare a Scribd company logo
1 of 42
Combining Machine
Learning frameworks with
Apache Spark
Tim Hunter
Hadoop Summit
June 2016
About me
Apache Spark contributor (since Spark 0.6)
Software Engineer @ Databricks
Ph.D. in Machine Learning @ UC Berkeley
2
Founded by the team who
created Apache Spark
Offers a hosted service:
- Apache Spark in the cloud
- Notebooks
- Cluster management
- Production environment
About Databricks
3
Apache Spark
The most active open-source project in big data
Large-scale machine learning on Apache Spark
Spark MLlib
MLlib’s Mission
MLlib’s mission is to make practical machine
learning easy and scalable.
• Easy to build machine learning applications
• Capable of learning from large-scale datasets
• Easy to integrate into existing workflows
6
Algorithm Coverage
• Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
• Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
• Generalized Linear Models
• Frequent itemsets
• FP-growth
• PrefixSpan
7
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
• Bisecting K-Means
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
• Kolmogorov–Smirnov test
• Online hypothesis testing
• Survival analysis
Linear algebra
• Local dense & sparse vectors & matrices
• Normal equation for least squares
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
• One-Hot Encoder
• StringIndexer
• VectorIndexer
• VectorAssembler
• Binarizer
• Bucketizer
• ElementwiseProduct
• PolynomialExpansion
• Quantile discretizer
• SQL transformer
Model import/export
Pipelines
List based on Spark 2.0
Outline
• ML workflows are complex
• Distributing single-machine ML frameworks:
• Embedding with Spark:
• Unified cross-languages ML pipelines with MLlib
8
ML workflows are complex
• Specify the pipeline
• Re-run on new data
• Inspect the results
• Tune the parameters
• Usually, each step of a pipeline is easier with one
framework
9
ML Workflows are Complex
10
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract featuresExtract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
Existing tools
• Scikit-learn
– Excellent documentation
– Standard for Python
• R
– Lots of packages available
• Pandas
– Very easy to use
• A lot of investment in tooling and education
– How to integrate big data with these tools?
11
Common misconceptions
• Spark is for big data only
• Spark can only work with dedicated, distributed
libraries
12
Spark as a scheduler
• A lot of tasks in ML are ”embarrassingly parallel”
• Use Spark for data management and for
scheduling
13
One example: learning digits
• Learning tasks: given set of images, recognized
digits
• Standard benchmark dataset in computer vision
built by NIST:
14
Training Deep Learning algorithms
• Training a neural network is hard:
• It is a sequential procedure (present one image after the
other to learn from)
• It can be sensitive to noise and order of images:
robustness analysis is critical
• Tuning the training parameters (descent rate, batch sizes,
etc.) is very important. Otherwise, learning is too slow or
gets stuck in a local minima. A lot of heuristics are used in
practice.
15
TensorFlow as a training library
• A lot of algorithms have been presented for this
task, we will choose TensorFlow, from Google:
• Popular choice for neural network training and
deep learning
• Competitive performance
• Easy to experiment with
• Python interface makes it easy to integrate with
Spark
16
Distributing TensorFlow computations
• Even if TF is used as a single-machine library, we
get speedups from Spark
17
Distributed Cross Validation
...
Best
Model
Model #1
Training
Model #2
Training
Model #3
Training
Distributing TensorFlow computations
18
Distributed Cross Validation
...
Best Model
Model #4
Training
Model #6
Training
Model #3
Training
Model #1
Training
Model #5
Training
Model #2
Training
Results
• Running a 2-layer neural network, and testing for
different update rates and different layer sizes
19
0
3000
6000
9000
12000
1 node 2 nodes 13 nodes
Embedding deep learning in Spark
• Best known algorithms are essentially sequential
during training
• Careful selection of training parameters is critical
• Spark can help for fast iterations and find a good
set of parameters
20
Managing ML workflows with Spark
21
A data scientist’s wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings
22
Example: sentiment analysis
23
Given a review (text), predict the user’s rating.
Data from https://snap.stanford.edu/data/web-Amazon.html
ML Workflow
24
Train model
Evaluate
Load data
Extract features
Review: This product doesn't seem to be made to last… Rating: 2
feature_vector: [0.1 -1.3 0.23 … -0.74] rating: 2.0
Regression: (review: String) => Double
Load Data
25
built-in external
{ JSON }
JDBC
and more …
Data sources for DataFrames
LIBSVM
Train model
Evaluate
Load data
Extract features
Extract Features
words: [this, product, doesn't, seem, to, …]
feature_vector: [0.1 -1.3 0.23 … -0.74]
Review: This product doesn't seem to be made to last… Rating: 2
Prediction: 3.0
Train model
Evaluate
Load data
Tokenizer
Hashed Term Frequ.
Extract Features
words: [this, product, doesn't, seem, to, …]
feature_vector: [0.1 -1.3 0.23 … -0.74]
Review: This product doesn't seem to be made to last… Rating: 2
Prediction: 3.0
Linear regression
Evaluate
Load data
Tokenizer
Hashed Term Frequ.
Our ML workflow
28
Cross Validation
Model
Training
Feature
Extraction
regularization
parameter:
{0.0, 0.1, ...}
Cross validation
29
Cross Validation
...
Best Model
Model #1
Training
Model #2
Training
Feature
Extraction
Model #3
Training
Example
30
MLlib in production
ML Persistence
31
A data scientist’s wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings
32
DataFrame-based API for MLlib
a.k.a. “Pipelines” API, with utilities for constructing ML Pipelines
In 2.0, the DataFrame-based API will become the primary API for
MLlib.
• Voted by community
• org.apache.spark.ml, pyspark.ml
The RDD-based API will enter maintenance mode.
• Still maintained with bug fixes, but no new features
•org.apache.spark.mllib, pyspark.mllib
33
Why ML persistence?
34
Data Science Software Engineering
Prototype
(Python/R)
Create model
Re-implement model for
production (Java)
Deploy model
Why ML persistence?
35
Data Science Software Engineering
Prototype
(Python/R)
Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to
make prediction
• Extra implementation work
• Different code paths
• Synchronization overhead
Re-implement Pipeline
for production (Java)
Deploy Pipeline
With ML persistence...
36
Data Science Software Engineering
Prototype
(Python/R)
Create Pipeline
Persist model or Pipeline:
model.save(“s3n://...”)
Load Pipeline (Scala/Java)
Model.load(“s3n://…”)
Deploy in production
Model tuning
ML persistence status
37
Text
preprocessin
g
Feature
generation
Generalize
d Linear
Regressio
n
Unfitted Fitted
Model
Pipeline
Supported in
MLlib’s RDD-based
API
“recipe” “result”
ML persistence status
Near-complete coverage in all Spark language APIs
• Scala & Java: complete (29 feature transformers, 21 models)
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
38
A data scientist’s wish list:
• Run original code on a production environment
• Directly apply learned pipelines
• Use MLlib as export format
• Use distributed data sources
• Builtin Spark conversions
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Easy to distribute the most common ML tasks
39
What’s next?
Prioritized items on the 2.1 roadmap JIRA (SPARK-
15581):
• Critical feature completeness for the DataFrame-based API
– Multiclass logistic regression
– Statistics
• Python API parity & R API expansion
• Scaling & speed tuning for key algorithms: trees & ensembles
GraphFrames
• Release for Spark 2.0
• Speed improvements (join elimination, connected components)
40
Get started
• Get involved via roadmap JIRA (SPARK-
15581) + mailing lists
• Download notebook for this talk
http://dbricks.co/1UfvAH9
• ML persistence blog post
http://databricks.com/blog/2016/05/31
41
Try out the Apache Spark
2.0 preview release:
http://databricks.com/try
Thank you!
spark.apache.org
spark-packages.org
databricks.com

More Related Content

What's hot

3D V-Cache
3D V-Cache 3D V-Cache
3D V-Cache AMD
 
Active learning: Scenarios and techniques
Active learning: Scenarios and techniquesActive learning: Scenarios and techniques
Active learning: Scenarios and techniquesweb2webs
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with PerformersJoonhyung Lee
 
Relational knowledge distillation
Relational knowledge distillationRelational knowledge distillation
Relational knowledge distillationNAVER Engineering
 
Machine Learning and Data Mining: 10 Introduction to Classification
Machine Learning and Data Mining: 10 Introduction to ClassificationMachine Learning and Data Mining: 10 Introduction to Classification
Machine Learning and Data Mining: 10 Introduction to ClassificationPier Luca Lanzi
 
Data Science: Applying Random Forest
Data Science: Applying Random ForestData Science: Applying Random Forest
Data Science: Applying Random ForestEdureka!
 
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUsAMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUsAMD
 
Edition Based Redefinition - Continuous Database Application Evolution with O...
Edition Based Redefinition - Continuous Database Application Evolution with O...Edition Based Redefinition - Continuous Database Application Evolution with O...
Edition Based Redefinition - Continuous Database Application Evolution with O...Lucas Jellema
 
Transfer Learning and Fine-tuning Deep Neural Networks
 Transfer Learning and Fine-tuning Deep Neural Networks Transfer Learning and Fine-tuning Deep Neural Networks
Transfer Learning and Fine-tuning Deep Neural NetworksPyData
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
AI Chip Trends and Forecast
AI Chip Trends and ForecastAI Chip Trends and Forecast
AI Chip Trends and ForecastCastLabKAIST
 
A brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsA brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsShunta Saito
 
Intriguing properties of contrastive losses
Intriguing properties of contrastive lossesIntriguing properties of contrastive losses
Intriguing properties of contrastive lossestaeseon ryu
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 

What's hot (16)

3D V-Cache
3D V-Cache 3D V-Cache
3D V-Cache
 
Active learning: Scenarios and techniques
Active learning: Scenarios and techniquesActive learning: Scenarios and techniques
Active learning: Scenarios and techniques
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with Performers
 
Relational knowledge distillation
Relational knowledge distillationRelational knowledge distillation
Relational knowledge distillation
 
Machine Learning and Data Mining: 10 Introduction to Classification
Machine Learning and Data Mining: 10 Introduction to ClassificationMachine Learning and Data Mining: 10 Introduction to Classification
Machine Learning and Data Mining: 10 Introduction to Classification
 
Data Science: Applying Random Forest
Data Science: Applying Random ForestData Science: Applying Random Forest
Data Science: Applying Random Forest
 
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUsAMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
 
Edition Based Redefinition - Continuous Database Application Evolution with O...
Edition Based Redefinition - Continuous Database Application Evolution with O...Edition Based Redefinition - Continuous Database Application Evolution with O...
Edition Based Redefinition - Continuous Database Application Evolution with O...
 
Transfer Learning and Fine-tuning Deep Neural Networks
 Transfer Learning and Fine-tuning Deep Neural Networks Transfer Learning and Fine-tuning Deep Neural Networks
Transfer Learning and Fine-tuning Deep Neural Networks
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
SPIN in Five Slides
SPIN in Five SlidesSPIN in Five Slides
SPIN in Five Slides
 
AI Chip Trends and Forecast
AI Chip Trends and ForecastAI Chip Trends and Forecast
AI Chip Trends and Forecast
 
A brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsA brief introduction to recent segmentation methods
A brief introduction to recent segmentation methods
 
Intriguing properties of contrastive losses
Intriguing properties of contrastive lossesIntriguing properties of contrastive losses
Intriguing properties of contrastive losses
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 

Similar to Combining Machine Learning Frameworks with Apache Spark

Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsDatabricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkDatabricks
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache SparkMiklos Christine
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkDatabricks
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDatabricks
 
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15MLconf
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesDatabricks
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDatabricks
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 

Similar to Combining Machine Learning Frameworks with Apache Spark (20)

Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache Spark
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
 
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Spark Workshop
Spark WorkshopSpark Workshop
Spark Workshop
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Combining Machine Learning Frameworks with Apache Spark

  • 1. Combining Machine Learning frameworks with Apache Spark Tim Hunter Hadoop Summit June 2016
  • 2. About me Apache Spark contributor (since Spark 0.6) Software Engineer @ Databricks Ph.D. in Machine Learning @ UC Berkeley 2
  • 3. Founded by the team who created Apache Spark Offers a hosted service: - Apache Spark in the cloud - Notebooks - Cluster management - Production environment About Databricks 3
  • 4. Apache Spark The most active open-source project in big data
  • 5. Large-scale machine learning on Apache Spark Spark MLlib
  • 6. MLlib’s Mission MLlib’s mission is to make practical machine learning easy and scalable. • Easy to build machine learning applications • Capable of learning from large-scale datasets • Easy to integrate into existing workflows 6
  • 7. Algorithm Coverage • Classification • Logistic regression • Naive Bayes • Streaming logistic regression • Linear SVMs • Decision trees • Random forests • Gradient-boosted trees • Multilayer perceptron • Regression • Ordinary least squares • Ridge regression • Lasso • Isotonic regression • Decision trees • Random forests • Gradient-boosted trees • Streaming linear methods • Generalized Linear Models • Frequent itemsets • FP-growth • PrefixSpan 7 Clustering • Gaussian mixture models • K-Means • Streaming K-Means • Latent Dirichlet Allocation • Power Iteration Clustering • Bisecting K-Means Statistics • Pearson correlation • Spearman correlation • Online summarization • Chi-squared test • Kernel density estimation • Kolmogorov–Smirnov test • Online hypothesis testing • Survival analysis Linear algebra • Local dense & sparse vectors & matrices • Normal equation for least squares • Distributed matrices • Block-partitioned matrix • Row matrix • Indexed row matrix • Coordinate matrix • Matrix decompositions Recommendation • Alternating Least Squares Feature extraction & selection • Word2Vec • Chi-Squared selection • Hashing term frequency • Inverse document frequency • Normalizer • Standard scaler • Tokenizer • One-Hot Encoder • StringIndexer • VectorIndexer • VectorAssembler • Binarizer • Bucketizer • ElementwiseProduct • PolynomialExpansion • Quantile discretizer • SQL transformer Model import/export Pipelines List based on Spark 2.0
  • 8. Outline • ML workflows are complex • Distributing single-machine ML frameworks: • Embedding with Spark: • Unified cross-languages ML pipelines with MLlib 8
  • 9. ML workflows are complex • Specify the pipeline • Re-run on new data • Inspect the results • Tune the parameters • Usually, each step of a pipeline is easier with one framework 9
  • 10. ML Workflows are Complex 10 Train model 1 Evaluate Datasource 1 Datasource 2 Datasource 3 Extract featuresExtract features Feature transform 1 Feature transform 2 Feature transform 3 Train model 2 Ensemble
  • 11. Existing tools • Scikit-learn – Excellent documentation – Standard for Python • R – Lots of packages available • Pandas – Very easy to use • A lot of investment in tooling and education – How to integrate big data with these tools? 11
  • 12. Common misconceptions • Spark is for big data only • Spark can only work with dedicated, distributed libraries 12
  • 13. Spark as a scheduler • A lot of tasks in ML are ”embarrassingly parallel” • Use Spark for data management and for scheduling 13
  • 14. One example: learning digits • Learning tasks: given set of images, recognized digits • Standard benchmark dataset in computer vision built by NIST: 14
  • 15. Training Deep Learning algorithms • Training a neural network is hard: • It is a sequential procedure (present one image after the other to learn from) • It can be sensitive to noise and order of images: robustness analysis is critical • Tuning the training parameters (descent rate, batch sizes, etc.) is very important. Otherwise, learning is too slow or gets stuck in a local minima. A lot of heuristics are used in practice. 15
  • 16. TensorFlow as a training library • A lot of algorithms have been presented for this task, we will choose TensorFlow, from Google: • Popular choice for neural network training and deep learning • Competitive performance • Easy to experiment with • Python interface makes it easy to integrate with Spark 16
  • 17. Distributing TensorFlow computations • Even if TF is used as a single-machine library, we get speedups from Spark 17 Distributed Cross Validation ... Best Model Model #1 Training Model #2 Training Model #3 Training
  • 18. Distributing TensorFlow computations 18 Distributed Cross Validation ... Best Model Model #4 Training Model #6 Training Model #3 Training Model #1 Training Model #5 Training Model #2 Training
  • 19. Results • Running a 2-layer neural network, and testing for different update rates and different layer sizes 19 0 3000 6000 9000 12000 1 node 2 nodes 13 nodes
  • 20. Embedding deep learning in Spark • Best known algorithms are essentially sequential during training • Careful selection of training parameters is critical • Spark can help for fast iterations and find a good set of parameters 20
  • 21. Managing ML workflows with Spark 21
  • 22. A data scientist’s wish list: • Run original code on a production environment • Use distributed data sources • Use familiar APIs and libraries • Distribute ML workload piece by piece • Only distribute as needed • Easily switch between local & distributed settings 22
  • 23. Example: sentiment analysis 23 Given a review (text), predict the user’s rating. Data from https://snap.stanford.edu/data/web-Amazon.html
  • 24. ML Workflow 24 Train model Evaluate Load data Extract features Review: This product doesn't seem to be made to last… Rating: 2 feature_vector: [0.1 -1.3 0.23 … -0.74] rating: 2.0 Regression: (review: String) => Double
  • 25. Load Data 25 built-in external { JSON } JDBC and more … Data sources for DataFrames LIBSVM Train model Evaluate Load data Extract features
  • 26. Extract Features words: [this, product, doesn't, seem, to, …] feature_vector: [0.1 -1.3 0.23 … -0.74] Review: This product doesn't seem to be made to last… Rating: 2 Prediction: 3.0 Train model Evaluate Load data Tokenizer Hashed Term Frequ.
  • 27. Extract Features words: [this, product, doesn't, seem, to, …] feature_vector: [0.1 -1.3 0.23 … -0.74] Review: This product doesn't seem to be made to last… Rating: 2 Prediction: 3.0 Linear regression Evaluate Load data Tokenizer Hashed Term Frequ.
  • 28. Our ML workflow 28 Cross Validation Model Training Feature Extraction regularization parameter: {0.0, 0.1, ...}
  • 29. Cross validation 29 Cross Validation ... Best Model Model #1 Training Model #2 Training Feature Extraction Model #3 Training
  • 31. MLlib in production ML Persistence 31
  • 32. A data scientist’s wish list: • Run original code on a production environment • Use distributed data sources • Use familiar APIs and libraries • Distribute ML workload piece by piece • Only distribute as needed • Easily switch between local & distributed settings 32
  • 33. DataFrame-based API for MLlib a.k.a. “Pipelines” API, with utilities for constructing ML Pipelines In 2.0, the DataFrame-based API will become the primary API for MLlib. • Voted by community • org.apache.spark.ml, pyspark.ml The RDD-based API will enter maintenance mode. • Still maintained with bug fixes, but no new features •org.apache.spark.mllib, pyspark.mllib 33
  • 34. Why ML persistence? 34 Data Science Software Engineering Prototype (Python/R) Create model Re-implement model for production (Java) Deploy model
  • 35. Why ML persistence? 35 Data Science Software Engineering Prototype (Python/R) Create Pipeline • Extract raw features • Transform features • Select key features • Fit multiple models • Combine results to make prediction • Extra implementation work • Different code paths • Synchronization overhead Re-implement Pipeline for production (Java) Deploy Pipeline
  • 36. With ML persistence... 36 Data Science Software Engineering Prototype (Python/R) Create Pipeline Persist model or Pipeline: model.save(“s3n://...”) Load Pipeline (Scala/Java) Model.load(“s3n://…”) Deploy in production
  • 37. Model tuning ML persistence status 37 Text preprocessin g Feature generation Generalize d Linear Regressio n Unfitted Fitted Model Pipeline Supported in MLlib’s RDD-based API “recipe” “result”
  • 38. ML persistence status Near-complete coverage in all Spark language APIs • Scala & Java: complete (29 feature transformers, 21 models) • Python: complete except for 2 algorithms • R: complete for existing APIs Single underlying implementation of models Exchangeable data format • JSON for metadata • Parquet for model data (coefficients, etc.) 38
  • 39. A data scientist’s wish list: • Run original code on a production environment • Directly apply learned pipelines • Use MLlib as export format • Use distributed data sources • Builtin Spark conversions • Use familiar APIs and libraries • Distribute ML workload piece by piece • Easy to distribute the most common ML tasks 39
  • 40. What’s next? Prioritized items on the 2.1 roadmap JIRA (SPARK- 15581): • Critical feature completeness for the DataFrame-based API – Multiclass logistic regression – Statistics • Python API parity & R API expansion • Scaling & speed tuning for key algorithms: trees & ensembles GraphFrames • Release for Spark 2.0 • Speed improvements (join elimination, connected components) 40
  • 41. Get started • Get involved via roadmap JIRA (SPARK- 15581) + mailing lists • Download notebook for this talk http://dbricks.co/1UfvAH9 • ML persistence blog post http://databricks.com/blog/2016/05/31 41 Try out the Apache Spark 2.0 preview release: http://databricks.com/try

Editor's Notes

  1. Thanks I would like to discuss the integration between Apache Spark and single ML frameworks and tools, such as …
  2. Can I have a show of hands of many people in the audience have used scikit-learn, or R or Pandas?
  3. More than 1000 committers in Jan 2016
  4. Spark is a very simple tool to accelerate your computations, even if you do not use have big data Spark integrates well with existing libraries - easy to use as a simple scheduler - easy to write small bindings for the critical parts of libraries
  5. I am demonstrating from databricks Mounted S3 buckets Parquet datafiles
  6. - Be able to extract some smaller amount of data from the production storage system Slowly start distributing the system: piece by piece Easily switch between local and distributed Keep familiar APIs or even the same tools Show how thet distribution happens
  7. Reviews scraped f
  8. ML algorithms like to have numerical vectors
  9. Model training / tuning Regularization: parameter that controls how the linear model does on unseen data There is no single good value for the regularization parameter. One common method to find on is to try out different values. This technique is called CV: you split your training data into 2 sets: one set used to learn some parameters with a given regularization parameter, and another set to evaluate how well we are doing with the given parameter.
  10. Note: Recap verbally before this.
  11. - Be able to extract some smaller amount of data from the production storage system Slowly start distributing the system: piece by piece Easily switch between local and distributed Keep familiar APIs or even the same tools Show how thet distribution happens
  12. Note this is loading into Spark.
  13. Saving & loading ML types Models, both unfitted (“recipe”) & fitted Complex Pipelines, both unfitted (“workflow”) & fitted
  14. (DEMO)