In the Apache Spark 2.x releases, Machine Learning (ML) is focusing on DataFrame-based APIs. This webinar is aimed at helping users take full advantage of the new APIs. Topics will include migrating workloads from RDDs to DataFrames, ML persistence for saving and loading models, and the roadmap ahead.
Migrating ML workloads to use Spark DataFrames and Datasets allows users to benefit from simpler APIs, plus speed and scalability improvements. As the DataFrame/Dataset API becomes the primary API for data in Spark, this migration will become increasingly important to MLlib users, especially for integrating ML with the rest of Spark data processing workloads. We will give a tutorial covering best practices and some of the immediate and future benefits to expect.
ML persistence is one of the biggest improvements in the DataFrame-based API. With Spark 2.0, almost all ML algorithms can be saved and loaded, even across languages. ML persistence dramatically simplifies collaborating across teams and moving ML models to production. We will demonstrate how to use persistence, and we will discuss a few existing issues and workarounds.
At the end of the webinar, we will discuss major roadmap items. These include API coverage, major speed and scalability improvements to certain algorithms, and integration with structured streaming.
4. About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and Apache Spark Committer & PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
5. About the speaker: Jules S. Damji
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems.
6. Databricks
Founded by the creators of Apache Spark in 2013.
75% share of Spark code contributed by Databricks in 2014.
Created Databricks on top of Spark to make big data simple.
7. Apache Spark Engine
Spark Core with Spark Streaming, Spark SQL, MLlib, and GraphX
• Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, & R APIs
• Standard libraries
9. Notable users that presented at Spark Summit 2015 San Francisco
Source: Slide 5 of Spark Community Update
10. Outline
Intro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap ahead during 2.x
12. A bit of MLlib history
Spark 0.8: RDD-based API
• Fast, scale-out ML
Challenges:
• Expressing complex workflows
• Integrating with DataFrames
• Developing Java, Python & R APIs
Spark 1.2: DataFrame-based API (a.k.a. “Spark ML”)
Major improvements:
• ML Pipelines with automated tuning
• Native DataFrame integration
• Standard API across languages
See Xiangrui Meng’s original design & prototype in SPARK-3530.
14. DataFrame-based API for MLlib
DataFrames are the standard ML dataset type.
Uniform APIs for algorithms, hyperparameters, etc.
Pipelines provide utilities for constructing ML workflows and automating hyperparameter tuning.
Learn more about ML Pipelines:
http://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2
http://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
15. DataFrame-based API for MLlib
In 2.0, the DataFrame-based API became the primary MLlib API.
• Voted by community
• org.apache.spark.ml, pyspark.ml
The RDD-based API is in maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib
16. Outline
Intro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap during 2.x
17. Why migrate to DataFrames?
Three reasons to migrate: DataFrames, language APIs, and Pipelines.
DataFrames & Datasets are the new “core” API for Spark.
• Data sources & ETL
• Latest performance improvements (Catalyst & Tungsten)
• Structured Streaming
18. Why migrate to DataFrames?
Standardized across Scala, Java, Python, and R
• Python & R match Scala/Java performance
• Cross-language persistence (saving/loading models)
19. Why migrate to DataFrames?
Specify complex ML workflows
• Chain together Transformers, Estimators, & Models
• Automated hyperparameter tuning
20. Demo migration
Convert a notebook from the RDD-based API
to the DataFrame-based API.
Key points
• Work with single models or complex Pipelines
• Incremental migration
• Many benefits: simpler APIs, SQL integration, tuning
• A few gotchas (linear algebra types)
Warning: Demo for experts!
21. Demo recap: migration process
Separate the 2 migrations:
• Spark 1.6 → 2.0
• RDDs → DataFrames
Migrate ML APIs: spark.mllib → spark.ml
• Gotcha: a few naming changes (from standardizing algorithm APIs)
• Certain Param and model methods
• run() → fit()
• Tips:
• Use explainParams()
• Compare the API docs if you hit issues!
Migrate data APIs: RDDs → DataFrames
• Tip: Get familiar with conversion syntax in both directions.
22. Demo recap: migration process
Debugging runtime errors
• Gotcha: Lazy evaluation in Pipelines means bugs appear later than expected.
• Tip: Check intermediate results.
• Gotcha: Vector (and Matrix) types differ between spark.mllib and spark.ml.
• Relevant for the Spark 1.6 → 2.0 migration
• Tip: Watch for buried errors: MatchError and mentions of “vector”
• Tip: Use helper methods for conversion
• org.apache.spark.mllib.linalg.Vector.asML
• org.apache.spark.mllib.linalg.Vectors.fromML
• http://spark.apache.org/docs/latest/ml-guide.html#migration-guide
23. Future benefits of migration
Currently, ML training is implemented on RDDs.
Goal: port the implementation to DataFrames and benefit from DataFrame optimizations (Catalyst, Tungsten).
(Diagram: MLlib and Spark SQL (DataFrames, Datasets, SQL) layered on Spark Core RDDs)
24. Future benefits of migration
Status: the first published implementation is in GraphFrames (a Spark package for graph processing).
Ongoing work: DataFrame improvements for iterative algorithms: checkpointing, improved caching, and more.
(Diagram: MLlib and Spark SQL (DataFrames, Datasets, SQL) layered on Spark Core RDDs)
25. Outline
Intro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap during 2.x
26. Why ML persistence?
Data Science: prototype in Python/R, create a model.
Software Engineering: re-implement the model for production (Java), deploy the model.
27. Why ML persistence?
Data Science: prototype in Python/R, create a Pipeline:
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make a prediction
Software Engineering: re-implement the Pipeline for production (Java), deploy the Pipeline.
Costs of re-implementation:
• Extra implementation work
• Different code paths
• Synchronization overhead
28. With ML persistence...
Data Science: prototype in Python/R, create a Pipeline, and persist the model or Pipeline:
model.save("s3n://...")
Software Engineering: load the Pipeline (Scala/Java) and deploy in production:
Model.load("s3n://...")
29. ML persistence status
(Diagram: a Pipeline of text preprocessing, feature generation, model tuning, and a random forest, shown both unfitted (the “recipe”) and fitted (the “result”). Both single Models and whole Pipelines can be saved in either state.)
30. ML persistence status
Near-complete coverage in all Spark language APIs
• Scala & Java: complete
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
31. Demo: ML persistence
• Can persist single models & complex workflows
• Easy to move models across Spark deployments
• Share models across teams & languages
32. ML persistence: pending issues
Python tuning: not yet implemented
• CrossValidator, TrainValidationSplit
R format: incompatible with Python/Java/Scala
• Issue: R wrappers are all special Pipelines.
• Working towards a fix
• Workaround: Load the underlying PipelineModel from a subfolder in the saved model directory.
Backwards compatibility: WIP in SPARK-15573
ML persistence blog post:
http://databricks.com/blog/2016/05/31
33. Outline
Intro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap during 2.x
34. Goals for MLlib in 2.x
Major initiatives
• ML persistence: saving & loading models & Pipelines
• Complete feature parity for the DataFrame-based API. Missing items:
• Frequent Pattern Mining
• Certain methods in models
• Developer APIs
Other important improvements
• Generalized Linear Models
• Python & R API parity
• Speed & scalability improvements
For an overview of MLlib in 2.0, see
http://spark-summit.org/2016/events/apache-spark-mllib-20-preview-data-science-and-production
35. Coming in 2.1
Multiclass logistic regression (SPARK-7159)
Locality sensitive hashing (SPARK-5992)
More ML in SparkR (SPARK-16442)
• ALS
• Isotonic Regression
• Multilayer Perceptron Classifier
• Random Forest
• Gaussian Mixture Model
• LDA
• Multiclass Logistic Regression
• Gradient Boosted Trees
Various speed & scalability improvements
• Random Forest, Naive Bayes, LDA, Gaussian Mixture, and others
Spark 2.1 status: release candidates are under QA.
For the release schedule, see http://spark.apache.org/versioning-policy.html
36. Get started
Get involved in the community
• Events & news: https://sparkhub.databricks.com/
• User mailing list: http://spark.apache.org/community.html
Get involved in development
• Dev mailing list: http://spark.apache.org/community.html
• JIRA: http://issues.apache.org/jira/browse/SPARK
• Contribute: http://spark.apache.org/contributing.html
Try out Apache Spark for free on Databricks Community Edition! http://databricks.com/try
Many thanks to the Apache Spark community!
Original Pipeline JIRA: http://issues.apache.org/jira/browse/SPARK-3530
“Certain Param and model methods” is an algorithm-specific issue. All other issues are general across MLlib.
Note this is loading into Spark.
Saving & loading ML types: Models, both unfitted (“recipe”) & fitted; complex Pipelines, both unfitted (“workflow”) & fitted.