MLeap: Deploy Spark ML Pipelines to Production API Servers

MLeap: Scaling Machine Learning
From Research to Production
https://github.com/combust/mleap
twitter: @combustml

Intros
Hollin Wilkins Mikhail Semeniuk

Our Talk in 3 Parts
1. What is MLeap?
○ Problem statement, architecture of the serialization format and execution engine, benchmarks
2. Future of MLeap / product roadmap
○ Beyond Spark and the JVM
3. Demo: Train and deploy a streaming model to an API server with MLeap-Serving

Initially
It took too long to get Spark models out into production

Original MLeap Requirements
- Has to eliminate re-coding of feature pipelines and models from research to production
- Serving/inference system has to be fast: sub-20ms at worst
- Should require a minimal amount of new code from the researcher to add new features/models
- Should be a lightweight library that users/organizations can customize as they see fit

Now
Solve for more than just Spark … we’ll talk about this later

New MLeap Requirements
- Inference needs to happen outside of the JVM (train in Spark, execute on an embedded device)
- Should support other popular ML frameworks like Scikit-Learn and TensorFlow

MLeap Architecture (high-level)
- A serialization framework for machine learning pipelines
- An execution engine for machine learning pipelines

From Research to Production in 3 Steps
1. Continue to write your ML pipelines and train your models in Spark
2. Serialize your entire feature pipeline and model(s) to an MLeap bundle, called bundle.ml
3. Load the serialized pipeline into MLeap Serving and execute it via a REST API, without any dependency on the SparkContext
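
The three steps above can be sketched in plain Python. This is a toy illustration of the train / serialize / load-and-execute split, not MLeap's API: the function names, the JSON payload, and the one-variable least-squares model are all assumptions made for the example.

```python
import json


def train(xs, ys):
    """Step 1 (stand-in for Spark training): fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return {"op": "linear_regression", "slope": slope, "intercept": mean_y - slope * mean_x}


def serialize(model):
    """Step 2: write the fitted parameters to a portable payload (hypothetical format)."""
    return json.dumps(model)


def load(payload):
    """Step 3: rebuild a predictor from the payload alone, with no training code needed."""
    model = json.loads(payload)

    def predict(x):
        return model["slope"] * x + model["intercept"]

    return predict


payload = serialize(train([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]))
predict = load(payload)
prediction = predict(4.0)
```

The point of the split is that `load` only needs the payload: the serving side never imports the training code, which mirrors the "no dependency on the SparkContext" requirement.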

MLeap Bundle
(a.k.a. bundle.ml)

bundle.ml: Serialization Framework
Pipeline diagram (bundle 1):
- Continuous Features -> Vector Assembler -> Continuous Feature Vector -> Standard Scaler -> Scaled Continuous Feature Vector
- Categorical Feature -> String Indexer -> OneHotEncoder (two parallel branches) -> Vector Assembler -> Categorical Feature Vector
- Scaled Continuous Feature Vector + Categorical Feature Vector -> Vector Assembler -> Linear Regression

bundle.ml: Serialization Framework
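Pipeline diagram (bundle 2):
- Continuous Features -> Vector Assembler -> Continuous Feature Vector -> Standard Scaler -> Scaled Continuous Feature Vector
- Categorical Feature -> String Indexer -> OneHotEncoder (two parallel branches) -> Vector Assembler -> Categorical Feature Vector
- Scaled Continuous Feature Vector + Categorical Feature Vector -> Vector Assembler -> Random Forest Regression
- Additional stage in this pipeline: PCA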

bundle.ml: Structure
- Bundle.json: root-level metadata about the pipeline (version, names, etc.)
- Model.json: data required to execute the model (coefficients, decision trees, intercepts, string lookups, etc.)
- Node.json: connects input/output data for models to a LeapFrame (features for a logistic regression, prediction field for a random forest, etc.)
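
As a rough illustration of how the three files divide the work, the sketch below packs a one-node pipeline into a zip archive and reads it back. The directory layout, the lowercase file names, and every JSON field name here are assumptions made for the example; they are not the actual bundle.ml schema.

```python
import io
import json
import zipfile


def write_bundle(pipeline_name, nodes):
    """Pack a pipeline into an in-memory zip (illustrative layout, not the real format)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Root-level metadata: pipeline name and the ordered list of nodes.
        root = {"name": pipeline_name, "format": "json", "nodes": [n["name"] for n in nodes]}
        archive.writestr("bundle.json", json.dumps(root))
        for node in nodes:
            # model.json holds what is needed to execute; node.json holds the wiring.
            archive.writestr(f"{node['name']}/model.json", json.dumps(node["model"]))
            archive.writestr(f"{node['name']}/node.json", json.dumps(node["node"]))
    return buffer.getvalue()


def read_bundle(data):
    """Read the archive back into (root metadata, list of nodes)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = json.loads(archive.read("bundle.json"))
        nodes = [
            {
                "name": name,
                "model": json.loads(archive.read(f"{name}/model.json")),
                "node": json.loads(archive.read(f"{name}/node.json")),
            }
            for name in root["nodes"]
        ]
    return root, nodes


example_nodes = [
    {
        "name": "linear_regression",
        "model": {"coefficients": [0.5, 2.0], "intercept": 1.0},
        "node": {"inputs": ["features"], "outputs": ["prediction"]},
    }
]
root, restored_nodes = read_bundle(write_bundle("listing_price", example_nodes))
```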

Custom Transformers
Diagram: a Custom Spark Transformer and a Custom MLeap Transformer each connect to the MLeap Bundle through their own serializer (Bundle Spark Serializer, Bundle MLeap Serializer)
- Define your model and node conversions
- Bundle.ML handles serializing as either JSON or Protobuf
- All transformers in MLeap are implemented in this way
- Custom MLeap transformers: Unary/Binary, SVM, Imputer Logic
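
To make "model and node conversions" concrete, here is a minimal sketch of a custom imputer with one conversion for its model data and one for its input/output wiring. The class, the function names, and the dictionary fields are hypothetical and only mirror the idea; MLeap's real extension points are Scala traits with their own signatures.

```python
import json
import math


class ImputerSketch:
    """Replaces missing values (None or NaN) in one column with a fixed surrogate."""

    def __init__(self, input_col, output_col, surrogate):
        self.input_col = input_col
        self.output_col = output_col
        self.surrogate = surrogate

    def transform(self, row):
        value = row.get(self.input_col)
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        return {**row, self.output_col: self.surrogate if missing else value}


def model_to_dict(transformer):
    """Model conversion: only the data needed to execute (hypothetical fields)."""
    return {"op": "imputer", "attributes": {"surrogate": transformer.surrogate}}


def node_to_dict(transformer):
    """Node conversion: how the transformer is wired to frame columns."""
    return {"inputs": [transformer.input_col], "outputs": [transformer.output_col]}


def from_dicts(model, node):
    """Rebuild the transformer from its two serialized parts."""
    return ImputerSketch(node["inputs"][0], node["outputs"][0], model["attributes"]["surrogate"])


original = ImputerSketch("avg_rating", "avg_rating_imputed", 0.9)
# Round-trip through JSON text, as a serializer would.
model_json = json.dumps(model_to_dict(original))
node_json = json.dumps(node_to_dict(original))
restored = from_dicts(json.loads(model_json), json.loads(node_json))
filled = restored.transform({"avg_rating": None})
kept = restored.transform({"avg_rating": 0.5})
```

Keeping the model data separate from the column wiring is what lets the same serialized model be re-attached to differently named columns.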

Core Concepts: Data Frame (LeapFrame)
square_feet (Int)   room_type (string)   avg_rating (double)   is_special (bool)
1200                House                .93                   true
800                 Apartment            .90                   false
1. Schema defines names and types of columns
2. Rows to hold data
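
The schema-plus-rows idea can be sketched as a small immutable structure. This is an assumption-laden toy, not the LeapFrame API: the class name, the type names, and the `with_column` method are invented for the example, using the listing data from the slide.

```python
from dataclasses import dataclass

# Mapping from schema type names to Python types (names chosen for this sketch).
TYPES = {"int": int, "string": str, "double": float, "bool": bool}


@dataclass(frozen=True)
class LeapFrameSketch:
    schema: tuple  # ((column_name, type_name), ...)
    rows: tuple    # one tuple of values per row

    def __post_init__(self):
        # Check every row against the schema: arity first, then exact value types.
        for row in self.rows:
            if len(row) != len(self.schema):
                raise ValueError("row length does not match schema")
            for (name, type_name), value in zip(self.schema, row):
                if type(value) is not TYPES[type_name]:
                    raise TypeError(f"column {name!r} expects {type_name}")

    def column(self, name):
        index = [n for n, _ in self.schema].index(name)
        return [row[index] for row in self.rows]

    def with_column(self, name, type_name, fn):
        """Return a new frame with one extra column computed from each row."""
        names = [n for n, _ in self.schema]
        new_rows = tuple(row + (fn(dict(zip(names, row))),) for row in self.rows)
        return LeapFrameSketch(self.schema + ((name, type_name),), new_rows)


frame = LeapFrameSketch(
    schema=(("square_feet", "int"), ("room_type", "string"),
            ("avg_rating", "double"), ("is_special", "bool")),
    rows=((1200, "House", 0.93, True), (800, "Apartment", 0.90, False)),
)
# A transformer reads existing columns and appends its output column.
with_flag = frame.with_column("is_large", "bool", lambda r: r["square_feet"] > 1000)
```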

Diagram: MLeap Bundles, executed by MLeap-JVM and MLeap-RS, targeting Cloud APIs and IoT Devices

MLeap <> Scikit-Learn
- Preprocessing (Scalers, Label/OneHot Encoder)
- Base Models (Linear/Logistic)
- Dimension Reduction (PCA)
- Other: PCA, RF, GBRT
- Pipelines/Feature Unions
Diagram: MLeap Bundles -> MLeap-JVM

MLeap <> TensorFlow
Diagram: Spark Feature Pipeline with a TensorFlow Transformer (TF via JNI); MLeap Serving (TF via JNI)
MLeap Benchmarks
[Benchmark charts: Default MLeap Transformer (Random Forest) and MLeap Row Transformer (Random Forest)]

MLeap Deployed in Prod
Deployed With MLeap
- MLeap: 4-15ms
- Spark: 0.2s-1s

MLeap Roadmap
MLeap <> Spark
- Full Streaming Support
- Fully typed transformer pipelines
MLeap <> Scikit
- Scikit <> Spark Transformer Parity
- Protobuf Serialization
- Deserialization
MLeap <> Rust
- MIR: mid-level intermediate representation
- JIT/Interpreted Mode

Grow the Community
14 contributors, 21k lines of code, started 2016
- Become a contributor!
- File an issue report
- Write a cool demo using MLeap
- Discuss the future of MLeap
- Chat with us about your use case
- Let us help if you run into any problems
- Looking for Product Managers for MLeap
- Share your MLeap success story
- Write a blog post about your MLeap project

MLeap Rust
Backends: CPU, CUDA/HIP, OpenCL
Language bindings: Python, Obj-C/Swift, Go, Ruby, C#

Demo
1. Apartment Listing Data -> Train Spark Feature Pipeline + Linear Regression Model
2. Feature Pipeline Bundle -> MLeap Serving (Client API)
3. Streaming Listing Data -> Transform Features -> Stream to Socket -> Online Linear Regression
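
The "Online Linear Regression" stage can be illustrated with a per-record stochastic gradient update. This is a sketch under assumptions: the slides do not say which online algorithm the demo uses, and the function name, learning rate, and synthetic stream (targets generated as 2x + 1) are made up for the example.

```python
def sgd_update(weights, bias, features, target, learning_rate):
    """One online step: predict, measure the error, nudge the parameters."""
    prediction = sum(w * x for w, x in zip(weights, features)) + bias
    error = prediction - target
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


def stream(passes):
    """Synthetic stand-in for the feature stream: yields (features, target) pairs."""
    for _ in range(passes):
        for x in (0.0, 0.25, 0.5, 0.75, 1.0):
            yield [x], 2.0 * x + 1.0


weights, bias = [0.0], 0.0
for features, target in stream(passes=400):
    # Each record updates the model as it arrives; no batch is kept in memory.
    weights, bias = sgd_update(weights, bias, features, target, learning_rate=0.1)
```

Because the synthetic targets are noise-free, the learned weight and bias approach the generating values of 2 and 1; on real listing data they would keep drifting with the stream.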

Thank You!
Hollin Wilkins
e: hollin@combust.ml
tw: @hollinwilkins
https://github.com/combust/mleap
Mikhail Semeniuk
e: mikhail@combust.ml
tw: @mikhailsemeniuk
https://gitter.im/combust/mleap
https://twitter.com/combustml
