MLeap is an open-source technology that allows Data Scientists and Engineers to deploy Spark-trained ML Pipelines and Models to a scoring engine instantly. During our presentation, we will show you how to deploy any Spark ML Pipeline, along with custom transformers, trained using Spark Streaming, to both a cloud-based API server and an IoT device.
Why MLeap? Data Scientists use myriad tools to analyze datasets, clean them, build offline models, and validate their performance. The resulting scripts are thrown over the wall to Data Engineers and Architects, whose job is to bring these pipelines to production. The Engineers are left with the unenviable job of not only reproducing the Data Scientists’ conclusions but also scaling the resulting pipeline, both of which require a deep understanding of Data Science itself. As a result, most if not all Data Science deployments in the wild end up either too simplistic or taking too long to productionize.
MLeap solves this problem for Spark users by serializing the transformers of an ML Pipeline to an MLeap Bundle, a graph-based serialization framework built on top of Protobuf 3 and JSON. MLeap also provides a highly optimized execution engine that doesn’t rely on the Spark context, which makes inference fast and lets it execute one model or thousands of models in parallel.
3. Our Talk in 3 Parts
1. What is MLeap?
○ Problem Statement + Architecture of serialization format and execution engine + Benchmarks
2. Future of MLeap/Product Roadmap
○ Beyond Spark and JVM
3. Demo: Train and deploy a streaming model to an API server with MLeap-Serving
6. Original MLeap Requirements
- Has to eliminate re-coding of feature pipelines and models from research to production
- Serving/inference system has to be fast, sub-20ms at worst
- Should require minimal amount of new code to be written by the researcher to add new features/models
- Should be a lightweight library that will allow users/organizations to customize as they see fit
8. New MLeap Requirements
- Inference needs to happen outside of the JVM (Train in Spark, execute on an embedded device)
- Should support other popular ML frameworks like Scikit-Learn and TensorFlow
9. MLeap Architecture (high-level)
- A Serialization Framework for Machine Learning Pipelines
- An Execution Engine for Machine Learning Pipelines
10. From Research to Production in 3 Steps
1. Continue to write your ML pipelines and training of models in Spark
2. Serialize your entire feature pipeline and model(s) to an MLeap bundle, called bundle.ml
3. Load the serialized pipeline to MLeap serving and execute via a REST API, without any dependency on the Spark context
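The three steps can be illustrated with a toy stand-in written in plain Python. This is only a sketch of the idea, not MLeap's API: `train_linear_model`, `serialize`, and `load_and_score` are hypothetical names, the JSON payload layout is an assumption, and a one-variable least-squares fit stands in for Spark training. The point it shows is that the scoring side needs only the serialized payload, not the training code.

```python
import json


def train_linear_model(xs, ys):
    """Step 1 stand-in: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def serialize(model):
    """Step 2 stand-in: put everything needed for scoring into a portable payload.

    The "op"/"attributes" layout is a hypothetical format for this sketch.
    """
    return json.dumps({"op": "linear_regression", "attributes": model})


def load_and_score(payload, x):
    """Step 3 stand-in: score using only the payload, with no training code involved."""
    attrs = json.loads(payload)["attributes"]
    return attrs["slope"] * x + attrs["intercept"]


# Toy training data that follows y = 2x + 1.
payload = serialize(train_linear_model([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]))
prediction = load_and_score(payload, 4.0)
print(prediction)
```

In a real deployment the payload would be the bundle file and `load_and_score` would sit behind the serving API; here both live in one process to keep the sketch short.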
14. bundle.ml: Structure
Bundle.json - Root-level metadata about the pipeline (version, names, etc.)
Model.json - Data required to execute the model (coefficients, decision trees, intercepts, string lookups, etc.)
Node.json - Connects input/output data for models to a LeapFrame (features for a logistic regression, prediction field for a random forest, etc.)
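To make the division of labor between the three files concrete, here is a small Python sketch that writes a toy bundle directory and scores a row from it. Everything inside the files is an assumption made for illustration: the field names (`name`, `version`, `op`, `coefficients`, `intercept`, `inputs`, `output`), the flat single-directory layout, and the lowercase filenames are hypothetical and are not taken from the MLeap format specification.

```python
import json
import tempfile
from pathlib import Path


def write_bundle(root):
    """Write three JSON files mirroring the roles described on the slide."""
    files = {
        # Root-level metadata about the pipeline (hypothetical fields).
        "bundle.json": {"name": "listing_price", "version": "0.1"},
        # Data required to execute the model (hypothetical fields).
        "model.json": {
            "op": "linear_regression",
            "coefficients": [0.5, 2.0],
            "intercept": 10.0,
        },
        # Wiring between frame columns and the model (hypothetical fields).
        "node.json": {"inputs": ["square_feet", "avg_rating"], "output": "prediction"},
    }
    for filename, content in files.items():
        (root / filename).write_text(json.dumps(content, indent=2))


def score(root, row):
    """Read model.json and node.json back and apply the model to one row."""
    model = json.loads((root / "model.json").read_text())
    node = json.loads((root / "node.json").read_text())
    features = [row[name] for name in node["inputs"]]
    value = model["intercept"] + sum(
        c * f for c, f in zip(model["coefficients"], features)
    )
    # Return the input row extended with the output column named by the node.
    return {**row, node["output"]: value}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_bundle(root)
    written = sorted(p.name for p in root.iterdir())
    scored = score(root, {"square_feet": 100.0, "avg_rating": 1.0})

print(written)
print(scored)
```

Keeping the model data apart from the input/output wiring means the same coefficients could be attached to differently named columns by changing only the node file.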
15. Custom Transformers
[Diagram labels: Custom Spark Transformer; Custom MLeap Transformer; Bundle Spark Serializer; Bundle MLeap Serializer; MLeap Bundle]
- Define your model and node conversions
- Bundle.ML handles serializing as either JSON or Protobuf
- All transformers in MLeap are implemented in this way
- Custom MLeap TFs: Unary/Binary, SVM, Imputer Logic
16. Core Concepts: Data Frame (LeapFrame)
square_feet (Int) | room_type (string) | avg_rating (double) | is_special (bool)
1200 | House | .93 | true
800 | Apartment | .90 | false
1. Schema defines names and types of columns
2. Rows to hold data
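The schema-plus-rows idea can be sketched in a few lines of Python using the example listing data from the slide. This is not MLeap's LeapFrame implementation: the `Field` and `Frame` classes and their `select` and `with_column` methods are hypothetical names chosen for this sketch, and Python types stand in for the column types shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One schema entry: a column name and the type its values must have."""
    name: str
    dtype: type


class Frame:
    """A minimal frame: a schema (list of fields) plus rows (tuples of values)."""

    def __init__(self, schema, rows):
        for row in rows:
            if len(row) != len(schema):
                raise ValueError("row length does not match schema")
            for field, value in zip(schema, row):
                if not isinstance(value, field.dtype):
                    raise TypeError(f"{field.name} expects {field.dtype.__name__}")
        self.schema = list(schema)
        self.rows = [tuple(row) for row in rows]

    def _index(self, name):
        return [field.name for field in self.schema].index(name)

    def select(self, name):
        """Return all values of one column."""
        idx = self._index(name)
        return [row[idx] for row in self.rows]

    def with_column(self, field, source, fn):
        """Return a new frame with one extra column computed from `source`."""
        idx = self._index(source)
        new_rows = [row + (fn(row[idx]),) for row in self.rows]
        return Frame(self.schema + [field], new_rows)


schema = [
    Field("square_feet", int),
    Field("room_type", str),
    Field("avg_rating", float),
    Field("is_special", bool),
]
frame = Frame(schema, [(1200, "House", 0.93, True), (800, "Apartment", 0.90, False)])

# A transformer adds an output column and leaves the input columns untouched.
bigger = frame.with_column(Field("is_large", bool), "square_feet", lambda v: v >= 1000)
print(bigger.select("is_large"))
```

Returning a new frame from `with_column` rather than mutating in place makes it easy to chain transformers, which is one plausible reason a pipeline engine would favor this shape.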
25. Grow the Community
14 Contributors | 21k Lines of Code | Started 2016
- Become a contributor!
- File an issue report
- Write a cool demo using MLeap
- Discuss the future of MLeap
- Chat with us about your use case
- Let us help if you run into any problems
Deployed With MLeap
- Looking for Product Managers for MLeap
- Share your MLeap success story
- Write a blog post about your MLeap project
27. Demo
[Diagram labels: Train Spark Feature Pipeline + Linear Regression Model; Apartment Listing Data; MLeap Serving; Client API; Feature Pipeline Bundle; Streaming Listing Data; Transform Features; Stream to Socket; Online Linear Regression]
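The "Online Linear Regression" stage of the demo can be pictured as a model that updates its weights one record at a time as transformed features arrive. The sketch below is an assumption about what such a stage might look like, not the code used in the demo: it uses plain stochastic gradient descent on squared error, a Python generator stands in for the socket stream, and the class name, learning rate, and toy data (which follow y = 2x + 1) are all hypothetical.

```python
class OnlineLinearRegression:
    """Linear model updated one example at a time with stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, target):
        """Take one gradient step on the squared error for a single example."""
        error = self.predict(features) - target
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error
        return error


def listing_stream(repeats):
    """Stand-in for the socket stream: yields (features, target) pairs.

    The rows are hypothetical, already-transformed features with target = 2 * x + 1.
    """
    samples = [([0.0], 1.0), ([0.5], 2.0), ([1.0], 3.0)]
    for _ in range(repeats):
        yield from samples


model = OnlineLinearRegression(n_features=1, learning_rate=0.1)
for features, target in listing_stream(repeats=2000):
    model.update(features, target)

print(model.weights, model.bias)
```

Because each update touches only one record, the model never needs the full dataset in memory, which is what makes this style of training a natural fit for streaming input.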