What MLflow is; what problem it solves in the machine learning lifecycle; how it solves it; how it is used with Databricks; and CI/CD pipelines with Databricks.
2. Problems of the Machine Learning Workflow
• Difficult to keep track of experiments
• Difficult to reproduce code
• No standard way to package and deploy models
6. Scalability and Big Data
MLflow supports scaling in three dimensions:
1. An individual MLflow run can execute on a distributed cluster, for example,
using Apache Spark. You can launch runs on the distributed infrastructure of your
choice and report results to a Tracking Server to compare them. MLflow includes a
built-in API to launch runs on Databricks.
2. MLflow supports launching multiple runs in parallel with different parameters, for
example, for hyperparameter tuning. You can simply use the Projects API to start
multiple runs and the Tracking API to track them.
3. MLflow Projects can take input from, and write output to, distributed storage
systems such as AWS S3 and DBFS. MLflow can automatically download such files
locally for projects that can only run on local files, or give the project a distributed
storage URI if it supports that. This means that you can write projects that build
large datasets, such as featurizing a 100 TB file.
7. MLflow Components - Tracking
• MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for later visualizing the results. You can use MLflow Tracking in any environment (for example, a standalone script or a notebook) to log results to local files or to a server, then compare multiple runs. Teams can also use it to compare results from different users.
9. Multi-step Tracking
• A typical flow:
• Step 1: download data from a URL
• Step 2: transform and load the downloaded dataset to another location
• Step 3: use Spark (Databricks) to train your model
• Step 4: share your model with application developers
You can use the MLflow Tracking API to track information for each step.
10. MLflow Components - Projects
• MLflow Projects are a standard format for packaging reusable data science code. Each project is simply a directory with code or a Git repository, and uses a descriptor file or simply convention to specify its dependencies and how to run the code. For example, projects can contain a conda.yaml file for specifying a Python Conda environment. When you use the MLflow Tracking API in a Project, MLflow automatically remembers the project version (for example, Git commit) and any parameters. You can easily run existing MLflow Projects from GitHub or your own Git repository, and chain them into multi-step workflows.
mlflow run https://github.com/mlflow/mlflow-example.git -P alpha=0.4
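For reference, a minimal MLproject descriptor for a project like this might look as follows (the file names and entry-point command are illustrative):

```yaml
name: mlflow-example
conda_env: conda.yaml        # Conda environment listing the dependencies
entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
    command: "python train.py --alpha {alpha}"
```

The `-P alpha=0.4` flag above overrides the `alpha` parameter declared in the `main` entry point.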
11. MLflow Components - Models
• MLflow Models offer a convention for packaging machine learning models in multiple flavors, and a variety of tools to help you deploy them. Each Model is saved as a directory containing arbitrary files and a descriptor file that lists several "flavors" the model can be used in. For example, a TensorFlow model can be loaded as a TensorFlow DAG, or as a Python function to apply to input data. MLflow provides tools to deploy many common model types to diverse platforms: for example, any model supporting the "Python function" flavor can be deployed to a Docker-based REST server, to cloud platforms such as Azure ML and AWS SageMaker, and as a user-defined function in Apache Spark for batch and streaming inference. If you output MLflow Models using the Tracking API, MLflow also automatically remembers which Project and run they came from.
12. MLflow Models
• Storage Format
• MLflow defines several “standard” flavors that all of its built-in
deployment tools support, such as a “Python function” flavor that
describes how to run the model as a Python function.
• However, libraries can also define and use other flavors. For example, MLflow's mlflow.sklearn library allows loading models back as a scikit-learn Pipeline object for use in code that is aware of scikit-learn, or as a generic Python function for use in tools that just need to apply the model (for example, the mlflow sagemaker tool for deploying models to Amazon SageMaker).
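Concretely, the descriptor is a small `MLmodel` file in the model directory listing each flavor. A hedged example of its shape (the paths and version numbers here are placeholders, not from the deck):

```yaml
artifact_path: model
flavors:
  python_function:
    loader_module: mlflow.sklearn   # generic flavor: load-and-predict entry point
    env: conda.yaml
  sklearn:
    pickled_model: model.pkl        # library-specific flavor
    sklearn_version: 1.0.2
```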
13. Built-in Model Flavors
• Python Function
• R Function
• H2O
• Keras
• MLeap
• PyTorch
• Scikit-learn
• Spark MLlib
• TensorFlow
• ONNX
14. Saving & Serving Models
• MLflow includes a generic MLmodel format for saving models from a
variety of tools in diverse flavors. For example, many models can be
served as Python functions, so an MLmodel file can declare how each
model should be interpreted as a Python function in order to let
various tools serve it. MLflow also includes tools for running such
models locally and exporting them to Docker containers or
commercial serving platforms.
• for example, batch inference on Apache Spark and real-time serving
through a REST API
mlflow models serve -m runs:/<RUN_ID>/model
15. A local RESTful service demo
• Modeling wine preferences by data mining from physicochemical properties
• mlflow models serve -m runs:/ea388ca349964193a17f2823480fc6bf/model --port 5001
• curl -d '{"columns":["x"], "data":[[1], [-1]]}' -H 'Content-Type: application/json; format=pandas-split' -X POST localhost:5001/invocations
• Machine Learning Model as a Service
16. More about model service
• RESTful service for batch or low-latency inference, with or without Databricks (Spark)
• Deploy with Azure ML
• The mlflow.azureml module can package python_function models into Azure
ML container images.
• Example workflow using the Python API
• https://www.mlflow.org/docs/latest/models.html#built-in-deployment-tools
17. Summary: Use Cases
• Individual Data Scientists can use MLflow Tracking to track experiments locally on their machine, organize code in projects for
future reuse, and output models that production engineers can then deploy using MLflow’s deployment tools. MLflow Tracking
just reads and writes files to the local file system by default, so there is no need to deploy a server.
• Data Science Teams can deploy an MLflow Tracking server to log and compare results across multiple users working on the same
problem. By setting up a convention for naming their parameters and metrics, they can try different algorithms to tackle the same
problem and then run the same algorithms again on new data to compare models in the future. Moreover, anyone can download
and run another model.
• Large Organizations can share projects, models, and results using MLflow. Any team can run another team’s code using MLflow
Projects, so organizations can package useful training and data preparation steps that other teams can use, or compare results
from many teams on the same task. Moreover, engineering teams can easily move workflows from R&D to staging to production.
• Production Engineers can deploy models from diverse ML libraries in the same way, store the models as files in a management
system of their choice, and track which run a model came from.
• Researchers and Open Source Developers can publish code to GitHub in the MLflow Project format, making it easy for anyone to
run their code using the mlflow run github.com/... command.
• ML Library Developers can output models in the MLflow Model format to have them automatically support deployment using
MLflow’s built-in tools. In addition, deployment tool developers (for example, a cloud vendor building a serving platform) can
automatically support a large variety of models.
18. Part II: Working with Databricks
• MLflow on Databricks integrates with the complete
Databricks Unified Analytics Platform, including
Notebooks, Jobs, Databricks Delta, and the Databricks
security model, enabling you to run your existing MLflow
jobs at scale in a secure, production-ready manner.
mlflow run git@github.com:mlflow/mlflow-example.git -P alpha=0.5 -b databricks --backend-config json-cluster-spec.json
Step by step guide: https://medium.com/@liangjunjiang/install-mlflow-on-databricks-55b11bc023fa
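The `--backend-config` file above is a Databricks cluster specification. A hedged example of what `json-cluster-spec.json` might contain (the node type and Spark version are illustrative; valid values depend on your workspace):

```json
{
  "spark_version": "5.0.x-scala2.11",
  "num_workers": 2,
  "node_type_id": "Standard_DS3_v2"
}
```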
19. Extra: CI/CD on Databricks
1. https://thedataguy.blog/ci-cd-with-databricks-and-azure-devops/
2. https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html
• Databricks doesn’t support Enterprise GitHub (yet)
• Azure DevOps can do per-file git tracking
• Coding with Databricks requires support for: 1. your code; 2. the libraries you use; 3. the machine (cluster) you will run on
20. CI/CD with Databricks - Development
• A four-step process for development
1. Use git to manage your project locally (as usual). Your project should include data, your notebooks, the libraries you will use, etc.
2. Copy your project content to the Databricks Workspace.
3. Once you are done with your notebook coding, export from the Databricks Workspace to your local machine.
4. Use git commands to push to the remote repo.
21. CI/CD with Databricks – Unit Test or Integration Test
• Solution 1:
• Rewrite your notebook code as Java/Scala classes or Python packages using an IDE
• Write unit tests for those classes with the IDE
• Remember to split the core logic from your library
• Import your library into Databricks and let your notebooks interact with it
• In the end, your code has two parts: libraries and notebooks
• Solution 2:
• Everything is in a package; use Spark to run it
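Solution 1 can be sketched like this: the core logic lives in a plain Python function that notebooks call, so it can be unit tested outside Databricks. `clean_rows` is a hypothetical helper, not part of any real notebook.

```python
import unittest

def clean_rows(rows):
    """Drop rows that contain missing values (core logic, notebook-free)."""
    return [r for r in rows if all(v is not None for v in r)]

class CleanRowsTest(unittest.TestCase):
    def test_drops_incomplete_rows(self):
        # Only the complete row should survive.
        self.assertEqual(clean_rows([(1, 2), (1, None)]), [(1, 2)])
```

Run the tests locally with `python -m unittest`; the same function is then imported by the notebook on Databricks.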
22. CI/CD with Databricks – Build
• Similar to the development process
• Assign a dedicated production cluster
23. Alternative to this Presentation
• You probably should just watch this MLflow Spark + AI Summit keynote video
• https://vimeo.com/274266886#t=33s