This document discusses platforms for democratizing data science and enabling enterprise-grade machine learning (EGML) applications. It introduces Flock, a platform that aims to automate the machine learning lifecycle, including tracking experiments, managing models, and deploying models to production. It demonstrates Flock by instrumenting Python code for a LightGBM model to track parameters, log the model to MLflow, convert it to ONNX, optimize it, and deploy it as a REST API. Future work includes improving Flock's data governance, generalizing its auto-tracking capabilities, and integrating with other systems such as SQL and Spark for end-to-end pipeline provenance.
2. Agenda
Chapter 1:
• GSL’s vantage point
• Why are we building (yet) another Data Science platform?
• Flock platform
• A technology showcase demo
Chapter 2:
• EGML applications in Microsoft: MLflow + ONNX + SQL Server
• Capturing provenance
• Future Work
3. GSL’s vantage point
Applied research lab, part of the Office of the CTO, Azure Data
• 640 patents and features GAed or in Public Preview just this year
• 0.5M LoC in OSS
• 130+ publications in top-tier conferences/journals
• 1.1M LoC in products
• 600k servers running our code in Azure/Hydra
5. Systems comparison
Feature matrix comparing data science platforms (each cell rated Good Support / OK Support / No Support / Unknown) along three dimensions:
• Training: Experiment Tracking, Managed Notebooks, Pipelines / Projects, Multi-Framework, Proprietary Algos, Distributed Training, AutoML
• Serving: Batch Prediction, On-prem Deployment, Model Monitoring, Model Validation
• Data Management: Data Provenance, Data Testing, Feature Store, Featurization DSL, Labelling
6. Insights
– Data Science is all about data :)
– There is an emerging class of applications: Enterprise Grade Machine Learning (EGML – CIDR'20)
  • A dichotomy of “smarts” built on rudimentary process
  • Challenging on account of the dual nature of models – software & data
– A couple of key pillars to enable EGML:
  • Tools for automating the DS lifecycle
    – Only O(100) of the 1M+ IPython notebooks analyzed on GitHub import mlflow
    – O(1000) use sklearn pipelines
  • Data Governance
  • (Unified data access)
7. Flock: Data-driven development
[Diagram: the Flock loop between offline data-driven development and online solution deployment. Offline: model training (LightGBM) with automatic tracking, model transform to a NN, conversion to ONNX, and ONNX optimization (ONNX'). Online: the application emits job telemetry (keyed by job id); deployment of the optimized ONNX' model plus policies (via Dhalion) close/update incidents.]
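The offline path of this loop is, in spirit, what Flock's instrumentation generates from plain training code. Below is a minimal hand-written sketch (not Flock's generated code): the data, hyperparameters, and the choice of onnxmltools as the LightGBM-to-ONNX converter are assumptions.

import lightgbm as lgb
import mlflow
import mlflow.onnx
import numpy as np
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

X = np.random.rand(500, 10).astype(np.float32)  # placeholder training data
y = np.random.rand(500)

with mlflow.start_run():
    params = {"n_estimators": 100, "num_leaves": 31}
    model = lgb.LGBMRegressor(**params).fit(X, y)
    mlflow.log_params(params)  # experiment tracking

    # Convert the trained LightGBM model to ONNX and log it to MLflow.
    onnx_model = onnxmltools.convert_lightgbm(
        model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
    )
    mlflow.onnx.log_model(onnx_model, "model")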
9. Griffon: why is my job slow today?
Current OnCall Workflow:
• A job goes out of SLA and Support is alerted.
• A support engineer (SE) spends hours of manual labor looking through hundreds of metrics.
• After 5-6 hours of investigation, the reason for the job slowdown is found.
Revised OnCall Workflow with Griffon:
• A job goes out of SLA and the SE is alerted.
• The job ID is fed through Griffon and the top reasons for job slowdown are generated automatically.
• The reason is found in the top five generated by Griffon.
• All the metrics Griffon has looked at can be ruled out, and the SE can direct their efforts to a smaller set of metrics.
ACM SoCC 2019
11. EGML applications in Microsoft
[Diagram: end-to-end pipeline, steps 1-4.]
1. Model training: TensorFlow, Spark, PyTorch, Scikit-learn, Keras, H2O, ...
2. Model generation: conversion to ONNX; MLflow tracks the runs (parameters, code versions, metrics, output files) and visualizes the output.
3. MLflow Model (ONNX flavor), versioned from Model.v1 to Model.vn, with SQL Server as the artifact/backend store.
4. Serving: SQL Server.
12. ONNX: Interoperability across ML frameworks
Open format to represent ML models
Backed by Microsoft, Amazon, Facebook, and several hardware vendors
13. ONNX exchange format
• Open format
• Enables interoperability across frameworks
• Many supported frameworks to import/export
– Caffe2, PyTorch, CNTK, MXNet, TensorFlow, CoreML
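As one example of the export path, a minimal PyTorch-to-ONNX sketch; the tiny model, input size, and file name are illustrative placeholders.

import torch
import torch.nn as nn

# A small example model; any traceable PyTorch module works the same way.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
dummy_input = torch.randn(1, 10)  # example input used to trace the graph
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])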
14. ONNX Runtime
• Cross-platform, high-performance scoring engine for ONNX models
• Open-sourced at the end of 2018
• ONNX Runtime is used in millions of Windows devices and powers core
models across Office, Bing, and Azure
Workflow: train a model using a popular framework such as TensorFlow; convert the model to ONNX format; perform inference efficiently across multiple platforms and hardware using ONNX Runtime.
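As a concrete illustration of the last step, a minimal scoring sketch with the onnxruntime Python package; the file name "model.onnx", the input shape, and the choice of CPU execution provider are assumptions (input names depend on how the model was exported).

import numpy as np
import onnxruntime as ort

# Load an exported ONNX model and run it; other execution providers
# (e.g., CUDA) can be selected the same way.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
batch = np.random.rand(1, 10).astype(np.float32)  # placeholder input
outputs = sess.run(None, {input_name: batch})
print(outputs[0])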
15. ONNX Runtime and optimizations
Key design points:
• Graph IR
• Support for multiple backends (e.g., CPU, GPU, FPGA)
• Graph optimizations
  – Rule-based optimizer inspired by DB optimizers
  – Improved inference time and memory consumption
  – Examples: 117 ms → 34 ms; 250 MB → 200 MB
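These graph optimizations are exposed through ONNX Runtime's session options; a minimal sketch (the file names are placeholders):

import onnxruntime as ort

so = ort.SessionOptions()
# Apply all rule-based graph rewrites (constant folding, node fusions, ...).
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally persist the optimized graph for inspection or faster reload.
so.optimized_model_filepath = "model.optimized.onnx"
sess = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])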
17. ONNX Runtime in production
Office – Grammar Checking Model
14.6x reduction in latency
18. MLflow + ONNX
• MLflow (1.0.0) now has built-in support for ONNX models
• ONNX model flavor for saving, loading, and evaluating ONNX models
Example: train a sklearn model and log it through the ONNX flavor (see the sketch below)
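A minimal sketch of what such code might look like; the ElasticNet model, the skl2onnx converter, and the random data are assumptions, not the slide's exact code.

import mlflow
import mlflow.onnx
import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.linear_model import ElasticNet

X = np.random.rand(200, 11).astype(np.float32)  # placeholder features
y = np.random.rand(200)                         # placeholder target

with mlflow.start_run():
    sk_model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
    mlflow.log_param("alpha", 0.5)
    mlflow.log_param("l1_ratio", 0.5)
    # Convert the fitted model to ONNX and log it via the ONNX flavor.
    onnx_model = convert_sklearn(
        sk_model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
    )
    mlflow.onnx.log_model(onnx_model, "model")

The logged model can then be loaded back with mlflow.pyfunc.load_model and evaluated; MLflow scores ONNX-flavor models with ONNX Runtime, which is what the serving step on the next slide relies on.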
19. Serving the ONNX model
Deploy the server:
mlflow models serve -m /artifacts/model -p 1234
Perform inference (ONNX Runtime is automatically invoked):
curl -X POST -H "Content-Type: application/json; format=pandas-split" --data '{"columns": ["alcohol", "chlorides", "citric acid", "density", "fixed acidity", "free sulfur dioxide", "pH", "residual sugar", "sulphates", "total sulfur dioxide", "volatile acidity"], "data": [[12.8, 0.029, 0.48, 0.98, 6.2, 29, 3.33, 1.2, 0.39, 75, 0.66]]}' http://127.0.0.1:1234/invocations
Output: [6.379428821398614]
20. MLflow + SQL Server
• MLflow can use SQL Server as an artifact store (and other
RDBMSs as well) (PR)
• The models are stored in binary format in the database along with
other metadata such as model name, size, run_id, etc.
from mlflow.tracking import MlflowClient
import mlflow
import mlflow.onnx

client = MlflowClient()
exp_name = "test"
# Store run artifacts (including the logged ONNX model) in SQL Server.
client.create_experiment(exp_name, artifact_location="mssql+pyodbc://sa:password@ipAddress:port/dbName?driver=ODBC+Driver+17+for+SQL+Server")
mlflow.set_experiment(exp_name)
mlflow.onnx.log_model(onnx_model, "model")  # onnx_model: the ONNX model produced earlier
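SQL Server (or another RDBMS) can also act as the backend store for run metadata by pointing the tracking URI at the same kind of connection string; a minimal sketch reusing the placeholder credentials above:

import mlflow

# Record runs, parameters, and metrics in the database as well.
mlflow.set_tracking_uri("mssql+pyodbc://sa:password@ipAddress:port/dbName?driver=ODBC+Driver+17+for+SQL+Server")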
21. Provenance in EGML applications
• Need for end-to-end provenance tracking
• Multiple systems involved in each pipeline
  – e.g., data pre-processing in SQL, then model training in a Python script
• Why it matters: compliance, and keeping ML models up-to-date
22. Tracking provenance in Python scripts
Pipeline: Python script → Python AST generation → dependencies between variables and functions → semantic annotation through a knowledge base of common ML libraries.
• Automatically identify models, metrics, and hyperparameters in Python scripts
• Answer questions such as: “Which columns in a dataset were used for model training?”

Dataset     #Scripts   %ML models covered   %Training datasets covered
Kaggle      49         95%                  61%
Microsoft   37         100%                 100%
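A minimal sketch of the AST-based idea (not Flock's actual implementation): walk a script's AST and flag calls that look like model construction and training, using a tiny hand-written knowledge base. The KNOWN_MODELS set and the example source string are illustrative; ast.unparse requires Python 3.9+.

import ast

KNOWN_MODELS = {"LogisticRegression", "RandomForestClassifier", "LGBMClassifier"}

class ModelFinder(ast.NodeVisitor):
    def __init__(self):
        self.models = []     # (estimator class, keyword hyperparameters)
        self.fit_calls = []  # variables on which .fit() is invoked

    def visit_Call(self, node):
        # Constructor of a known estimator, e.g. LGBMClassifier(n_estimators=100)
        if isinstance(node.func, ast.Name) and node.func.id in KNOWN_MODELS:
            params = {kw.arg: ast.unparse(kw.value) for kw in node.keywords}
            self.models.append((node.func.id, params))
        # Training call, e.g. clf.fit(X_train, y_train)
        if isinstance(node.func, ast.Attribute) and node.func.attr == "fit":
            self.fit_calls.append(ast.unparse(node.func.value))
        self.generic_visit(node)

source = (
    "from lightgbm import LGBMClassifier\n"
    "clf = LGBMClassifier(n_estimators=100)\n"
    "clf.fit(X_train, y_train)\n"
)
finder = ModelFinder()
finder.visit(ast.parse(source))
print(finder.models)     # [('LGBMClassifier', {'n_estimators': '100'})]
print(finder.fit_calls)  # ['clf']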
23. Future work
• MLflow:
  – Integration with metadata management systems such as Apache Atlas
• Flock:
  – Data Governance
  – Generalize and extend coverage of auto-tracking and ML → NN conversion
  – Provenance of end-to-end pipelines, combined with other systems (e.g., SQL, Spark)