This document discusses platforms for democratizing data science and enabling enterprise-grade machine learning (EGML) applications. It introduces Flock, a platform that aims to automate the machine learning lifecycle, including tracking experiments, managing models, and deploying models to production. It demonstrates Flock by instrumenting Python code for a LightGBM model to track parameters, log the model to MLflow, convert it to ONNX, optimize it, and deploy it as a REST API. Future work includes improving Flock's data governance, generalizing its auto-tracking capabilities, and integrating with other systems such as SQL and Spark for end-to-end pipeline provenance.
2. Agenda
Chapter 1:
• GSL’s vantage point
• Why are we building (yet) another Data Science platform?
• Flock platform
• A technology showcase demo
Chapter 2:
• EGML applications in Microsoft: MLflow + ONNX + SQL Server
• Capturing provenance
• Future Work
3. GSL’s vantage point
Applied research lab, part of the Office of the CTO, Azure Data
• 640 patents and features GAed or in Public Preview just this year
• 0.5M LoC in OSS
• 130+ publications in top-tier conferences/journals
• 1.1M LoC in products
• 600k servers running our code in Azure/Hydra
5. Systems comparison
Feature matrix comparing data science platforms (each cell rated Good Support / OK Support / No Support / Unknown) along three dimensions:
• Training: Experiment Tracking, Managed Notebooks, Pipelines / Projects, Multi-Framework, Proprietary Algos, Distributed Training, AutoML
• Serving: Batch Prediction, On-prem Deployment, Model Monitoring, Model Validation
• Data Management: Data Provenance, Data Testing, Feature Store, Featurization DSL, Labelling
6. Insights
– Data Science is all about data :)
– There is an emerging class of applications: Enterprise Grade Machine Learning (EGML – CIDR'20)
  • A dichotomy of “smarts” built on rudimentary process
  • Challenging on account of the dual nature of models – software & data
– A couple of key pillars to enable EGML:
  • Tools for automating the DS lifecycle
    – Only O(100) of the 1M+ IPython notebooks analyzed on GitHub import mlflow
    – O(1000) use sklearn pipelines
  • Data Governance
  • (Unified data access)
7. Flock: Data-driven development
[Diagram: the Flock loop between offline data-driven development and online solution deployment. Offline: model training (LightGBM) with automatic tracking, model transform to a NN, conversion to ONNX, and ONNX optimization (ONNX'). Online: the application emits job telemetry (keyed by job id); deployment of the optimized ONNX' model plus policies (via Dhalion) close/update incidents.]
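The offline path of this loop is, in spirit, what Flock's instrumentation generates from plain training code. Below is a minimal hand-written sketch (not Flock's generated code): the data, hyperparameters, and the choice of onnxmltools as the LightGBM-to-ONNX converter are assumptions.

import lightgbm as lgb
import mlflow
import mlflow.onnx
import numpy as np
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

X = np.random.rand(500, 10).astype(np.float32)  # placeholder training data
y = np.random.rand(500)

with mlflow.start_run():
    params = {"n_estimators": 100, "num_leaves": 31}
    model = lgb.LGBMRegressor(**params).fit(X, y)
    mlflow.log_params(params)  # experiment tracking

    # Convert the trained LightGBM model to ONNX and log it to MLflow.
    onnx_model = onnxmltools.convert_lightgbm(
        model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
    )
    mlflow.onnx.log_model(onnx_model, "model")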
9. Griffon: why is my job slow today?
Current OnCall Workflow:
• A job goes out of SLA and Support is alerted.
• A support engineer (SE) spends hours of manual labor looking through hundreds of metrics.
• After 5-6 hours of investigation, the reason for the job slowdown is found.
Revised OnCall Workflow with Griffon:
• A job goes out of SLA and the SE is alerted.
• The job ID is fed through Griffon and the top reasons for job slowdown are generated automatically.
• The reason is found in the top five generated by Griffon.
• All the metrics Griffon has looked at can be ruled out, and the SE can direct their efforts to a smaller set of metrics.
ACM SoCC 2019
11. EGML applications in Microsoft
[Diagram: end-to-end pipeline, steps 1-4.]
1. Model training: TensorFlow, Spark, PyTorch, Scikit-learn, Keras, H2O, ...
2. Model generation: conversion to ONNX; MLflow tracks the runs (parameters, code versions, metrics, output files) and visualizes the output.
3. MLflow Model (ONNX flavor), versioned from Model.v1 to Model.vn, with SQL Server as the artifact/backend store.
4. Serving: SQL Server.
12. ONNX: Interoperability across ML frameworks
Open format to represent ML models
Backed by Microsoft, Amazon, Facebook, and several hardware vendors
13. ONNX exchange format
• Open format
• Enables interoperability across frameworks
• Many supported frameworks to import/export
– Caffe2, PyTorch, CNTK, MXNet, TensorFlow, CoreML
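As one example of the export path, a minimal PyTorch-to-ONNX sketch; the tiny model, input size, and file name are illustrative placeholders.

import torch
import torch.nn as nn

# A small example model; any traceable PyTorch module works the same way.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
dummy_input = torch.randn(1, 10)  # example input used to trace the graph
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])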
14. ONNX Runtime
• Cross-platform, high-performance scoring engine for ONNX models
• Open-sourced at the end of 2018
• ONNX Runtime is used in millions of Windows devices and powers core
models across Office, Bing, and Azure
Workflow: train a model using a popular framework such as TensorFlow; convert the model to ONNX format; perform inference efficiently across multiple platforms and hardware using ONNX Runtime.
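As a concrete illustration of the last step, a minimal scoring sketch with the onnxruntime Python package; the file name "model.onnx", the input shape, and the choice of CPU execution provider are assumptions (input names depend on how the model was exported).

import numpy as np
import onnxruntime as ort

# Load an exported ONNX model and run it; other execution providers
# (e.g., CUDA) can be selected the same way.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
batch = np.random.rand(1, 10).astype(np.float32)  # placeholder input
outputs = sess.run(None, {input_name: batch})
print(outputs[0])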
15. ONNX Runtime and optimizations
Key design points:
• Graph IR
• Support for multiple backends (e.g., CPU, GPU, FPGA)
• Graph optimizations
  – Rule-based optimizer inspired by DB optimizers
  – Improved inference time and memory consumption
  – Examples: 117 ms → 34 ms; 250 MB → 200 MB
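These graph optimizations are exposed through ONNX Runtime's session options; a minimal sketch (the file names are placeholders):

import onnxruntime as ort

so = ort.SessionOptions()
# Apply all rule-based graph rewrites (constant folding, node fusions, ...).
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally persist the optimized graph for inspection or faster reload.
so.optimized_model_filepath = "model.optimized.onnx"
sess = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])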
17. ONNX Runtime in production
Office – Grammar Checking Model
14.6x reduction in latency
18. MLflow + ONNX
• MLflow (1.0.0) now has built-in support for ONNX models
• ONNX model flavor for saving, loading, and evaluating ONNX models
Example: train a sklearn model and log it through the ONNX flavor (see the sketch below)
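A minimal sketch of what such code might look like; the ElasticNet model, the skl2onnx converter, and the random data are assumptions, not the slide's exact code.

import mlflow
import mlflow.onnx
import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.linear_model import ElasticNet

X = np.random.rand(200, 11).astype(np.float32)  # placeholder features
y = np.random.rand(200)                         # placeholder target

with mlflow.start_run():
    sk_model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
    mlflow.log_param("alpha", 0.5)
    mlflow.log_param("l1_ratio", 0.5)
    # Convert the fitted model to ONNX and log it via the ONNX flavor.
    onnx_model = convert_sklearn(
        sk_model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
    )
    mlflow.onnx.log_model(onnx_model, "model")

The logged model can then be loaded back with mlflow.pyfunc.load_model and evaluated; MLflow scores ONNX-flavor models with ONNX Runtime, which is what the serving step on the next slide relies on.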
19. Serving the ONNX model
Deploy the server:
mlflow models serve -m /artifacts/model -p 1234
Perform inference (ONNX Runtime is automatically invoked):
curl -X POST -H "Content-Type: application/json; format=pandas-split" --data '{"columns": ["alcohol", "chlorides", "citric acid", "density", "fixed acidity", "free sulfur dioxide", "pH", "residual sugar", "sulphates", "total sulfur dioxide", "volatile acidity"], "data": [[12.8, 0.029, 0.48, 0.98, 6.2, 29, 3.33, 1.2, 0.39, 75, 0.66]]}' http://127.0.0.1:1234/invocations
Output: [6.379428821398614]
20. MLflow + SQL Server
• MLflow can use SQL Server as an artifact store (and other
RDBMSs as well) (PR)
• The models are stored in binary format in the database along with
other metadata such as model name, size, run_id, etc.
from mlflow.tracking import MlflowClient
import mlflow
import mlflow.onnx

client = MlflowClient()
exp_name = "test"
# Store run artifacts (including the logged ONNX model) in SQL Server.
client.create_experiment(exp_name, artifact_location="mssql+pyodbc://sa:password@ipAddress:port/dbName?driver=ODBC+Driver+17+for+SQL+Server")
mlflow.set_experiment(exp_name)
mlflow.onnx.log_model(onnx_model, "model")  # onnx_model: the ONNX model produced earlier
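SQL Server (or another RDBMS) can also act as the backend store for run metadata by pointing the tracking URI at the same kind of connection string; a minimal sketch reusing the placeholder credentials above:

import mlflow

# Record runs, parameters, and metrics in the database as well.
mlflow.set_tracking_uri("mssql+pyodbc://sa:password@ipAddress:port/dbName?driver=ODBC+Driver+17+for+SQL+Server")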
21. Provenance in EGML applications
• Need for end-to-end provenance tracking
• Multiple systems involved in each pipeline
  – e.g., data pre-processing in SQL, then model training in a Python script
• Why it matters: compliance, and keeping ML models up-to-date
22. Tracking provenance in Python scripts
Pipeline: Python script → Python AST generation → dependencies between variables and functions → semantic annotation through a knowledge base of common ML libraries.
• Automatically identify models, metrics, and hyperparameters in Python scripts
• Answer questions such as: “Which columns in a dataset were used for model training?”

Dataset     #Scripts   %ML models covered   %Training datasets covered
Kaggle      49         95%                  61%
Microsoft   37         100%                 100%
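A minimal sketch of the AST-based idea (not Flock's actual implementation): walk a script's AST and flag calls that look like model construction and training, using a tiny hand-written knowledge base. The KNOWN_MODELS set and the example source string are illustrative; ast.unparse requires Python 3.9+.

import ast

KNOWN_MODELS = {"LogisticRegression", "RandomForestClassifier", "LGBMClassifier"}

class ModelFinder(ast.NodeVisitor):
    def __init__(self):
        self.models = []     # (estimator class, keyword hyperparameters)
        self.fit_calls = []  # variables on which .fit() is invoked

    def visit_Call(self, node):
        # Constructor of a known estimator, e.g. LGBMClassifier(n_estimators=100)
        if isinstance(node.func, ast.Name) and node.func.id in KNOWN_MODELS:
            params = {kw.arg: ast.unparse(kw.value) for kw in node.keywords}
            self.models.append((node.func.id, params))
        # Training call, e.g. clf.fit(X_train, y_train)
        if isinstance(node.func, ast.Attribute) and node.func.attr == "fit":
            self.fit_calls.append(ast.unparse(node.func.value))
        self.generic_visit(node)

source = (
    "from lightgbm import LGBMClassifier\n"
    "clf = LGBMClassifier(n_estimators=100)\n"
    "clf.fit(X_train, y_train)\n"
)
finder = ModelFinder()
finder.visit(ast.parse(source))
print(finder.models)     # [('LGBMClassifier', {'n_estimators': '100'})]
print(finder.fit_calls)  # ['clf']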
23. Future work
• MLflow:
  – Integration with metadata management systems such as Apache Atlas
• Flock:
  – Data Governance
  – Generalize and extend coverage of auto-tracking and ML → NN conversion
  – Provenance of end-to-end pipelines, combined with other systems (e.g., SQL, Spark)