Production machine learning:
Managing models, workflows &
risk at scale
Alex Housley
Founder & CEO, Seldon
@ahousley @seldon_io
#CogX2021
The unbundling of ML platforms
1. Tech giants build DIY ML platforms
from scratch to gain competitive
advantage, e.g. Michelangelo,
FBLearner, TFX.
2. Specialised tools emerge to solve
MLOps challenges - e.g. version
control, feature stores, CI/CD,
monitoring.
3. Cloud-native driving hybrid/multi-
cloud adoption: more control,
reduced vendor lock-in.
16/06/2021
Hidden Technical Debt in Machine Learning Systems.
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young,
Jean-Francois Crespo, Dan Dennison Google, NIPS 2015 Conference
[Figure: ML code shown as a small part of a production ML system, surrounded by Configuration, Data Collection, Feature Extraction, Data Verification, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure and Monitoring.]
AI adoption is accelerating in the enterprise
AI Adoption in the Enterprise 2021 – O’Reilly (oreilly.com)
5,154 global respondents
How Tech Stacks Up in B2B - Andreessen Horowitz (a16z.com)
Survey of technology leaders at Fortune 500, Global 2000, and
SaaS 50 companies.
Data scientists and DevOps must collaborate to
productionise models
Teams are siloed, whether in offices or working remotely. Deploying a model takes anywhere from 1 week to 6 months.
ML Engineer
$141k average base salary
Production ML has a larger surface area than data
prep and training
The nine stages of the machine learning workflow (Amershi, IEEE 2019)
[Figure: the workflow stages grouped into “Day 0”, “Day 1” and “Day 2”, with the roles involved (Data Scientists, Data Engineers, ML Engineers, DevOps, Product / Mgmt) and the label “More Metadata”.]
Scaling MLOps across the organisation
Team
– 10 users
– < 50 models
– One training system
– Minimal or team-level restrictions
Business Unit
– 50 users
– < 200 models
– 3-4 training systems,
multiple frameworks
– Large DevOps team
– Dept level
constraints
– Role based access
Organisation
– >100 users
– Hundreds or
thousands of models
– Multiple platforms
and clouds
– Org blueprints
– Compliance
– Higher-level principles (AI ethics)
Production ML challenges
Orchestration
Monitoring
Explainability
Governance
…at Scale
Orchestration at Scale
Capital One created a ‘Model as a Service’ platform powered by Seldon
Case study: Capital One ‘Model as a Service’
Objectives
Improve the speed-to-market for ML
models
Lower the barrier to entry for developers
to get their models into production
Implement operational
efficiencies and economies of
scale
“With our MaaS platform running on
Seldon, we’ve gone from it taking
months to minutes to deploy or
update models.”
Steve Evangelista
Director of Project Management, Capital One
Results
– MVP in less than 90 days
– Deployment process now takes minutes instead of months
– Versioning, vulnerability scanning, containerizing, deployment, testing and promoting to production are all taken care of
– Use cases across the business including fraud, marketing, finance and customer service
– Rigorous compliance through model management and monitoring
– Developers could work in any language/framework
Why not just wrap my models with Flask?
Flask works well in R&D until you need:
– Multiple optimized model servers
– Metrics and tracing
– Lineage and auditability
– Ingress configuration
– Complex inference graphs
(ensembles, AB tests, MABs, etc)
– A scalable solution that is battle-tested
by a wide community of open-source
and commercial users
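The "complex inference graphs" item can be made concrete with a small, framework-free sketch. This is not Seldon's API: `ensemble_average`, `ABRouter` and the two toy models are hypothetical names used only to illustrate an ensemble node and an A/B test node.

```python
import random
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]  # a model maps features to a score


def ensemble_average(models: List[Model]) -> Model:
    """Combine several models into one by averaging their scores."""
    def combined(features: Sequence[float]) -> float:
        return sum(m(features) for m in models) / len(models)
    return combined


class ABRouter:
    """Hypothetical A/B node: send each request to A or B at random."""

    def __init__(self, model_a: Model, model_b: Model, share_b: float, seed: int = 0):
        if not 0.0 <= share_b <= 1.0:
            raise ValueError("share_b must be in [0, 1]")
        self.model_a, self.model_b, self.share_b = model_a, model_b, share_b
        self.rng = random.Random(seed)
        self.counts = {"a": 0, "b": 0}  # how many requests each variant served

    def predict(self, features: Sequence[float]) -> float:
        use_b = self.rng.random() < self.share_b
        self.counts["b" if use_b else "a"] += 1
        return (self.model_b if use_b else self.model_a)(features)


# Two toy "models" standing in for real model servers.
def mean_model(x: Sequence[float]) -> float:
    return sum(x) / len(x)


def max_model(x: Sequence[float]) -> float:
    return max(x)


ensemble = ensemble_average([mean_model, max_model])
router = ABRouter(mean_model, ensemble, share_b=0.5)
scores = [router.predict([1.0, 3.0]) for _ in range(1000)]
```

Even this toy version needs its own routing state and counters, which is the kind of plumbing a hand-rolled Flask wrapper would have to grow for every model.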
Model serving: to achieve scale, you need to abstract
complex ML concepts into standardised infra components
Adversarial Detector: is the model being attacked?
Leverage pre-packaged servers for framework-agnostic
model serving
– Leverage out-of-the-box optimized
servers that wrap your model artifacts
– Enable data scientists to deploy models
from their preferred framework
– Model servers are tuned to each
framework for optimal performance
– Extend existing pre-packaged servers
with simple SDKs
[Diagram labels: Central Repository (S3, ModelDB, ...); Model; Reusable Server; Image Registry]
The anatomy of e2e enterprise MLOps architectures
Canary Tests, Shadows & Rolling Updates
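[Diagram: deployment lifecycle. Create (100% of traffic), Canary (90% / 10% split), Promote (100%), Revert, Remove models. Each deployment carries resource requests/limits and an autoscaling spec.]
Why does this matter?
More robust and safer testing in
production with zero
downtime to minimise risk.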
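The 90% / 10% canary split with promote and revert steps can be sketched as a weighted router. This is a minimal illustration in plain Python, not how Seldon or any ingress implements traffic splitting; `CanaryRollout` and its methods are hypothetical names.

```python
import random
from dataclasses import dataclass, field


@dataclass
class CanaryRollout:
    """Hypothetical traffic splitter between a stable and a candidate model."""
    canary_weight: float = 0.10  # fraction of traffic sent to the candidate
    rng: random.Random = field(default_factory=lambda: random.Random(42))

    def route(self) -> str:
        """Pick which deployment serves the next request."""
        return "canary" if self.rng.random() < self.canary_weight else "stable"

    def promote(self) -> None:
        """Send all traffic to the candidate once it has proven itself."""
        self.canary_weight = 1.0

    def revert(self) -> None:
        """Send all traffic back to the stable model."""
        self.canary_weight = 0.0


rollout = CanaryRollout()
sample = [rollout.route() for _ in range(10_000)]
canary_share = sample.count("canary") / len(sample)  # close to 0.10

rollout.promote()
after_promote = {rollout.route() for _ in range(100)}

rollout.revert()
after_revert = {rollout.route() for _ in range(100)}
```

Because the split is only a weight, promoting or reverting is a configuration change rather than a redeployment, which is what makes zero-downtime testing possible.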
Tempo: open-source MLOps SDK for data scientists
https://github.com/SeldonIO/tempo
Powerful Inference
Orchestration Logic
Pluggable Runtimes
Custom Python
Components
● Create custom business logic for
models.
● Use any python expressions/libraries to
orchestrate component requests.
Data Science
Friendly
● Allow any data science library to be used
easily. E.g.,
○ Custom Models
○ Alibi Explainers
○ Alibi-Detect Outlier Models
○ Multi-Armed Bandits
● Local testing before hand-off to
production
● Python first with output to YAML
● Extendable runtimes.
○ Seldon Deploy
○ Seldon Core
○ Docker with Seldon Containers
○ KFServing
Monitoring at Scale
Case Study: Microsoft & Philips Clinical Drift
Monitoring During Covid-19
• ICUs having to make difficult decisions to
optimize patient health outcomes.
• Built models to predict outcomes such as
patient mortality, length of ventilation,
length of stay.
• Challenges: catching changes to model
performance; time-intensive and
computationally expensive training pipeline.
• Solution needed to be scalable, repeatable
and secure: Azure Databricks, Azure
DevOps and Alibi Detect.
Making your organisation proactive rather than
reactive
– Service metrics
– Statistical performance
– Drift and outliers
– Explainability
Service Metrics
– Microservice metrics such as requests
per second, latency, CPU usage,
memory usage, etc
– Performance monitoring leveraging
Prometheus and ELK
– Seldon Deploy configures metrics with
Prometheus
[Diagram: from model weights to a production microservice. Model A is served behind an API (REST, gRPC, Kafka) with request logs, tracing and model metrics.]
Why does this matter?
Manage compute costs
and response times
associated with SLAs.
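As a rough illustration of the service metrics above, the sketch below derives requests per second and latency percentiles from a request log. In practice these would come from Prometheus; the log format and function names here are assumptions made for the example.

```python
import math
from typing import List, Tuple

# Hypothetical request log: (timestamp in seconds, latency in milliseconds).
RequestLog = List[Tuple[float, float]]


def requests_per_second(log: RequestLog) -> float:
    """Average request rate over the time span covered by the log."""
    timestamps = [t for t, _ in log]
    span = max(timestamps) - min(timestamps)
    return len(log) / span if span > 0 else float(len(log))


def latency_percentile(log: RequestLog, pct: int) -> float:
    """Nearest-rank percentile of latency, e.g. pct=95 for p95."""
    latencies = sorted(lat for _, lat in log)
    rank = max(1, math.ceil(pct * len(latencies) / 100))
    return latencies[rank - 1]


# 100 requests spread over roughly 10 seconds, with latencies of 1..100 ms.
log = [(i * 0.1, float(i + 1)) for i in range(100)]
rps = requests_per_second(log)
p50 = latency_percentile(log, 50)
p95 = latency_percentile(log, 95)
```

Tail percentiles such as p95 are usually the figures written into SLAs, since an average hides the slow requests.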
Statistical Monitoring
– Monitor the impact on business KPIs
– Advanced metrics exposed directly by model
servers
– Metrics can be calculated using “feedback”
– Custom metrics can be added by extending
metrics servers
[Diagram: Model A sends inference data, routed via CloudEvents on Knative infrastructure, to be stored. A metrics server reads the stored inference data, receives feedback carrying the correct label, and produces statistical metrics.]
Why does this matter?
Understand and monitor
the impact on your
business KPIs.
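The feedback loop described above (store the prediction, receive the correct label later, update a metric) can be sketched in a few lines. `FeedbackMetrics` is a hypothetical in-memory stand-in for a metrics server, not Seldon's implementation.

```python
from typing import Dict, Hashable


class FeedbackMetrics:
    """Hypothetical metrics server: joins stored predictions with later feedback."""

    def __init__(self) -> None:
        self.predictions: Dict[str, Hashable] = {}
        self.correct = 0
        self.total = 0

    def log_prediction(self, request_id: str, predicted: Hashable) -> None:
        """Store the model's prediction so it can be scored later."""
        self.predictions[request_id] = predicted

    def send_feedback(self, request_id: str, true_label: Hashable) -> None:
        """Score a stored prediction once the correct label is known."""
        if request_id not in self.predictions:
            return  # unknown or already-scored request: ignore
        predicted = self.predictions.pop(request_id)
        self.total += 1
        self.correct += int(predicted == true_label)

    @property
    def accuracy(self) -> float:
        """Accuracy over all requests that have received feedback so far."""
        return self.correct / self.total if self.total else float("nan")


metrics = FeedbackMetrics()
metrics.log_prediction("r1", "fraud")
metrics.log_prediction("r2", "ok")
metrics.log_prediction("r3", "ok")      # no feedback yet, so not scored
metrics.send_feedback("r1", "fraud")    # correct
metrics.send_feedback("r2", "fraud")    # incorrect
```

Accuracy is only defined over requests that have received feedback, so the metric lags live traffic by however long labels take to arrive.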
Outlier Monitoring
– Detecting anomalies in data instances
and flagging/alerting
– Identifying potential metadata that could
help diagnose outliers
– Do outliers indicate there’s an issue with
the model or data?
– Outlier detection runs as a separate
component and can receive input and
prediction data from the model
[Diagram: Model A sends model input and inference data, routed via CloudEvents on Knative infrastructure, to an outlier detector server. Inference data and outlier data are stored, so each request is available together with its outlier data.]
Why does this matter?
Outliers are more likely to
have a negative impact if
acted upon automatically.
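To illustrate an outlier detector running as a separate component, here is a deliberately simple z-score detector. It is a stand-in chosen for brevity, not one of the Alibi Detect algorithms; the class name and threshold are assumptions.

```python
import statistics
from typing import List, Sequence


class ZScoreOutlierDetector:
    """Flags values far from the reference mean (a deliberately simple stand-in)."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold
        self.mean = 0.0
        self.std = 1.0

    def fit(self, reference: Sequence[float]) -> "ZScoreOutlierDetector":
        """Learn the mean and standard deviation of normal-looking data."""
        self.mean = statistics.fmean(reference)
        self.std = statistics.stdev(reference)
        if self.std == 0:
            raise ValueError("reference data has zero variance")
        return self

    def score(self, value: float) -> float:
        """Distance from the reference mean, in standard deviations."""
        return abs(value - self.mean) / self.std

    def predict(self, values: Sequence[float]) -> List[bool]:
        """True where a value exceeds the z-score threshold."""
        return [self.score(v) > self.threshold for v in values]


reference = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
detector = ZScoreOutlierDetector(threshold=3.0).fit(reference)
flags = detector.predict([10.1, 14.0])
```

A flagged request can then be held back for review rather than acted upon automatically.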
Drift
– Over time, the process generating live data
in production diverges from the process
that generated the training data.
– Model performance in production
no longer matches that
observed on held-out data during training.
– The goal is to identify drift in the data
distribution and in the relationship between
input and output.
Why does this matter?
Model performance has a
direct correlation with
business value or safety in
some use cases.
Challenges of online drift detection
– In production, data points arrive in
sequence – and we need to detect
drift ASAP
– So how do we decide whether
fluctuations are due to drift or to natural
variation?
– Statistical hypothesis testing
– Windowing strategies
Why does this matter?
Detecting drift at the right
time enables you to improve
performance and reduce
costs.
[Diagram: Model A sends model input and inference data, routed via CloudEvents on Knative infrastructure, to a drift detector server, which produces drift metrics.]
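Statistical hypothesis testing and windowing can be combined in a short sketch: a sliding window of live data is compared with a fixed reference sample using a two-sample Kolmogorov-Smirnov test. This is a standard-library illustration under assumed names (`WindowedDriftDetector`, `window_size`), not the Alibi Detect implementation, and it uses the large-sample critical value for the test.

```python
import math
import random
from bisect import bisect_right
from collections import deque
from typing import Deque, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


def ks_critical(n: int, m: int, alpha: float = 0.05) -> float:
    """Large-sample critical value for the two-sample KS test."""
    return math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))


class WindowedDriftDetector:
    """Compares a sliding window of live data with a fixed reference sample."""

    def __init__(self, reference: Sequence[float], window_size: int = 200) -> None:
        self.reference = list(reference)
        self.window_size = window_size
        self.window: Deque[float] = deque(maxlen=window_size)

    def update(self, value: float) -> bool:
        """Add one data point; return True if the full window looks drifted."""
        self.window.append(value)
        if len(self.window) < self.window_size:
            return False  # wait until the window is full
        d = ks_statistic(self.reference, self.window)
        return d > ks_critical(len(self.reference), len(self.window))


rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]
detector = WindowedDriftDetector(reference, window_size=200)

# Live data from the same distribution, then shifted by one standard deviation.
stable = [detector.update(rng.gauss(0.0, 1.0)) for _ in range(200)]
shifted = [detector.update(rng.gauss(1.0, 1.0)) for _ in range(200)]
```

Testing after every new point, as this sketch does, repeats the hypothesis test many times and so raises the false alarm rate above the nominal level, which is one reason the windowing strategy matters.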
Explainability at Scale
Case Study: Explainability for Insurance
Context
Explainability is a critical requirement for all production models.
Operations staff require models to be interpretable to justify algorithmic decisions.
Before Seldon Deploy
Advanced algorithms cannot be deployed to production due to a lack of interpretability.
After Seldon Deploy
Improvements to claims automation and payments processing can be realised as these
models can now be made interpretable.
ML models are a black box
Inputs:
● Credit applicant data
● Medical image
Outputs:
● Lending decision (yes/no)
● Medical diagnosis
Why explain machine learning models?
– Build trust in machine learning outputs
– Increase transparency
– Improve the customer experience
– Check for bias
– Gain insights for data scientists to
understand how models are working
– Avoid damage to business reputation
– Meet regulatory requirements
Why does this matter?
Lack of explainability is one
of the biggest blockers to
production ML and causes
of risk in organisations
Explaining model predictions
Types of explanations
– By scope (local vs global)
– By model type (black-box vs white-box)
– By task (classification, regression, structured prediction)
– By data type (tabular, images, text…)
– By insight (feature attributions, counterfactuals, influential training instances…)
Image credit: Scott Lundberg (https://github.com/slundberg/shap)
Image credit: Barshan et al., RelatIF: Identifying Explanatory
Training Examples via Relative Influence (2020)
How can we explain the black-box?
Anchors
Feature attribution: which subset of input features is sufficient for a prediction to hold? [1]
[1] Ribeiro et al., Anchors: High-Precision Model-Agnostic Explanations (2018)
Image source: Alibi Explain repository home page
How can we explain the black-box?
Counterfactuals
How can you (minimally) change input to obtain a desired prediction? [2, 3]
[2] Wachter et al., Counterfactual Explanations without Opening the Black Box (2017)
[3] Van Looveren A., Klaise J. Interpretable Counterfactual Explanations Guided by Prototypes (2018)
a) Images of digits minimally altered to
change a classifier’s prediction
b) A person’s attributes minimally altered
to change a classifier’s prediction (low
income to high income)
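For intuition, a counterfactual has a closed form when the model is a linear classifier: move the input along the weight vector until it just crosses the decision boundary. The sketch below shows only that special case. It is not the method of Wachter et al. or of Van Looveren and Klaise, which target black-box and non-linear models through optimisation; the weights and input are made-up values.

```python
from typing import List, Sequence


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Linear classifier: class 1 if w.x + b > 0, else class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return int(score > 0)


def linear_counterfactual(
    weights: Sequence[float], bias: float, x: Sequence[float], margin: float = 1e-3
) -> List[float]:
    """Near-minimal L2 change to x that flips a linear classifier's prediction.

    Moves x along the weight vector until its score sits `margin` past the
    decision boundary w.x + b = 0, on the opposite side from where it started.
    """
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    norm_sq = sum(w * w for w in weights)
    # Target score: just across the boundary from the current side.
    target = -margin if score > 0 else margin
    step = (target - score) / norm_sq
    return [xi + step * w for w, xi in zip(weights, x)]


# Made-up model and instance: score = 2*3 + 1*2 - 10 = -2, so class 0.
weights, bias = [2.0, 1.0], -10.0
x = [3.0, 2.0]
x_cf = linear_counterfactual(weights, bias, x)
original_class = predict(weights, bias, x)
counterfactual_class = predict(weights, bias, x_cf)
```

The difference between `x_cf` and `x` is the explanation: it says how much each feature would need to change for the decision to flip.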
Which explainer should I use?
Familiar API in the style of scikit-learn
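from alibi.explainers import AnchorTabular

# predict_fn, feature_names, X_train and x are assumed to be defined elsewhere
explainer = AnchorTabular(predict_fn, feature_names)
explainer.fit(X_train)  # fit on the training data
explanation = explainer.explain(x)  # explain a single instance x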
>>> explanation.meta
{'name': 'AnchorTabular', 'type': ['blackbox'], 'explanations': ['local'],
'params': {'seed': None, 'disc_perc': (25, 50, 75), 'threshold': 0.95}}
>>> explanation.data
{'anchor': ['petal width (cm) > 1.80', 'sepal width (cm) <= 2.80'],
'precision': 0.98, 'coverage': 0.32}
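The `precision` and `coverage` fields in the output above can be read as follows: coverage is the share of data points the anchor rule applies to, and precision is the share of those points that receive the same prediction as the explained instance. The sketch below estimates both on a fixed sample with a made-up classifier and data. This is a simplification, since the Anchors paper defines precision over perturbed samples, and the function name is hypothetical.

```python
from typing import Callable, Sequence, Tuple

Row = Sequence[float]


def anchor_precision_coverage(
    anchor: Callable[[Row], bool],
    predict_fn: Callable[[Row], int],
    instance: Row,
    samples: Sequence[Row],
) -> Tuple[float, float]:
    """Estimate precision and coverage of an anchor rule on a sample of rows.

    coverage  = share of samples where the anchor rule applies
    precision = among those, share predicted the same as `instance`
    """
    target = predict_fn(instance)
    covered = [row for row in samples if anchor(row)]
    coverage = len(covered) / len(samples)
    if not covered:
        return float("nan"), coverage
    precision = sum(predict_fn(row) == target for row in covered) / len(covered)
    return precision, coverage


# Made-up rows of (petal width, sepal width) and a toy classifier.
def toy_predict(row: Row) -> int:
    return int(row[0] > 1.7)


def toy_anchor(row: Row) -> bool:
    return row[0] > 1.80 and row[1] <= 2.80


samples = [
    (2.0, 2.5), (1.9, 2.7), (2.1, 3.0), (1.0, 2.5), (1.5, 2.6),
    (1.85, 2.8), (0.5, 3.5), (1.75, 2.0), (2.2, 2.9), (1.2, 3.1),
]
precision, coverage = anchor_precision_coverage(
    toy_anchor, toy_predict, instance=(2.0, 2.5), samples=samples
)
```

A high-precision, low-coverage anchor is a reliable explanation that applies to only a narrow slice of the data.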
Deploying Alibi Explanations
Integration with Seldon Core, Seldon Deploy and KFServing
Explainability Monitoring
– Explanations are useful when paired
with a monitoring system. For
example, explain why an outlier may
have occurred.
– View model explanations on UI
– Trigger explanations for specific
requests on-demand
– Close integration with auditing
Governance at Scale
Critical infrastructure increasingly depends on ML
systems
The impact of a bad solution can be worse than no solution at all
Cybersecurity Attacks
Misuse of personal data
Software Outages
Algorithmic Bias
Range of varying strategies at a national level
Mapping Global AI Ethics
Harvard. 2020. Principled Artificial Intelligence. [ONLINE] Available at:
https://cyber.harvard.edu/publication/2020/principled-ai. [Accessed 21 October 2020].
EU AI Regulation
What does it mean?
– Emphasis on “trustworthy AI”
– Categorising risk. Regulating “high risk” AI
(e.g. autonomous driving) and prohibiting
uses (e.g. mass social scoring).
– Currently focuses more on end-to-end systems,
so it would apply to the platforms and applied AI
projects built within organisations.
– Post-market monitoring of AI systems to
evaluate the continued compliance with
regulation.
Timespan: expect 2 years given EU leaders want
it to be fast-tracked.
Principles for Trusted AI
The 8 LFAI Principles for Trusted AI (R)REPEATS
Reproducibility, Robustness, Equitability, Privacy, Explainability, Accountability, Transparency, Security
Adopted by Open Source Projects
Alignment between capabilities
and governance, compliance & AI ethics
[Diagram: each of the eight principles (Robustness, Privacy, Reproducibility, Equitability, Accountability, Explainability, Transparency, Security) mapped to platform capabilities: model metadata, request logging, language wrappers, OpenAPI schema APIs, pre-packaged servers, out-of-the-box Prometheus metrics, explainer components, metrics monitoring, RBAC via service account, historical feedback labelling, namespaced access, auth via Ingress, GitOps integration.]
Programmatic governance with open & closed source
as policy
1. Ethics frameworks, principles, guidelines (e.g. LF AI Principles)
2. Regulation, compliance, organisational policy (GDPR, ISO, etc.)
3. Open & closed source tools & frameworks
Ensuring principles by design which can map into higher level
organisational principles and policies
Model Metadata Store
[Diagram: metadata is extracted automatically from artifacts in the artifact store (model, explainer, drift detector, outlier detector) and, together with GitOps, feeds the Deploy metadata store, which can also connect to an external customer metadata store.]
– Discover: “find available models”
– Enrich: “add metadata to models”
– Lineage/Audit: “check model history”
Why does this matter?
Ensure proper governance,
auditing and discoverability of
models for better compliance
and risk management
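The three metadata store operations (discover, enrich, lineage/audit) can be sketched with an in-memory store. This is a hypothetical illustration, not the Seldon Deploy metadata API; the class names, fields and `s3://` URIs are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ModelRecord:
    name: str
    version: str
    artifact_uri: str
    tags: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)  # audit trail, oldest first


class MetadataStore:
    """Hypothetical in-memory store keyed by (name, version)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], ModelRecord] = {}

    def register(self, name: str, version: str, artifact_uri: str) -> ModelRecord:
        """Add a model version and start its audit trail."""
        record = ModelRecord(name, version, artifact_uri)
        record.history.append(f"registered {artifact_uri}")
        self._records[(name, version)] = record
        return record

    def enrich(self, name: str, version: str, **tags: str) -> None:
        """Enrich: add metadata to a model and record the change."""
        record = self._records[(name, version)]
        record.tags.update(tags)
        record.history.append(f"tagged {sorted(tags)}")

    def discover(self, **tags: str) -> List[ModelRecord]:
        """Discover: find models whose tags match all given key/value pairs."""
        return [
            r for r in self._records.values()
            if all(r.tags.get(k) == v for k, v in tags.items())
        ]

    def lineage(self, name: str, version: str) -> List[str]:
        """Lineage/Audit: return the recorded history of a model version."""
        return list(self._records[(name, version)].history)


store = MetadataStore()
store.register("fraud-clf", "v1", "s3://models/fraud-clf/v1")  # made-up URIs
store.register("fraud-clf", "v2", "s3://models/fraud-clf/v2")
store.enrich("fraud-clf", "v2", team="risk", stage="production")
production_models = store.discover(stage="production")
audit_trail = store.lineage("fraud-clf", "v2")
```

Recording every change as an append-only history entry is what makes the audit question ("check model history") answerable later.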
Reproducibility with GitOps
You can restore state
to previous versions
Final thoughts
– As practitioners, we have a growing
professional responsibility to our craft
– Democratisation through COSS and
cloud-native tools
– Engage your peers in discussions
about Responsible AI
– Map Trusted AI principles to your
roadmap requirements
Get access to production machine learning at scale
– Connect with us at #CogX2021
– Product demos at our virtual booth
– Free trials for delegates
Thank you!
Questions? Please use Q&A feature.
Contact hello@seldon.io
@seldon_io

Production machine learning: Managing models, workflows and risk at scale

  • 1. Production machine learning: Managing models, workflows & risk at scale Alex Housley Founder & CEO, Seldon @ahousley @seldon_io #CogX2021
  • 2. The unbundling of ML platforms 1. Tech giants build DIY ML platforms from scratch to gain competitive advangtage e.g. Michelangelo, FBLearner, TFX. 2. Specialised tools emerge to solve MLOps challenges - e.g. version control, feature stores, CI/CD, monitoring. 3. Cloud-native driving hybrid/multi- cloud adoption: more control, reduced vendor lock-in. 16/06/2021 #COGX2021 2 Hidden Technical Debt in Machine Learning Systems. D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison Google, NIPS 2015 Conference Analysis Tools Serving Infrastructure Monitoring Machine Resource Management Process Management Tools Analysis Tools Serving Infrastructure Monitoring Machine Resource Management Process Management Tools ML Code Data Verification Data Collection Feature Extraction Configuration
  • 3. AI adoption is accelerating in the enterprise 16/06/2021 3 AI Adoption in the Enterprise 2021 – O’Reilly (oreilly.com) 5,154 global respondents How Tech Stacks Up in B2B - Andreessen Horowitz (a16z.com) Survey of technology leaders at Fortune 500, Global 2000, and SaaS 50 companies. #COGX2021
  • 4. Data scientists and DevOps must collaborate to productionise models 16/06/2021 #COGX2021 4 Siloed teams in offices and working remote. From 1 week to 6 months to deploy a model. ML Engineer $141k average base salary
  • 5. Production ML has a larger surface area than data prep and training The nine stages of the machine learning workflow (Amershi, IEEE 2019) 5 More Metadata #COGX2021 Data Scientists Data Engineers ML Engineers “Day 0” “Day 1” “Day 2” DevOps Product / Mgmt
  • 6. Scaling MLOps across the organisation. Team: 10 users; < 50 models; one training system; minimal or team-level restrictions. Business Unit: 50 users; < 200 models; 3-4 training systems, multiple frameworks; large DevOps team; dept-level constraints; role-based access. Organisation: > 100 users; hundreds or thousands of models; multiple platforms and clouds; org blueprints; compliance; higher-level principles; AI ethics.
  • 9. Case study: Capital One created a ‘Model as a Service’ (MaaS) platform powered by Seldon. Objectives: improve the speed-to-market for ML models; lower the barrier to entry for developers to get their models into production; implement operational efficiencies and economies of scale. “With our MaaS platform running on Seldon, we’ve gone from it taking months to minutes to deploy or update models.” (Steve Evangelista, Director of Project Management, Capital One.) Results: MVP in less than 90 days; deployment process now takes minutes instead of months; versioning, vulnerability scanning, containerizing, deployment, testing and promoting to production are all taken care of; use cases across the business including fraud, marketing, finance and customer service; rigorous compliance through model management and monitoring; developers could work in any language/framework.
  • 10. Why not just wrap my models with Flask? Flask works well in R&D until you need: – Multiple optimized model servers – Metrics and tracing – Lineage and auditability – Ingress configuration – Complex inference graphs (ensembles, AB tests, MABs, etc) – Scalable solution that is battle-tested by wide community of open-source and commercial users 16/06/2021 ORCHESTRATION AT SCALE 11
  • 11. Model serving: to achieve scale, you need to abstract complex ML concepts into standardised infra components ORCHESTRATION AT SCALE 12 Adversarial Detector Is the model being attacked?
  • 12. Leverage pre-packaged servers for framework-agnostic model serving – Leverage out-of-the-box optimized servers that wrap your model artifacts – Enable data scientists to deploy models from their preferred framework – Model servers are optimized for each framework for optimal performance – Extend existing pre-packaged servers with simple SDKs. Diagram labels: Central Repository (S3, ModelDB,...) Model Reusable Server Image Registry
  • 13. The anatomy of e2e enterprise MLOps architectures 16/06/2021 ORCHESTRATION AT SCALE 14
  • 14. Canary Tests, Shadows & Rolling Updates. Why does this matter? Robust and safer testing in production with zero downtime to minimise risk. Diagram stages: Create (100%); Canary (90% / 10%); Promote (100%); Remove; Revert. Models are deployed with resource requests/limits and an autoscaling spec.
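The 90%/10% canary split on this slide can be illustrated with a small weighted router. This is a minimal sketch only: the stage names, model names and the `pick_model` helper are hypothetical, and a real deployment would normally configure traffic weights in the serving layer rather than in application code.

```python
import random
from collections import Counter

# Hypothetical rollout stages mirroring the slide: create -> canary -> promote.
# Values are traffic weights in percent.
STAGES = {
    "create": {"model-v1": 100},
    "canary": {"model-v1": 90, "model-v2": 10},
    "promote": {"model-v2": 100},
}


def pick_model(weights, rng):
    """Pick a model name with probability proportional to its traffic weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(stage, n_requests=10_000, seed=0):
    """Route n_requests through one stage and count requests per model."""
    rng = random.Random(seed)
    return Counter(pick_model(STAGES[stage], rng) for _ in range(n_requests))


canary_counts = simulate("canary")
print(canary_counts)
```

Reverting is then a matter of switching back to the "create" weights, which is why keeping the old model deployed during the canary stage makes rollback cheap.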
  • 15. Tempo: open-source MLOps SDK for data scientists 16/06/2021 ORCHESTRATION AT SCALE 16 https://github.com/SeldonIO/tempo Powerful Inference Orchestration Logic Pluggable Runtimes Custom Python Components ● Create custom business logic for models. ● Use any python expressions/libraries to orchestrate component requests. Data Science Friendly ● Allow any data science library to be used easily. E.g., ○ Custom Models ○ Alibi Explainers ○ Alibi-Detect Outlier Models ○ Multi-Armed Bandits ● Local testing before hand-off to production ● Python first with output to YAML ● Extendable runtimes. ○ Seldon Deploy ○ Seldon Core ○ Docker with Seldon Containers ○ KFServing
  • 17. Case Study: Microsoft & Philips Clinical Drift Monitoring During Covid-19 • ICUs having to make difficult decisions to optimize patient health outcomes. • Built models to predict outcomes such as patient mortality, length of ventilation, length of stay. • Challenges: catching changes to model performance; time-intensive and computationally expensive training pipeline. • Solution needed to be scalable, repeatable and secure: Azure Databricks, Azure DevOps and Alibi Detect. 16/06/2021 MONITORING AT SCALE 18
  • 18. Making your organisation proactive rather than reactive 16/06/2021 MONITORING AT SCALE 19 Service Metrics Statistical Performance Drift and outliers Explainability
  • 19. Service Metrics – Microservice metrics such as requests per second, latency, CPU usage, memory usage, etc – Performance monitoring leveraging Prometheus and ELK – Seldon Deploy Configures Metrics with Prometheus 16/06/2021 MONITORING AT SCALE 20 Model A API (REST, gRPC, Kafka) Request Logs Tracing Production microservice From model weights Model Metrics Why does this matter? Manage compute costs and response times associated with SLAs.
  • 20. Statistical Monitoring – Monitor the impact on business KPIs – Advanced metrics exposed directly by model servers – Metrics can be calculated using “feedback” – Custom metrics can be added by extending metrics servers 16/06/2021 MONITORING AT SCALE 21 Model A Metrics Server Sends Feedback Reads inference data Statistical Metrics Stores inference data Sends inference data Sends correct label Request routing via cloudevent KNative infrastructure Why does this matter? Understand and monitor the impact on your business KPIs.
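The "metrics can be calculated using feedback" idea can be sketched as follows: predictions are stored at inference time, and accuracy is updated once the correct label arrives later. The `FeedbackMetrics` class and its method names are hypothetical stand-ins for a metrics server, not Seldon's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackMetrics:
    """Hypothetical in-memory stand-in for a feedback-driven metrics server."""

    predictions: dict = field(default_factory=dict)  # request id -> predicted label
    correct: int = 0
    total: int = 0

    def log_inference(self, request_id, predicted):
        """Store the prediction so it can be scored when feedback arrives."""
        self.predictions[request_id] = predicted

    def send_feedback(self, request_id, true_label):
        """Compare the stored prediction with the true label and update counts."""
        predicted = self.predictions.pop(request_id)
        self.total += 1
        self.correct += int(predicted == true_label)

    @property
    def accuracy(self):
        return self.correct / self.total if self.total else float("nan")


metrics = FeedbackMetrics()
metrics.log_inference("r1", "fraud")
metrics.log_inference("r2", "ok")
metrics.log_inference("r3", "ok")
metrics.send_feedback("r1", "fraud")
metrics.send_feedback("r2", "fraud")
metrics.send_feedback("r3", "ok")
print(metrics.accuracy)
```

Other statistical metrics (precision, recall, business KPIs) would follow the same pattern of joining delayed labels to logged inference data.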
  • 21. Outlier Monitoring – Detecting anomalies in data instances and flagging/alerting – Identifying potential metadata that could help diagnose outliers – Do outliers indicate there’s an issue with the model or data? – Outlier detection runs as a separate component and can receive input and prediction data from the model. Diagram labels: Model A; Outlier Detector Server; Sends model input data; Stores inference data; Sends inference data; Request routing via cloudevent; KNative infrastructure; Stores Outlier Data; Request + outlier data available. Why does this matter? Outliers are more likely to have a negative impact if acted upon automatically.
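To make the "separate outlier component" concrete, here is a minimal sketch of a detector that is fitted on reference data and then scores each incoming instance. It uses a plain per-feature z-score as an assumed stand-in; it is not one of the Alibi Detect algorithms, and the class name, threshold and toy data are illustrative only.

```python
from statistics import mean, stdev


class ZScoreOutlierDetector:
    """Flags an instance when any feature lies far from the reference mean."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # assumed cut-off, in standard deviations

    def fit(self, reference_rows):
        """Learn per-feature mean and standard deviation from reference data."""
        columns = list(zip(*reference_rows))
        self.means = [mean(c) for c in columns]
        self.stds = [stdev(c) for c in columns]  # assumes no constant feature
        return self

    def score(self, row):
        """Largest absolute z-score across the features of one instance."""
        return max(abs(x - m) / s for x, m, s in zip(row, self.means, self.stds))

    def predict(self, row):
        s = self.score(row)
        return {"is_outlier": s > self.threshold, "score": s}


reference = [(50, 1.0), (52, 1.1), (48, 0.9), (51, 1.0), (49, 1.0)]
detector = ZScoreOutlierDetector().fit(reference)
typical = detector.predict((50.5, 1.05))
extreme = detector.predict((90, 1.0))
print(typical, extreme)
```

Returning the score alongside the flag gives the monitoring system something to store and alert on, in the spirit of the "stores outlier data" step above.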
  • 22. Drift – Over time, live data in production environments differs from the process that generated the training data. – Model performance during deployment no longer matches that observed on held out training data. – Goal is to identify drifts in data distribution and relationships between input and output. 16/06/2021 MONITORING AT SCALE 23 Why does this matter? Model performance has a direct correlation with business value or safety in some use cases.
  • 23. Challenges of online drift detection – In production, data points arrive in sequence, and we need to detect drift ASAP – So how do we decide whether fluctuations are due to drift or are natural? – Statistical hypothesis testing – Windowing strategies. Why does this matter? Detecting drift at the right time enables you to improve performance and reduce costs. Diagram labels: Request routing via cloudevent; KNative infrastructure; Model A; Drift Detector Server; Sends model input data; Sends inference data; Drift Metrics.
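One way to combine hypothesis testing with a windowing strategy is to keep a sliding window of recent values and compare it against reference data with a two-sample Kolmogorov-Smirnov test. The sketch below is an assumption-laden illustration for a single numeric feature: the class name, window size and toy data are hypothetical, and the large-sample critical value (coefficient 1.358 for a 5% significance level) is used instead of an exact p-value.

```python
from bisect import bisect_right
from collections import deque
from math import sqrt


def ks_statistic(a, b):
    """Maximum gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_critical(n, m, c_alpha=1.358):
    """Large-sample critical value; c_alpha=1.358 corresponds to alpha=0.05."""
    return c_alpha * sqrt((n + m) / (n * m))


class WindowedDriftDetector:
    """Compares a sliding window of live values against fixed reference data."""

    def __init__(self, reference, window_size=200):
        self.reference = list(reference)
        self.window = deque(maxlen=window_size)

    def update(self, x):
        """Add one live value; return True if the full window looks drifted."""
        self.window.append(x)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet to run the test
        d = ks_statistic(self.reference, self.window)
        return d > ks_critical(len(self.reference), len(self.window))


reference = [i / 500 for i in range(500)]          # values spread over [0, 1)
detector = WindowedDriftDetector(reference, window_size=200)
stable = [detector.update(i / 200) for i in range(200)]          # same range
shifted = [detector.update(0.5 + i / 400) for i in range(200)]   # upper half only
print(any(stable), shifted[-1])
```

Note that running a test on every new data point inflates the false-alarm rate beyond the nominal 5%, which is part of why online drift detection is harder than a single offline comparison.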
  • 25. Case Study: Explainability for Insurance. Context: Explainability is a critical requirement for all production models. Operations staff require models to be interpretable to justify algorithmic decisions. Before Seldon Deploy: Advanced algorithms cannot be deployed to production due to a lack of interpretability. After Seldon Deploy: Improvements to claims automation and payments processing can be realised as these models can now be made interpretable. EXPLAINABILITY AT SCALE
  • 26. ML models are a black box. Inputs: credit applicant data; medical image. Model. Outputs: lending decision (yes/no); medical diagnosis.
  • 27. Why explain machine learning models? – Build trust in machine learning outputs – Increase transparency – Improve the customer experience – Check for bias – Gain insights for data scientists to understand how models are working – Avoid damage to business reputation – Meet regulatory requirements 16/06/2021 EXPLAINABILITY AT SCALE 28 Why does this matter? Lack of explainability is one of the biggest blockers to production ML and causes of risk in organisations
  • 28. Explaining model predictions Types of explanations – By scope (local vs global) – By model type (black-box vs white-box) – By task (classification, regression, structured prediction) – By data type (tabular, images, text…) – By insight (feature attributions, counterfactuals, influential training instances…) 16/06/2021 29 Image credit: Scott Lundberg (https://github.com/slundberg/shap) Image credit: Barshan et al., RelatIF: Identifying Explanatory Training Examples via Relative Influence (2020) EXPLAINABILITY AT SCALE
  • 29. How can we explain the black-box? Anchors Feature Attribution: what input subsets are necessary for a prediction to hold? [1] [1] Ribeiro et al., Anchors: High-Precision Model-Agnostic Explanations (2018) 16/06/2021 30 Image source: Alibi Explain repository home page EXPLAINABILITY AT SCALE
  • 30. How can we explain the black-box? Counterfactuals How can you (minimally) change input to obtain a desired prediction? [2, 3] [2] Wachter et al., Counterfactual Explanations without Opening the Black Box (2017) [3] Van Looveren A., Klaise J. Interpretable Counterfactual Explanations Guided by Prototypes (2018) 16/06/2021 31 a) Images of digits minimally altered to change a classifier’s prediction b) A person’s attributes minimally altered to change a classifier’s prediction (low income to high income) EXPLAINABILITY AT SCALE
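The question "how can you minimally change the input to obtain a desired prediction?" can be illustrated with a brute-force search over small changes to each feature. This is only a sketch: the toy lending rule `approve`, the feature units and the grid search are hypothetical, and it is not the optimisation used by the cited methods [2, 3].

```python
from itertools import product


def approve(income_k, debt_k):
    """Toy black-box classifier (hypothetical): a simple lending rule."""
    return 5 * income_k - 10 * debt_k >= 100


def nearest_counterfactual(x, predict, target, steps, max_steps=20):
    """Grid search for the smallest L1 change to x that yields the target."""
    best, best_cost = None, float("inf")
    ranges = [range(-max_steps, max_steps + 1)] * len(x)
    for deltas in product(*ranges):
        candidate = tuple(xi + d * s for xi, d, s in zip(x, deltas, steps))
        if min(candidate) < 0:
            continue  # skip implausible negative incomes or debts
        cost = sum(abs(d * s) for d, s in zip(deltas, steps))
        if cost < best_cost and predict(*candidate) == target:
            best, best_cost = candidate, cost
    return best, best_cost


applicant = (30, 10)  # income 30k, debt 10k: rejected by the toy rule
counterfactual, cost = nearest_counterfactual(
    applicant, approve, target=True, steps=(1, 1)
)
print(counterfactual, cost)
```

Here the search finds that reducing debt is the cheapest route to approval under the toy rule; real counterfactual methods add constraints so that the suggested change stays realistic and close to the data.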
  • 31. Which explainer should I use? 16/06/2021 EXPLAINABILITY AT SCALE 32
  • 32. Familiar API in the style of scikit-learn 16/06/2021 33 from alibi.explainers import AnchorTabular explainer = AnchorTabular(predict_fn, feature_names) explainer.fit(X_train) explanation = explainer.explain(x) >>> explanation.meta {'name': 'AnchorTabular', 'type': ['blackbox'], 'explanations': ['local'], 'params': {'seed': None, 'disc_perc': (25, 50, 75), 'threshold': 0.95}} >>> explanation.data {'anchor': ['petal width (cm) > 1.80', 'sepal width (cm) <= 2.80'], 'precision': 0.98, 'coverage': 0.32} EXPLAINABILITY AT SCALE
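The `precision` and `coverage` fields in `explanation.data` can be read as follows: coverage is the share of data the anchor rule applies to, and precision is how often the model gives the same prediction wherever the rule holds. The sketch below estimates both on a handful of hand-written rows; the toy `predict` rule, the rows and the helper name `anchor_metrics` are assumptions, and Alibi itself estimates these quantities by sampling perturbations rather than from a fixed list like this.

```python
def anchor_metrics(rows, predict, anchor, instance):
    """Estimate precision and coverage of a candidate anchor on sample rows."""
    target = predict(instance)                    # prediction being explained
    covered = [r for r in rows if anchor(r)]      # rows where the rule holds
    coverage = len(covered) / len(rows)
    if not covered:
        return float("nan"), coverage
    precision = sum(predict(r) == target for r in covered) / len(covered)
    return precision, coverage


# Hypothetical iris-like rows and a toy stand-in for the black-box model.
rows = [
    {"petal_width": 2.1, "sepal_width": 2.7},
    {"petal_width": 1.9, "sepal_width": 2.5},
    {"petal_width": 2.0, "sepal_width": 3.0},
    {"petal_width": 1.0, "sepal_width": 2.6},
    {"petal_width": 0.3, "sepal_width": 3.4},
]


def predict(row):
    return "virginica" if row["petal_width"] > 1.7 else "other"


def anchor(row):
    # Same form as the anchor shown on the slide.
    return row["petal_width"] > 1.80 and row["sepal_width"] <= 2.80


precision, coverage = anchor_metrics(rows, predict, anchor, rows[0])
print(precision, coverage)
```

A good anchor trades the two off: a very specific rule tends to have high precision but low coverage, which is why the slide's example reports both.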
  • 33. Deploying Alibi Explanations Integration with Seldon Core, Seldon Deploy and KFServing 16/06/2021 34 EXPLAINABILITY AT SCALE
  • 34. Explainability Monitoring – Explanations are useful when paired with a monitoring system. For example, explain why an outlier may have occurred. – View model explanations in the UI – Trigger explanations for specific requests on demand – Close integration with auditing. EXPLAINABILITY AT SCALE
  • 36. Critical infrastructure increasingly depends on ML systems. The impact of a bad solution can be worse than no solution at all. Risk areas shown: cybersecurity attacks; misuse of personal data; software outages; algorithmic bias.
  • 37. Range of varying strategies at a national level 16/06/2021 38 GOVERNANCE AT SCALE
  • 38. Mapping Global AI Ethics 16/06/2021 39 Harvard. 2020. Principled Artificial Intelligence. [ONLINE] Available at: https://cyber.harvard.edu/publication/2020/principled-ai. [Accessed 21 October 2020]. GOVERNANCE AT SCALE
  • 39. EU AI Regulation: what does it mean? – Emphasis on “trustworthy AI” – Categorising risk: regulating “high risk” AI (e.g. autonomous driving) and prohibiting some uses (e.g. mass social scoring) – Currently focuses more on e2e systems, which would apply to the platforms and applied AI projects built within organisations – Post-market monitoring of AI systems to evaluate continued compliance with the regulation. Timespan: expect 2 years, given EU leaders want it to be fast-tracked. GOVERNANCE AT SCALE
  • 40. Principles for Trusted AI The 8 LFAI Principles for Trusted AI (R)REPEATS 16/06/2021 41 Robustness Privacy Reproducibility Equitability Accountability Explainability Transparency Security Adopted by Open Source Projects GOVERNANCE AT SCALE
  • 41. Alignment between capabilities and governance, compliance & AI ethics. Principles: Robustness; Privacy; Reproducibility; Equitability; Accountability; Explainability; Transparency; Security. Capabilities shown against them (in slide order): model metadata; request logging; language wrappers; OpenAPI schema APIs; prepackaged servers; out-of-the-box Prometheus metrics; explainer components; metrics monitoring; RBAC via service account; historical feedback labelling; namespaced access; auth via Ingress; explainer components; metrics monitoring; historical feedback labelling; GitOps integration; request logging; model metadata; request logging; metrics monitoring; model metadata; auth via Ingress; RBAC via service account; model metadata. GOVERNANCE AT SCALE
  • 42. Programmatic governance with open & closed source as policy. Layer 1: ethics frameworks, principles and guidelines (LF AI Principles). Layer 2: regulation, compliance and organisational policy (GDPR, ISO, etc.). Layer 3: open & closed source tools & frameworks. Ensuring principles by design, which can map into higher-level organisational principles and policies. GOVERNANCE AT SCALE
  • 43. Model Metadata Store 16/06/2021 44 GOVERNANCE AT SCALE GitOps Deploy Metadata Store External customer metadata store Discover “find available models” Enrich “Add metadata to models” Lineage/Audit “Check model history” Artifact Store Metadata extraction from artifacts Model Explainer Drift Detector Outlier Detector Automated Why does this matter? Ensure proper governance, auditing and discoverability of models for better compliance and risk management
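The three metadata-store operations named on this slide (discover, enrich, lineage/audit) can be sketched with a small in-memory store. Everything here is hypothetical: the class and method names, the record fields and the example URI are illustrative and do not reflect Seldon Deploy's actual metadata API.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    name: str
    version: str
    artifact_uri: str
    tags: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # append-only audit trail


class MetadataStore:
    """Hypothetical in-memory store keyed by (name, version)."""

    def __init__(self):
        self._records = {}

    def register(self, name, version, artifact_uri):
        record = ModelRecord(name, version, artifact_uri)
        record.history.append(("registered", {"artifact_uri": artifact_uri}))
        self._records[(name, version)] = record
        return record

    def enrich(self, name, version, **tags):
        """Enrich: add metadata to a model and record the change."""
        record = self._records[(name, version)]
        record.tags.update(tags)
        record.history.append(("enriched", dict(tags)))

    def discover(self, **filters):
        """Discover: find available models whose tags match all filters."""
        return [
            r for r in self._records.values()
            if all(r.tags.get(k) == v for k, v in filters.items())
        ]

    def lineage(self, name, version):
        """Lineage/audit: return a copy of the model's history."""
        return list(self._records[(name, version)].history)


store = MetadataStore()
store.register("fraud-detector", "1.0", "s3://example-bucket/fraud/1.0")  # example URI
store.register("churn-model", "2.1", "s3://example-bucket/churn/2.1")     # example URI
store.enrich("fraud-detector", "1.0", team="risk", framework="xgboost")
found = store.discover(team="risk")
print([r.name for r in found], store.lineage("fraud-detector", "1.0"))
```

Keeping the history append-only is the design choice that makes the audit use case work: every enrichment is recorded rather than overwriting what came before.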
  • 45. Reproducibility with GitOps 16/06/2021 46 You can restore state to previous versions GOVERNANCE AT SCALE
  • 46. Final thoughts – As practitioners, we have a growing professional responsibility to our craft – Democratisation through COSS and cloud-native tools – Engage your peers in discussions about Responsible AI – Map Trusted AI principles to your roadmap requirements 16/06/2021 47
  • 47. Get access to production machine learning at scale – Connect with us at #CogX2021 – Product demos at our virtual booth – Free trials for delegates 16/06/2021 48
  • 48. Thank you! Questions? Please use Q&A feature. Contact hello@seldon.io @seldon_io