Feature dri monitoring as a service for
machine learning models at scale
PyData Global 2020
Keira Zhou
Noriaki (Nori) Tatsumi
A feature dri is a change in the joint distribution of a feature and a
target
Covariate shi
Feature distribution change without label distribution change
Prior probability shi
Label distribution change without feature distribution change
Concept shi
Feature and label distribution stay the same but the relationship between the two change
https://towardsdatascience.com/understanding-dataset-shift-f2a5a262a766
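The three cases can be written compactly in the notation commonly used in the dataset-shift literature. This is an added sketch, not taken from the slides: x denotes a feature, y the target, and the subscripts "train" and "prod" (assumed names) denote the training-time and production-time distributions.

```latex
\begin{align*}
\text{Joint distribution:} \quad & P(x, y) = P(y \mid x)\,P(x) = P(x \mid y)\,P(y) \\
\text{Covariate shift:} \quad & P_{\text{train}}(x) \neq P_{\text{prod}}(x), \quad P_{\text{train}}(y \mid x) = P_{\text{prod}}(y \mid x) \\
\text{Prior probability shift:} \quad & P_{\text{train}}(y) \neq P_{\text{prod}}(y), \quad P_{\text{train}}(x \mid y) = P_{\text{prod}}(x \mid y) \\
\text{Concept shift:} \quad & P_{\text{train}}(y \mid x) \neq P_{\text{prod}}(y \mid x), \quad P_{\text{train}}(x) = P_{\text{prod}}(x)
\end{align*}
```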
Why does an enterprise with business-critical ML models need easy
access to a comprehensive feature drift monitoring solution?
• Machine learning is learning from data (i.e. features)
• Many models are very brittle
• Prevent financial loss and harm to the brand of your business
• Not every ML team has the resources to build and maintain a complete monitoring solution
Our feature monitoring service provides statistics- and model-based
metrics and analysis for detecting feature drift
Descriptive statistics
mean, median, min, max, standard deviation, percentiles
Data quality metrics
count, sum, # of NULLs, # of NaNs
Statistics and model based analysis
population stability index (PSI), time-series changepoint and anomaly detection
Interactive User Interfaces
time-series visualization dashboards, API, SQL interface, alerts
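Of the metrics listed, the population stability index (PSI) is the least self-explanatory, so a minimal sketch may help. It assumes equal-width bins derived from the baseline sample and a small floor for empty bins; the function name `psi`, the bin count and the `eps` floor are illustrative choices, not details from the talk.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a baseline and a current sample.

    Assumption: equal-width bins spanning the baseline's range.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # The eps floor keeps empty bins from producing log(0) or a zero divisor.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = list(range(100))
shifted = [v + 30 for v in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

Identical samples give a PSI of zero, and the value grows as the current distribution moves away from the baseline.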
Our feature monitoring service empowers our users to continuously
make sure that their models are performing well
Reactive re-training trigger
Trigger for model developers to investigate potential model degradation
Proactive feature selection
A way to check the volatility of features; volatile features may be dropped or monitored more frequently
Model degradation analysis
Could explain why a model has shifted
The 5 key design decisions for our scalable feature monitoring service
• Features are cataloged in a registry and persisted in standardized formats with timestamps
    • The features must be discoverable and readable
• Client users and applications must bring their own access to features, so that the platform never holds the keys to the kingdom
    • Minimize the blast radius from a potential security event
• Users need to be able to specify their groupby keys to produce meaningful metrics and analysis
    • Aggregation attributes are configurable
• The service is a distributed system with ephemeral processes and a resilient and robust orchestrator
    • Isolate failures across the multiple tenants
• Provide tools to visualize, slice and dice the metrics and analysis
    • Empower users to derive conclusions and decisions
Feature Data Pipeline Architecture
[Architecture diagram. Stages: Feature Compute (Batch, Streaming, API), Feature Storage, Feature Monitoring. Components: Feature Persistence Channel (HTTP/gRPC, HTTP); Ent. Data Ingestion Service; Feature Value (Avro, Parquet, CSV); Ent. File Storage (AWS S3) with Feature Value (Parquet); Ent. Data (Feature) Registry with Feature Metadata; Feature Monitoring as a Service.]
● An Enterprise Data Registry that catalogs each feature’s
ID, data format, schema, location, partition keys, etc
● A unified Enterprise Data Ingestion Service for all feature
compute outputs in various execution contexts that sinks
all data as Parquet files in AWS S3 storage
Feature Monitoring as a Service Architecture
Trigger and Configuration of Feature Statistic Calculation
• An API is the entry point of the pipeline
• Each feature is uniquely identified by a Feature ID
• The API receives a Dataset ID and location from the user
• It retrieves the Feature IDs from the Enterprise Feature Registry
based on the Dataset ID
[Diagram: API, Enterprise Feature Registry]
Trigger and Configuration of Statistic Calculation (Cont’d)
Triggers the PySpark EMR cluster with
configuration parameters
• Dataset location
• Enterprise Dataset Unique ID
• Enterprise Feature IDs
• Temporary Client Credentials: to access
the dataset
• Partition Timestamp (ETL time): when
the features were calculated
• Field Timestamp (event time): indicates
which field is the event timestamp
• Aggregation Fields: the fields to
aggregate and produce stats on
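The parameter list above could be carried as a single configuration object. The sketch below is purely illustrative: the class name `StatsJobConfig`, the field names, the value formats and the placeholder values are all assumptions, since the talk does not show the actual payload.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StatsJobConfig:
    """Hypothetical container for the parameters handed to the stats job."""

    dataset_location: str               # where the dataset lives
    dataset_id: str                     # Enterprise Dataset Unique ID
    feature_ids: List[str]              # Enterprise Feature IDs from the registry
    client_credentials: Dict[str, str]  # temporary credentials supplied by the client
    partition_timestamp: str            # ETL time: when the features were calculated
    field_timestamp: str                # name of the field holding the event time
    aggregation_fields: List[str]       # fields to aggregate and produce stats on

    def __post_init__(self):
        # Fail fast on configurations that cannot produce any statistics.
        if not self.feature_ids:
            raise ValueError("at least one feature ID is required")
        if not self.aggregation_fields:
            raise ValueError("at least one aggregation field is required")


# All values below are placeholders for illustration only.
config = StatsJobConfig(
    dataset_location="s3://example-bucket/biking/",
    dataset_id="dataset-123",
    feature_ids=["feature-1", "feature-2"],
    client_credentials={"token": "temporary-token"},
    partition_timestamp="202009",
    field_timestamp="event_time",
    aggregation_fields=["event_time"],
)
print(config.dataset_id, config.feature_ids)
```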
Biking Length (mile)   Biking Elevation (ft)   Event Time   ETL Time
5                      243                     202005       202009
10                     100                     202005       202009
8                      185                     202006       202009
20                     320                     202007       202010
15                     231                     202008       202010

Agg by Event Time

Avg Biking Length (mile)   Avg Biking Elevation (ft)   Event Time
7.5                        171.5                       202005
8                          185                         202006
20                         320                         202007
15                         231                         202008

Agg by ETL Time

Avg Biking Length (mile)   Avg Biking Elevation (ft)   ETL Time
7.67                       176                         202009
17.5                       275.5                       202010
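The two aggregations in the biking example can be reproduced with a few lines of plain Python. This is a sketch for checking the arithmetic only; the service itself runs PySpark on EMR, and the column names and the helper `average_by` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# The five rows of the biking example; column names are illustrative.
ROWS = [
    {"length_mile": 5, "elevation_ft": 243, "event_time": "202005", "etl_time": "202009"},
    {"length_mile": 10, "elevation_ft": 100, "event_time": "202005", "etl_time": "202009"},
    {"length_mile": 8, "elevation_ft": 185, "event_time": "202006", "etl_time": "202009"},
    {"length_mile": 20, "elevation_ft": 320, "event_time": "202007", "etl_time": "202010"},
    {"length_mile": 15, "elevation_ft": 231, "event_time": "202008", "etl_time": "202010"},
]

FIELDS = ["length_mile", "elevation_ft"]


def average_by(rows, key, fields):
    """Group rows by `key` and average each field, rounded to 2 decimals."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        group: {f: round(mean(r[f] for r in members), 2) for f in fields}
        for group, members in sorted(groups.items())
    }


by_event = average_by(ROWS, "event_time", FIELDS)
by_etl = average_by(ROWS, "etl_time", FIELDS)
print(by_event)
print(by_etl)
```

The same rows yield four groups when keyed by event time and two when keyed by ETL time, which is why the choice of aggregation key is left to the user.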
Distributed Stats Calculation
• Stats calculated
    • min, max, average, standard deviation
    • median, 25% & 75% quantiles
    • count, # of null, # of NaN
    • PSI
• Runs on EMR
    • Ephemeral
    • Separate cluster per calculation
• All the results are
    • Sent to Enterprise Kafka Cluster
    • Saved into Enterprise Managed S3
    • Saved into Postgres Database
• All stats are connected to a job ID
    • Easier debugging
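The statistics listed above (PSI aside) can be sketched for a single column in plain Python. This stands in for the distributed PySpark job, which the talk does not show; the function name `column_stats`, the output keys, the use of the sample standard deviation, the "inclusive" quantile method and counting all rows in `count` are assumptions.

```python
import math
import statistics


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def column_stats(values, job_id):
    """Descriptive and data-quality stats for one column, tagged with a job ID."""
    # Nulls and NaNs are counted separately, then excluded from the numeric stats.
    clean = sorted(v for v in values if v is not None and not _is_nan(v))
    q25, median, q75 = statistics.quantiles(clean, n=4, method="inclusive")
    return {
        "job_id": job_id,  # every stat carries the job ID to ease debugging
        "count": len(values),
        "nulls": sum(v is None for v in values),
        "nans": sum(_is_nan(v) for v in values),
        "min": clean[0],
        "max": clean[-1],
        "mean": statistics.mean(clean),
        "stddev": statistics.stdev(clean),  # sample standard deviation
        "q25": q25,
        "median": median,
        "q75": q75,
    }


stats = column_stats([5, 10, None, 8, float("nan"), 20, 15], job_id="job-001")
print(stats)
```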
Postgres Table Design
• Feature Stats table
    • Stores all computed stats
    • Parent - Child table design based on feature name
• Feature Stats Job Status table
    • Tracks the status of a job
    • Updated by Trigger API, PySpark job and Ingestion Engine
[Diagram: a Parent Table holding feature_1_table_pointer and feature_2_table_pointer, which point to feature_1_child_Table and feature_2_child_Table]
Managed Kubernetes Cluster
• Most of our components run on a managed Kubernetes cluster in AWS
• Namespaces: a personal namespace per individual, a team namespace per team
• Helm charts to configure the different environments: dev, qa, prod
• Skaffold to build, push and deploy the application
[Diagram: Dockerized Java Application, Internal Dockyard, Kubernetes Cluster]
Monitoring Statistics Serving Interface
• Dashboard
    • A clear centralized view of various feature statistics
    • Connects to the Postgres DB
• GraphQL API
    • Retrieve stats of a given feature
    • Good for customized plotting
    • Integration with Jupyter notebook or other applications
[Dashboard screenshot: statistics aggregated by event time from two different partitions]
The 5 key design decisions for our scalable feature monitoring service
• Features are cataloged in a registry and persisted in standardized formats with timestamps
    • The features must be accessible, identifiable and readable
    • i.e. Standard time-series ingestion pipeline with Parquet output and features registered in the enterprise Feature Registry
• Client users and applications must bring their own access to features, so that the platform never holds the keys to the kingdom
    • Minimize the blast radius from a potential security event
    • i.e. Borrow clients' temporary AWS STS tokens and track the activity in the audit log
• Users need to be able to specify their groupby keys to produce meaningful metrics and analysis
    • Aggregation attributes are configurable
    • i.e. Enable users to configure the aggregation key per feature via REST API
• The service is a distributed system with ephemeral processes and a resilient and robust orchestrator
    • Isolate failures across the multiple tenants
    • i.e. Usage of ephemeral EMR instances for Spark jobs and microservices orchestrated by K8s
• Provide tools to visualize, slice and dice the metrics and analysis
    • Empower users to derive conclusions and decisions
    • i.e. Time-series visualization with Grafana and data-driven GraphQL API for interacting with the Monitoring Service
Thank you!
