CD4ML and the challenges of testing and quality in ML systems

TensorFlow London Meetup, May 2020
Danilo Sato
@dtsato
©ThoughtWorks 2020 - @dtsato
TensorFlow London Meetup - May 28, 2020

7000+ technologists with 43 offices in 14 countries
We help clients become Modern Digital Businesses
DELIVER VALUE, MOVE FAST, THINK BIG

#1 in Agile and Continuous Delivery
100+ books written

Techniques
Continuous delivery for machine learning (CD4ML)
TRIAL
7
https://www.thoughtworks.com/radar

PRODUCTIONIZING ML IS HARD
CD4ML isn’t a technology or a tool; it is a practice and a set of principles. Quality is built into software and improvement is always possible.
But machine learning systems have unique challenges; unlike deterministic software, it is difficult, or impossible, to understand the behavior of data-driven intelligent systems. This poses a huge challenge when it comes to deploying machine learning systems in accordance with CD principles.
Production systems should be:
● Reproducible
● Testable
● Auditable
● Continuously Improving
Machine Learning is:
● Non-deterministic
● Hard to test
● Hard to explain
● Hard to improve
HOW DO WE APPLY DECADES OF SOFTWARE DELIVERY EXPERIENCE TO INTELLIGENT SYSTEMS?

MANY SOURCES OF CHANGE
Data + Model + Code
● Data: Schema, Sampling over Time, Volume
● Model: Algorithms, More Training, Experiments
● Code: Business Needs, Bug Fixes, Configuration

“Continuous Delivery is the ability to get changes of all types, including new features, configuration changes, bug fixes and experiments, into production, or into the hands of users, safely and quickly in a sustainable way.”
- Jez Humble & Dave Farley

PRINCIPLES OF CONTINUOUS DELIVERY
→ Create a Repeatable, Reliable Process for Releasing Software
→ Automate Almost Everything
→ Build Quality In
→ Work in Small Batches
→ Keep Everything in Source Control
→ Done Means “Released”
→ Improve Continuously

DOING CD4ML IS STILL A HARD PROBLEM
TECHNICAL COMPONENTS OF CD4ML
Implementation requires lots of tools, technologies, and architecture decisions to fully automate the end-to-end process. This presentation will focus on the testing and quality aspects of CD4ML.
● DISCOVERABLE AND ACCESSIBLE DATA
● REPRODUCIBLE MODEL TRAINING
● EXPERIMENTS TRACKING
● ELASTIC INFRASTRUCTURE
● VERSION CONTROL & ARTIFACTS REPOS
● MODEL SERVING
● MODEL DEPLOYMENT
● TESTING & QUALITY
● MONITORING & OBSERVABILITY
● CD ORCHESTRATION
https://martinfowler.com/articles/cd4ml.html

“CLASSIC” SOFTWARE TEST PYRAMID
[Figure: test pyramid with, from top to bottom, UI Tests, Service Tests and Unit Tests; side axes labelled Speed and Cost]
https://martinfowler.com/bliki/TestPyramid.html

AS SOFTWARE BECAME MORE COMPLEX
https://martinfowler.com/articles/microservice-testing

TESTING IN PRODUCTION
https://sookocheff.com/post/architecture/testing-in-production/

Data + Model + Code
??

TESTS FOR DATA
Data Pipeline
Data/Feature Validation
Unit Tests (Transformations, Engineered Features)
- Adherence to schemas
- Features can be used
- Schema versioning and compatibility
- Integration tests against (small) sample input
- Adherence to privacy controls
- On-demand quality checks
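
As an illustration of the "adherence to schemas" check, here is a minimal sketch in plain Python. The schema, the column names and the helper functions validate_row and validate_sample are all hypothetical and invented for this example; a real pipeline would more likely rely on a dedicated data validation library.

```python
import math

# Hypothetical schema (an assumption for this sketch):
# column name -> (expected type, nullable)
SCHEMA = {
    "user_id": (int, False),
    "age": (float, True),
    "country": (str, False),
}


def validate_row(row, schema=SCHEMA):
    """Return a list of human-readable schema violations for one record."""
    errors = []
    for column, (expected_type, nullable) in schema.items():
        if column not in row:
            errors.append(f"missing column: {column}")
            continue
        value = row[column]
        if value is None:
            if not nullable:
                errors.append(f"null in non-nullable column: {column}")
            continue
        if not isinstance(value, expected_type):
            errors.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif isinstance(value, float) and not math.isfinite(value):
            errors.append(f"{column}: non-finite value")
    # Columns the schema does not know about are reported as well.
    unexpected = sorted(set(row) - set(schema))
    errors.extend(f"unexpected column: {c}" for c in unexpected)
    return errors


def validate_sample(rows, schema=SCHEMA):
    """Map row index -> violations, only for rows that have any."""
    report = {}
    for i, row in enumerate(rows):
        errors = validate_row(row, schema)
        if errors:
            report[i] = errors
    return report
```

Run against a small sample input, validate_sample can act as the integration-style check from the list above: the pipeline step fails when the returned report is not empty.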

TESTS FOR MODEL
- Compare against a simple model
- Numerical stability (behaviour when NaN or infinite values appear)
Unit Tests (Model Specification)
Model Quality
ML Training Pipeline
- Training is reproducible (watch out for sources of non-determinism, e.g. RNG seeds, initialization order)
- Integration test
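
The model checks above can be sketched as ordinary unit tests. The example below is an assumption-laden toy, not the speaker's implementation: train_linear, predict and mse are hypothetical helpers, the data is invented, and a one-variable linear model stands in for a real one. It shows a reproducibility test (all randomness flows from one seed), a comparison against a simple baseline, and a check of behaviour on NaN or infinite input.

```python
import math
import random


def train_linear(xs, ys, seed, epochs=200, lr=0.05):
    """Fit y ~ w*x + b with SGD; all randomness comes from `seed`."""
    rng = random.Random(seed)        # local RNG, no hidden global state
    w, b = rng.uniform(-1, 1), 0.0   # random initialization
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)           # random sample order
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b


def predict(params, x):
    """Reject non-finite inputs instead of silently propagating NaN."""
    if not math.isfinite(x):
        raise ValueError("non-finite feature value")
    w, b = params
    return w * x + b


def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Toy data, invented for illustration: y = 2x + 1 on [0, 1].
XS = [i / 10 for i in range(11)]
YS = [2 * x + 1 for x in XS]


def test_training_is_reproducible():
    # Same data and same seed must give identical parameters.
    assert train_linear(XS, YS, seed=42) == train_linear(XS, YS, seed=42)


def test_beats_simple_baseline():
    params = train_linear(XS, YS, seed=42)
    model_error = mse([predict(params, x) for x in XS], YS)
    mean_y = sum(YS) / len(YS)       # baseline: always predict the mean
    baseline_error = mse([mean_y] * len(YS), YS)
    assert model_error < baseline_error


def test_rejects_non_finite_input():
    params = train_linear(XS, YS, seed=42)
    for bad in (float("nan"), float("inf")):
        try:
            predict(params, bad)
        except ValueError:
            continue
        raise AssertionError("expected ValueError")
```

Passing the random generator explicitly is one way to keep seeds from leaking through global state; with a real framework, other sources of non-determinism (parallelism, hardware kernels) may need separate handling.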

Data + Model + Code

TESTING WHERE THEY OVERLAP
[Diagram labels: Model Performance; Contract Tests; Model Bias and Fairness; Data Pipeline (Unit Tests); Model Quality (Unit Tests); UI Tests, Service Tests, Unit Tests]
- Model evaluation against different validation datasets
- Thresholds for model metrics and execution performance
- Different data slices
- Feature generation is the same for training/serving
- Model contract is adhered to in production
- When model is exported, test it still works
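
A minimal sketch of "thresholds for model metrics" combined with "different data slices" follows. The threshold values, the record layout (dicts with label and prediction keys) and the function names are assumptions made for this example; accuracy is used only because it is the simplest metric to show.

```python
from collections import defaultdict

# Hypothetical thresholds; real values would come from the team's requirements.
MIN_OVERALL_ACCURACY = 0.80
MIN_SLICE_ACCURACY = 0.70


def accuracy_by_slice(records, slice_key):
    """records: dicts with 'label', 'prediction' and the slicing feature."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[slice_key]] += 1
        hits[r[slice_key]] += int(r["label"] == r["prediction"])
    return {s: hits[s] / totals[s] for s in totals}


def check_thresholds(records, slice_key):
    """Return a list of failures; an empty list means all thresholds hold."""
    failures = []
    overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
    if overall < MIN_OVERALL_ACCURACY:
        failures.append(
            f"overall accuracy {overall:.2f} below {MIN_OVERALL_ACCURACY}"
        )
    for name, acc in sorted(accuracy_by_slice(records, slice_key).items()):
        if acc < MIN_SLICE_ACCURACY:
            failures.append(
                f"slice {name!r} accuracy {acc:.2f} below {MIN_SLICE_ACCURACY}"
            )
    return failures
```

The point of the per-slice check is that a model can clear the overall threshold while still underperforming on one group, which is where bias and fairness problems tend to hide.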

[Diagram labels added to the previous slide: End-to-End Tests; Production Monitoring; Exploratory Tests]
- Model degradation
- Training/serving skew
- Operational metrics (latency, throughput, resource usage)
- Real impact! (KPIs)
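
One possible way to monitor training/serving skew for a single numeric feature is the population stability index (PSI), sketched below. PSI is a technique chosen for this illustration, not one named in the talk, and the function names, bin count and smoothing constant are assumptions.

```python
import math


def histogram(values, edges):
    """Share of values per bin; values outside the edges go to the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = 0
        while i < len(counts) - 1 and v >= edges[i + 1]:
            i += 1
        counts[i] += 1
    return [c / len(values) for c in counts]


def population_stability_index(training, serving, bins=10, eps=1e-6):
    """PSI = sum((s - t) * ln(s / t)) over equal-width bins of the training range.

    `eps` is a small smoothing constant so that empty bins do not cause
    a division by zero or log(0).
    """
    lo, hi = min(training), max(training)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    t_share = histogram(training, edges)
    s_share = histogram(serving, edges)
    return sum(
        (s - t) * math.log((s + eps) / (t + eps))
        for t, s in zip(t_share, s_share)
    )
```

A monitoring job could compute this per feature on a schedule and alert when the value crosses a chosen limit; 0.25 is a commonly quoted rule of thumb for a significant shift, but the right limit depends on the feature and the application.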

“Inspection does not improve the quality, nor guarantee quality. Inspection is too late. The quality, good or bad, is already in the product.”
- W. Edwards Deming

QUESTIONS?

WORKSHOPS, PRESENTATIONS & ARTICLES
Workshops:
https://github.com/ThoughtWorksInc/cd4ml-workshop
https://github.com/ThoughtWorksInc/CD4ML-Scenarios
Articles:
https://martinfowler.com/articles/cd4ml.html
https://www.thoughtworks.com/insights/articles/intelligent-enterprise-series-cd4ml
Paper:
“The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction”, Breck et al. (Google)

THANK YOU!
Danilo Sato (dsato@thoughtworks.com)
@dtsato
