This document outlines 15 lessons learned from building large-scale machine learning systems in the real world. Some key challenges discussed include data scientists not being well-suited for engineering work, traditional development methodologies not working for machine learning, the difficulty of data labeling and feature extraction, and the complexities of training, executing, operationalizing, and securing machine learning models at scale. The document provides ideas to address these challenges such as establishing separate data science and engineering teams, implementing automated data labeling strategies, leveraging centralized feature stores, and adopting techniques like transfer learning and continual learning.
3. Agenda
• Myths and realities of machine learning solutions in the real world
• 15 lessons I learned when building large-scale machine learning systems
  • Challenge
  • What we learned
  • Solution
12. We are dealing with a new app lifecycle…
• Traditional app lifecycle: Design → Implementation → Deployment → Management/Monitoring
• Machine learning app lifecycle: Experimentation → Model Creation → Training → Testing → Regularization → Deployment → Monitoring → Optimization
14. The Aspects of a Machine Learning Solution that will Drive You Crazy
• Strategy & processes
• Data engineering
• Experimentation
• Model training
• Model operationalization
• Runtime execution
• Security
• Lifecycle management
• Optimization…
18. Challenges
• Data scientists are great at experimentation, not so much at writing high-quality code
• Deep learning frameworks built for experimentation don’t necessarily make great production frameworks, ex: PyTorch vs. TensorFlow
19. Some Ideas to Consider
• Divide data science and data engineering teams
• Data science team: write notebooks and experimentation models
• Engineering team: refactor or rewrite models for production environments; automate training and optimization jobs
• DevOps team: deploy models; monitor, retrain, and optimize models
21. Challenges
• Waterfall methods don’t work because you rarely know which machine learning methods will work for a specific problem
• Agile methods don’t work because you need very specific requirements
22. Some Ideas to Consider
• Split the development lifecycle into agile and waterfall iterations: Agile → Waterfall → Agile
24. Lesson #3: Feature extraction can become a reusability nightmare…
25. Challenges
• Different models require the same features from a dataset
• Feature extraction jobs are computationally expensive
• Different teams create proprietary ways to capture and store feature information
26. Some Ideas to Consider
• Implement a centralized feature store
• Leverage representation learning to extract relevant features from a dataset
• Look for reference architectures, ex: Uber’s Michelangelo
• Architecture: Dataset Preparation Jobs 1…N → Representation Learning Tasks → Feature Store → Models 1…N
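A minimal sketch of the centralized-feature-store idea, assuming an in-memory, dict-backed design; real systems such as Uber’s Michelangelo use dedicated storage and serve features both online and offline. The class and method names here are invented for illustration.

```python
# Hypothetical in-memory feature store: preparation jobs write
# features once, and any model reads them, avoiding duplicated
# extraction work across teams.

class FeatureStore:
    """Stores computed features keyed by (entity_id, feature_name)."""

    def __init__(self):
        self._features = {}

    def put(self, entity_id, feature_name, value):
        # Called by dataset-preparation jobs after feature extraction.
        self._features[(entity_id, feature_name)] = value

    def get(self, entity_id, feature_names):
        # Called by any model that needs the shared features.
        return {name: self._features[(entity_id, name)] for name in feature_names}


store = FeatureStore()
store.put("user-42", "avg_session_minutes", 12.5)
store.put("user-42", "purchases_30d", 3)

# Two different models can now reuse the same computed features.
features = store.get("user-42", ["avg_session_minutes", "purchases_30d"])
```

The key design point is that extraction runs once per feature, not once per model.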
28. Challenges
• Data experts spend a lot of time labeling datasets
• The logic for data labeling is often not reusable
• Subjective data labeling strategies fail to differentiate between useful and useless features
29. Some Ideas to Consider
• Implement an automated data labeling strategy
• Generative learning can help to structure more effective labels
• Project Snorkel is one of the leading automated data labeling frameworks in the market
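A hedged sketch of the Snorkel-style approach: several noisy "labeling functions" vote on each example and their votes are combined into a training label. Snorkel’s real label model is generative, not the plain majority vote used below, and the labeling functions here are invented for illustration.

```python
# Toy weak-supervision labeler: each labeling function returns a
# class (1 = complaint, 0 = not a complaint) or abstains with None.

ABSTAIN = None

def lf_contains_refund(text):
    return 1 if "refund" in text else ABSTAIN

def lf_contains_thanks(text):
    return 0 if "thanks" in text else ABSTAIN

def lf_exclamation(text):
    return 1 if "!" in text else ABSTAIN

def majority_label(text, lfs):
    # Combine the non-abstaining votes with a simple majority.
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_contains_refund, lf_contains_thanks, lf_exclamation]
label = majority_label("I want a refund now!", lfs)  # two LFs vote 1
```

Because the labeling logic lives in small reusable functions, it can be shared across teams instead of being re-implemented per project.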
32. Challenges
• Enterprises like to standardize on a single machine learning framework
• Different teams have different technology preferences
• Providing a consistent machine learning platform across different machine learning frameworks is no easy task
33. Some Ideas to Consider
• Optimize for productivity, not consistency
• Enable enough flexibility to leverage different frameworks for experimentation and production
• ONNX is a great solution for intermediate representations
• Pipeline: Experimentation Framework → Intermediate Representation → Production Framework
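A toy stand-in for what an intermediate representation such as ONNX provides: the experimentation framework exports a model as a framework-neutral artifact, and a separate production runtime re-executes it. ONNX itself uses protobuf and a rich operator set; this sketch only illustrates the decoupling idea, with JSON standing in for the IR.

```python
import json

# "Export" from the experimentation framework: a linear model
# y = w · x + b, serialized in a framework-neutral format.
exported = json.dumps({"op": "linear", "weights": [0.5, -1.0], "bias": 2.0})

def run_in_production(serialized_model, x):
    """A separate 'production runtime' that only understands the IR."""
    model = json.loads(serialized_model)
    assert model["op"] == "linear"
    return sum(w * xi for w, xi in zip(model["weights"], x)) + model["bias"]

y = run_in_production(exported, [2.0, 1.0])  # 0.5*2 - 1.0*1 + 2.0 = 2.0
```

The experimentation and production sides never share code, only the serialized representation, which is what lets teams pick different frameworks on each side.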
35. Challenges
• Notebooks are ideal for model experimentation and testing
• Notebooks typically have performance challenges when executed at scale
• Scaling notebook environments can be challenging
• Parametrizing notebook executions is far from trivial
36. Some Ideas To Consider
• Enable an infrastructure to operationalize data science notebooks
• Use containers for the most complex machine learning workflows
• Model experimentation: Jupyter, Zeppelin
• Scheduling notebooks: Papermill, Netflix’s Meson
• Running complex workflows: Docker containers, Kubernetes
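A rough sketch of what Papermill does when it parametrizes a notebook: it injects a parameters cell at the top and then executes the cells. Here the "notebook" is just a list of code strings run in a shared namespace; Papermill’s real entry point is `papermill.execute_notebook(input_path, output_path, parameters={...})`.

```python
# Toy parametrized-notebook runner, assuming cells are plain code
# strings; this only mimics Papermill's parameter injection step.

notebook_cells = [
    "result = learning_rate * epochs",  # a cell that uses parameters
]

def execute_notebook(cells, parameters):
    # Inject a "parameters cell" first, then run every cell in order
    # inside one shared namespace, as a notebook kernel would.
    namespace = {}
    injected = [f"{k} = {v!r}" for k, v in parameters.items()]
    for cell in injected + list(cells):
        exec(cell, namespace)
    return namespace

ns = execute_notebook(notebook_cells, {"learning_rate": 0.1, "epochs": 10})
```

Parameter injection is what turns an interactive notebook into a schedulable, reusable job.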
38. Challenges
• Data scientists make very subjective decisions when it comes to model selection
• The same problem can be solved using different machine learning models
• Very often it is almost impossible to differentiate between similar models
39. Some Ideas To Consider
• Represent machine learning requirements as a dataset with an objective attribute
• Leverage AutoML-based techniques for model selection
• Flow: Problem dataset → AutoML → Proposed models
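A hedged sketch of AutoML-style model selection: given a dataset with an objective attribute, evaluate several candidate models on a holdout split and propose the best one by an objective metric. Real AutoML systems also search hyperparameters and architectures; the two toy candidates below are invented for illustration.

```python
# Two trivial candidate "models" trained on the objective attribute y.
def mean_model(train_y):
    mean = sum(train_y) / len(train_y)
    return lambda x: mean

def last_value_model(train_y):
    last = train_y[-1]
    return lambda x: last

def mse(model, xs, ys):
    # The objective metric: mean squared error on held-out data.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [0, 1, 2, 3]
ys = [1.0, 3.0, 2.0, 2.0]          # y is the objective attribute
train_y = ys[:2]                    # fit on the first half...
test_x, test_y = xs[2:], ys[2:]     # ...score on the second half

candidates = {
    "mean": mean_model(train_y),
    "last": last_value_model(train_y),
}
best_name = min(candidates, key=lambda n: mse(candidates[n], test_x, test_y))
```

Selecting by a measured objective replaces the subjective "I prefer this model" decision the challenge slide describes.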
42. Challenges
• The No Free Lunch Theorem
• Trained models can perform poorly against new datasets
• New engineers and DevOps need to understand how to re-train existing models
43. Some Ideas to Consider
• Automate training jobs
• Orchestrate scheduled execution of training jobs
• Architecture: Data Lake → Data Outcomes/Feature Store → Training Jobs 1…N
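A minimal sketch of orchestrating scheduled training jobs, using Python’s standard-library scheduler; production systems would typically use cron, Airflow, or a workflow engine instead, and the job body here is a placeholder.

```python
import sched
import time

trained_models = []

def training_job(job_id):
    # Placeholder: pull data from the feature store and fit a model.
    trained_models.append(f"model-from-job-{job_id}")

scheduler = sched.scheduler(time.monotonic, time.sleep)

# Schedule three training jobs a few milliseconds apart.
for job_id, delay in enumerate([0.0, 0.01, 0.02], start=1):
    scheduler.enter(delay, 1, training_job, argument=(job_id,))

scheduler.run()  # blocks until all scheduled jobs have executed
```

The point is that retraining becomes a recurring, automated event rather than a manual task a new engineer has to rediscover.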
45. Challenges
• Training machine learning models can be computationally expensive
• Most machine learning models need to be retrained entirely when new data arrives
• It’s nearly impossible to quantify the impact that new datasets have on the performance of a model
46. Some Ideas to Consider
• Implement continual learning models
• Consider transfer learning as a fundamental enabler
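A hedged sketch of the continual-learning idea: instead of retraining from scratch when new data arrives, the model state is updated incrementally. Here the "model" is just a running mean so the update rule is exact; real continual learning updates network weights, and transfer learning reuses pretrained weights as the starting point.

```python
class RunningMeanModel:
    """Toy model whose state can absorb new data incrementally."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, batch):
        # Incremental update: old data never needs to be revisited,
        # which is what avoids the full, expensive retraining pass.
        for y in batch:
            self.count += 1
            self.mean += (y - self.mean) / self.count

    def predict(self):
        return self.mean

model = RunningMeanModel()
model.update([1.0, 2.0, 3.0])   # initial training data
model.update([4.0, 5.0])        # new data arrives later
```

After both updates the model equals one trained on all five points at once, without reprocessing the original batch.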
48. Challenges
• Data engineers spend a lot of time writing training routines for machine learning models
• Comparing the performance of different models on the same datasets remains tricky
• Changes to a training dataset often imply changes to the training code
49. Some Ideas to Consider
• Explore a configuration-driven training process
• Uber’s Ludwig is an innovative, no-code framework for training machine learning models
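A hedged sketch of configuration-driven training in the spirit of Uber’s Ludwig: a declarative config, rather than code, names the input and output features, so dataset changes become config changes. The config schema and trivial "trainer" below are invented for illustration and are not Ludwig’s actual format.

```python
# Hypothetical declarative training config: no training code needed
# from the user, only a description of the data.
config = {
    "input_features": ["rooms", "area"],
    "output_feature": "price",
}

def train_from_config(config, rows):
    # A deliberately trivial trainer: predict the output feature by
    # its mean. A real framework would build a full model from the
    # declared feature types.
    targets = [row[config["output_feature"]] for row in rows]
    prediction = sum(targets) / len(targets)
    return {"predict": lambda row: prediction}

rows = [
    {"rooms": 2, "area": 50, "price": 100.0},
    {"rooms": 3, "area": 70, "price": 200.0},
]
model = train_from_config(config, rows)
```

Swapping the dataset or the target column only touches `config`, which addresses the "dataset changes imply code changes" challenge above.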
52. Challenges
• Not all models can be executed via APIs
• Some models take a long time to run
• In some scenarios, different models need to be executed at the same time based on a specific condition
53. Some Ideas to Consider
• Enable different execution modes based on the client’s requirements: scheduled activation, pub-sub activation, and on-demand activation
• Route model execution through a model API gateway and an event gateway
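A sketch of the pub-sub activation mode, assuming a minimal in-memory event gateway; a production deployment would use a broker such as Kafka, and the models and topic names here are invented for illustration.

```python
class EventGateway:
    """Toy event gateway: publishing an event activates every model
    subscribed to that event type."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, model_fn):
        self._subscribers.setdefault(topic, []).append(model_fn)

    def publish(self, topic, payload):
        # Run all subscribed models on the event payload.
        return [model(payload) for model in self._subscribers.get(topic, [])]

gateway = EventGateway()
# Two models react to the same condition, as the challenge describes.
gateway.subscribe("new-transaction", lambda tx: ("fraud-score", tx["amount"] / 10))
gateway.subscribe("new-transaction", lambda tx: ("limit-check", tx["amount"] < 500))

results = gateway.publish("new-transaction", {"amount": 100})
```

The same models could also be fronted by an API gateway for on-demand calls or a scheduler for batch activation; only the trigger differs.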
55. Challenges
• Centralized cloud deep learning models don’t scale
• On-device deep learning models are hard to distribute and train
• Tons of privacy challenges
56. Some Ideas to Consider
• Consider using federated learning or similar patterns for mobile-based machine learning
60. Some Ideas to Consider
• Establish systematic practices to debug machine learning models
• Onboard model visualization and interpretability tools
• Visualize the network and its results: use tools like TensorBoard to visualize the structure of neural networks
• Compare training and test errors: high training error is a sign of underfitting; high test error with low training error is a sign of overfitting
• Test with small datasets: helps to determine whether the error is in the code or in the data
• Monitor activations and gradient values: monitor the number of activations in hidden units
• Interpretability: understand how nodes are activated, what hidden layers do, and how concepts are formed
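The "compare training and test errors" practice can be sketched as a tiny diagnostic rule; the threshold below is illustrative, not universal, and real debugging would look at learning curves rather than two point estimates.

```python
def diagnose(train_error, test_error, threshold=0.1):
    # High training error: the model cannot even fit the data it saw.
    if train_error > threshold:
        return "underfitting"
    # Low training error but high test error: the model memorized
    # the training data instead of generalizing.
    if test_error > threshold:
        return "overfitting"
    return "ok"

print(diagnose(train_error=0.30, test_error=0.32))  # underfitting
print(diagnose(train_error=0.02, test_error=0.25))  # overfitting
```

Making this check an explicit, automated step is one way to turn model debugging into a systematic practice rather than intuition.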
63. Challenges
• Most neural networks are vulnerable to adversarial attacks
• Attackers don’t need access to the models; they can simply manipulate input datasets
• Most of the time, adversarial attacks go undetected
64. Some Ideas to Consider
• Test your neural networks for adversarial robustness
• IBM’s Adversarial Robustness Toolbox is one of the leading stacks in neural network security
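A hedged sketch of an adversarial robustness check on a toy linear classifier, using an FGSM-style perturbation (move each input dimension against the sign of its weight). IBM’s Adversarial Robustness Toolbox provides real attacks of this family for deep networks; everything below is a from-scratch illustration.

```python
def predict(w, x):
    # Toy linear classifier: sign of the weighted sum.
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1

def fgsm_like(w, x, eps):
    # Perturb each dimension by eps against the weight's sign, which
    # maximally lowers the score within an L-infinity budget of eps.
    sign = lambda v: 1 if v > 0 else (-1 if v < 0 else 0)
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w, x = [2.0, -1.0], [1.0, 1.0]
clean_pred = predict(w, x)                        # score 1.0 -> class 1
adv_pred = predict(w, fgsm_like(w, x, eps=0.5))   # score -0.5 -> class -1
robust = clean_pred == adv_pred                   # False: prediction flips
```

A robustness test suite would sweep `eps` and report the smallest perturbation that flips each prediction; note the attacker never needed the training data, only the input.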
65. Lesson #15: Data privacy is the elephant in the machine learning room
66. Challenges
• Machine learning models intrinsically build knowledge about private datasets
• Most machine learning techniques require clear access to data which, in many cases, contains sensitive information
• There are no established techniques for evaluating the privacy robustness of machine learning models
67. Some Ideas to Consider
• Private machine learning is an emerging area of research
• Leverage techniques such as secure multi-party computation or zero-knowledge proofs to obfuscate training datasets
• PySyft is an emerging framework to enable privacy in machine learning models
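A hedged sketch of additive secret sharing, one building block of secure multi-party computation: each party splits its private value into random shares so an aggregate (here, a sum) can be computed without any party revealing its input. Frameworks such as PySyft implement production-grade versions of this idea; the code below is a toy illustration.

```python
import random

PRIME = 2_147_483_647  # all arithmetic is done modulo a large prime

def share(value, n_parties):
    # Split a private value into n random shares that sum to it.
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares  # any n-1 shares alone look like uniform noise

def reconstruct(shares):
    return sum(shares) % PRIME

# Three parties privately hold 10, 20, and 12; compute their sum.
all_shares = [share(v, 3) for v in (10, 20, 12)]
# Each party locally sums the one share it received from everyone...
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]
# ...and only the partial sums are combined, never the raw inputs.
total = reconstruct(partial_sums)
```

Only the final aggregate is revealed; no single party ever sees another party’s value.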
71. Three Foundational Challenges for the Mainstream Adoption of Machine Learning
Lowering the Technological Entry Point
• Can mainstream developers embrace machine learning stacks?
Talent Availability
• Can companies and governments nurture local data science talent?
Data Democratization
• Can rich datasets stop being a privilege of large corporations and governments?
72. Some Initiatives to Consider
Lowering the Technological Entry Point
• AutoML, low-code machine learning frameworks
Talent Availability
• Google AI Academy, Coursera, Udacity…
Data Democratization
• Decentralized AI platforms
73. Summary
• Implementing machine learning solutions in the real world remains
incredibly challenging
• There is a large gap between the advancements in AI research and the
practical viability of those techniques
• Machine learning applications require a new lifecycle different from
traditional software models
• Each aspect of that lifecycle brings a unique set of challenges
• Start small, iterate…