This document outlines 15 lessons learned from building large-scale machine learning systems in the real world. Some key challenges discussed include data scientists not being well-suited for engineering work, traditional development methodologies not working for machine learning, the difficulty of data labeling and feature extraction, and the complexities of training, executing, operationalizing, and securing machine learning models at scale. The document provides ideas to address these challenges such as establishing separate data science and engineering teams, implementing automated data labeling strategies, leveraging centralized feature stores, and adopting techniques like transfer learning and continual learning.
3. Agenda
• Myths and realities of machine learning solutions in the real world
• 15 lessons I learned when building large-scale machine learning systems
  • Challenge
  • What we learned
  • Solution
12. We are dealing with a new app lifecycle…
• Traditional app lifecycle: Design → Implementation → Deployment → Management/Monitoring
• Machine learning app lifecycle: Experimentation → Model Creation → Training → Testing → Regularization → Deployment → Monitoring → Optimization
14. The Aspects of a Machine Learning Solution that will Drive You Crazy
• Strategy & processes
• Data engineering
• Experimentation
• Model training
• Model operationalization
• Runtime execution
• Security
• Lifecycle management
• Optimization…
18. Challenges
• Data scientists are great at experimentation, not so much at writing high-quality code
• Deep learning frameworks built for experimentation don’t necessarily make great production frameworks, ex: PyTorch vs. TensorFlow
19. Some Ideas to Consider
• Divide data science and data engineering teams
• Data science team: write notebooks and experimentation models
• Engineering team: refactor or rewrite models for production environments; automate training and optimization jobs
• DevOps team: deploy models; monitor, retrain, and optimize models
21. Challenges
• Waterfall methods don’t work because you rarely know which machine learning methods will work for a specific problem
• Agile methods don’t work because you need very specific requirements
22. Some Ideas to Consider
• Split the development lifecycle into agile and waterfall iterations: Agile → Waterfall → Agile
24. Lesson #3: Feature extraction can become a reusability nightmare…
25. Challenges
• Different models require the same features from a dataset
• Feature extraction jobs are computationally expensive
• Different teams create proprietary ways to capture and store feature information
26. Some Ideas to Consider
• Implement a centralized feature store
• Leverage representation learning to extract relevant features from a dataset
• Look for reference architectures, ex: Uber’s Michelangelo
• Architecture: Dataset Preparation Jobs 1…N → Representation Learning Tasks → Feature Store → Models 1…N
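A minimal sketch of the centralized-feature-store idea, assuming an in-memory, dict-backed design; real systems such as Uber’s Michelangelo use dedicated storage and serve features both online and offline. The class and method names here are invented for illustration.

```python
# Hypothetical in-memory feature store: preparation jobs write
# features once, and any model reads them, avoiding duplicated
# extraction work across teams.

class FeatureStore:
    """Stores computed features keyed by (entity_id, feature_name)."""

    def __init__(self):
        self._features = {}

    def put(self, entity_id, feature_name, value):
        # Called by dataset-preparation jobs after feature extraction.
        self._features[(entity_id, feature_name)] = value

    def get(self, entity_id, feature_names):
        # Called by any model that needs the shared features.
        return {name: self._features[(entity_id, name)] for name in feature_names}


store = FeatureStore()
store.put("user-42", "avg_session_minutes", 12.5)
store.put("user-42", "purchases_30d", 3)

# Two different models can now reuse the same computed features.
features = store.get("user-42", ["avg_session_minutes", "purchases_30d"])
```

The key design point is that extraction runs once per feature, not once per model.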
28. Challenges
• Data experts spend a lot of time labeling datasets
• The logic for data labeling is often not reusable
• Subjective data labeling strategies fail to differentiate between useful and useless features
29. Some Ideas to Consider
• Implement an automated data labeling strategy
• Generative learning can help to structure more effective labels
• Project Snorkel is one of the leading automated data labeling frameworks in the market
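A hedged sketch of the Snorkel-style approach: several noisy "labeling functions" vote on each example and their votes are combined into a training label. Snorkel’s real label model is generative, not the plain majority vote used below, and the labeling functions here are invented for illustration.

```python
# Toy weak-supervision labeler: each labeling function returns a
# class (1 = complaint, 0 = not a complaint) or abstains with None.

ABSTAIN = None

def lf_contains_refund(text):
    return 1 if "refund" in text else ABSTAIN

def lf_contains_thanks(text):
    return 0 if "thanks" in text else ABSTAIN

def lf_exclamation(text):
    return 1 if "!" in text else ABSTAIN

def majority_label(text, lfs):
    # Combine the non-abstaining votes with a simple majority.
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_contains_refund, lf_contains_thanks, lf_exclamation]
label = majority_label("I want a refund now!", lfs)  # two LFs vote 1
```

Because the labeling logic lives in small reusable functions, it can be shared across teams instead of being re-implemented per project.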
32. Challenges
• Enterprises like to standardize on a single machine learning framework
• Different teams have different technology preferences
• Providing a consistent machine learning platform across different machine learning frameworks is no easy task
33. Some Ideas to Consider
• Optimize for productivity, not consistency
• Enable enough flexibility to leverage different frameworks for experimentation and production
• ONNX is a great solution for intermediate representations
• Pipeline: Experimentation Framework → Intermediate Representation → Production Framework
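A toy stand-in for what an intermediate representation such as ONNX provides: the experimentation framework exports a model as a framework-neutral artifact, and a separate production runtime re-executes it. ONNX itself uses protobuf and a rich operator set; this sketch only illustrates the decoupling idea, with JSON standing in for the IR.

```python
import json

# "Export" from the experimentation framework: a linear model
# y = w · x + b, serialized in a framework-neutral format.
exported = json.dumps({"op": "linear", "weights": [0.5, -1.0], "bias": 2.0})

def run_in_production(serialized_model, x):
    """A separate 'production runtime' that only understands the IR."""
    model = json.loads(serialized_model)
    assert model["op"] == "linear"
    return sum(w * xi for w, xi in zip(model["weights"], x)) + model["bias"]

y = run_in_production(exported, [2.0, 1.0])  # 0.5*2 - 1.0*1 + 2.0 = 2.0
```

The experimentation and production sides never share code, only the serialized representation, which is what lets teams pick different frameworks on each side.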
35. Challenges
• Notebooks are ideal for model experimentation and testing
• Notebooks typically have performance challenges when executed at scale
• Scaling notebook environments can be challenging
• Parametrizing notebook executions is far from trivial
36. Some Ideas To Consider
• Enable an infrastructure to operationalize data science notebooks
• Use containers for the most complex machine learning workflows
• Model experimentation: Jupyter, Zeppelin
• Scheduling notebooks: Papermill, Netflix’s Meson
• Running complex workflows: Docker containers, Kubernetes
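A rough sketch of what Papermill does when it parametrizes a notebook: it injects a parameters cell at the top and then executes the cells. Here the "notebook" is just a list of code strings run in a shared namespace; Papermill’s real entry point is `papermill.execute_notebook(input_path, output_path, parameters={...})`.

```python
# Toy parametrized-notebook runner, assuming cells are plain code
# strings; this only mimics Papermill's parameter injection step.

notebook_cells = [
    "result = learning_rate * epochs",  # a cell that uses parameters
]

def execute_notebook(cells, parameters):
    # Inject a "parameters cell" first, then run every cell in order
    # inside one shared namespace, as a notebook kernel would.
    namespace = {}
    injected = [f"{k} = {v!r}" for k, v in parameters.items()]
    for cell in injected + list(cells):
        exec(cell, namespace)
    return namespace

ns = execute_notebook(notebook_cells, {"learning_rate": 0.1, "epochs": 10})
```

Parameter injection is what turns an interactive notebook into a schedulable, reusable job.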
38. Challenges
• Data scientists make very subjective decisions when it comes to model selection
• The same problem can be solved using different machine learning models
• Very often it is almost impossible to differentiate between similar models
39. Some Ideas To Consider
• Represent machine learning requirements as a dataset with an objective attribute
• Leverage AutoML-based techniques for model selection
• Flow: Problem dataset → AutoML → Proposed models
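A hedged sketch of AutoML-style model selection: given a dataset with an objective attribute, evaluate several candidate models on a holdout split and propose the best one by an objective metric. Real AutoML systems also search hyperparameters and architectures; the two toy candidates below are invented for illustration.

```python
# Two trivial candidate "models" trained on the objective attribute y.
def mean_model(train_y):
    mean = sum(train_y) / len(train_y)
    return lambda x: mean

def last_value_model(train_y):
    last = train_y[-1]
    return lambda x: last

def mse(model, xs, ys):
    # The objective metric: mean squared error on held-out data.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [0, 1, 2, 3]
ys = [1.0, 3.0, 2.0, 2.0]          # y is the objective attribute
train_y = ys[:2]                    # fit on the first half...
test_x, test_y = xs[2:], ys[2:]     # ...score on the second half

candidates = {
    "mean": mean_model(train_y),
    "last": last_value_model(train_y),
}
best_name = min(candidates, key=lambda n: mse(candidates[n], test_x, test_y))
```

Selecting by a measured objective replaces the subjective "I prefer this model" decision the challenge slide describes.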
42. Challenges
• The No Free Lunch Theorem
• Trained models can perform poorly against new datasets
• New engineers and DevOps need to understand how to re-train existing models
43. Some Ideas to Consider
• Automate training jobs
• Orchestrate scheduled execution of training jobs
• Architecture: Data Lake → Data Outcomes/Feature Store → Training Jobs 1…N
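A minimal sketch of orchestrating scheduled training jobs, using Python’s standard-library scheduler; production systems would typically use cron, Airflow, or a workflow engine instead, and the job body here is a placeholder.

```python
import sched
import time

trained_models = []

def training_job(job_id):
    # Placeholder: pull data from the feature store and fit a model.
    trained_models.append(f"model-from-job-{job_id}")

scheduler = sched.scheduler(time.monotonic, time.sleep)

# Schedule three training jobs a few milliseconds apart.
for job_id, delay in enumerate([0.0, 0.01, 0.02], start=1):
    scheduler.enter(delay, 1, training_job, argument=(job_id,))

scheduler.run()  # blocks until all scheduled jobs have executed
```

The point is that retraining becomes a recurring, automated event rather than a manual task a new engineer has to rediscover.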
45. Challenges
• Training machine learning models can be computationally expensive
• Most machine learning models need to be retrained entirely when new data arrives
• It’s nearly impossible to quantify the impact that new datasets have on the performance of a model
46. Some Ideas to Consider
• Implement continual learning models
• Consider transfer learning as a fundamental enabler
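A hedged sketch of the continual-learning idea: instead of retraining from scratch when new data arrives, the model state is updated incrementally. Here the "model" is just a running mean so the update rule is exact; real continual learning updates network weights, and transfer learning reuses pretrained weights as the starting point.

```python
class RunningMeanModel:
    """Toy model whose state can absorb new data incrementally."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, batch):
        # Incremental update: old data never needs to be revisited,
        # which is what avoids the full, expensive retraining pass.
        for y in batch:
            self.count += 1
            self.mean += (y - self.mean) / self.count

    def predict(self):
        return self.mean

model = RunningMeanModel()
model.update([1.0, 2.0, 3.0])   # initial training data
model.update([4.0, 5.0])        # new data arrives later
```

After both updates the model equals one trained on all five points at once, without reprocessing the original batch.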
48. Challenges
• Data engineers spend a lot of time writing training routines for machine learning models
• Comparing the performance of different models on the same datasets remains tricky
• Changes to a training dataset often imply changes to the training code
49. Some Ideas to Consider
• Explore a configuration-driven training process
• Uber’s Ludwig is an innovative, no-code framework for training machine learning models
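A hedged sketch of configuration-driven training in the spirit of Uber’s Ludwig: a declarative config, rather than code, names the input and output features, so dataset changes become config changes. The config schema and trivial "trainer" below are invented for illustration and are not Ludwig’s actual format.

```python
# Hypothetical declarative training config: no training code needed
# from the user, only a description of the data.
config = {
    "input_features": ["rooms", "area"],
    "output_feature": "price",
}

def train_from_config(config, rows):
    # A deliberately trivial trainer: predict the output feature by
    # its mean. A real framework would build a full model from the
    # declared feature types.
    targets = [row[config["output_feature"]] for row in rows]
    prediction = sum(targets) / len(targets)
    return {"predict": lambda row: prediction}

rows = [
    {"rooms": 2, "area": 50, "price": 100.0},
    {"rooms": 3, "area": 70, "price": 200.0},
]
model = train_from_config(config, rows)
```

Swapping the dataset or the target column only touches `config`, which addresses the "dataset changes imply code changes" challenge above.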
52. Challenges
• Not all models can be executed via APIs
• Some models take a long time to run
• In some scenarios, different models need to be executed at the same time based on a specific condition
53. Some Ideas to Consider
• Enable different execution modes based on the client’s requirements: scheduled activation, pub-sub activation, and on-demand activation
• Route model execution through a model API gateway and an event gateway
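A sketch of the pub-sub activation mode, assuming a minimal in-memory event gateway; a production deployment would use a broker such as Kafka, and the models and topic names here are invented for illustration.

```python
class EventGateway:
    """Toy event gateway: publishing an event activates every model
    subscribed to that event type."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, model_fn):
        self._subscribers.setdefault(topic, []).append(model_fn)

    def publish(self, topic, payload):
        # Run all subscribed models on the event payload.
        return [model(payload) for model in self._subscribers.get(topic, [])]

gateway = EventGateway()
# Two models react to the same condition, as the challenge describes.
gateway.subscribe("new-transaction", lambda tx: ("fraud-score", tx["amount"] / 10))
gateway.subscribe("new-transaction", lambda tx: ("limit-check", tx["amount"] < 500))

results = gateway.publish("new-transaction", {"amount": 100})
```

The same models could also be fronted by an API gateway for on-demand calls or a scheduler for batch activation; only the trigger differs.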
55. Challenges
• Centralized cloud deep learning models don’t scale
• On-device deep learning models are hard to distribute and train
• Tons of privacy challenges
56. Some Ideas to Consider
• Consider using federated learning or similar patterns for mobile-based machine learning
60. Some Ideas to Consider
• Establish systematic practices to debug machine learning models
• Onboard model visualization and interpretability tools
• Visualize the network and its results: use tools like TensorBoard to visualize the structure of neural networks
• Compare training and test errors: high training error is a sign of underfitting; high test error with low training error is a sign of overfitting
• Test with small datasets: helps to determine whether the error is in the code or in the data
• Monitor activations and gradient values: monitor the number of activations in hidden units
• Interpretability: understand how nodes are activated, what hidden layers do, and how concepts are formed
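The "compare training and test errors" practice can be sketched as a tiny diagnostic rule; the threshold below is illustrative, not universal, and real debugging would look at learning curves rather than two point estimates.

```python
def diagnose(train_error, test_error, threshold=0.1):
    # High training error: the model cannot even fit the data it saw.
    if train_error > threshold:
        return "underfitting"
    # Low training error but high test error: the model memorized
    # the training data instead of generalizing.
    if test_error > threshold:
        return "overfitting"
    return "ok"

print(diagnose(train_error=0.30, test_error=0.32))  # underfitting
print(diagnose(train_error=0.02, test_error=0.25))  # overfitting
```

Making this check an explicit, automated step is one way to turn model debugging into a systematic practice rather than intuition.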
63. Challenges
• Most neural networks are vulnerable to adversarial attacks
• Attackers don’t need access to the models; they can simply manipulate input datasets
• Most of the time, adversarial attacks go undetected
64. Some Ideas to Consider
• Test your neural networks for adversarial robustness
• IBM’s Adversarial Robustness Toolbox is one of the leading stacks in neural network security
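A hedged sketch of an adversarial robustness check on a toy linear classifier, using an FGSM-style perturbation (move each input dimension against the sign of its weight). IBM’s Adversarial Robustness Toolbox provides real attacks of this family for deep networks; everything below is a from-scratch illustration.

```python
def predict(w, x):
    # Toy linear classifier: sign of the weighted sum.
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1

def fgsm_like(w, x, eps):
    # Perturb each dimension by eps against the weight's sign, which
    # maximally lowers the score within an L-infinity budget of eps.
    sign = lambda v: 1 if v > 0 else (-1 if v < 0 else 0)
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w, x = [2.0, -1.0], [1.0, 1.0]
clean_pred = predict(w, x)                        # score 1.0 -> class 1
adv_pred = predict(w, fgsm_like(w, x, eps=0.5))   # score -0.5 -> class -1
robust = clean_pred == adv_pred                   # False: prediction flips
```

A robustness test suite would sweep `eps` and report the smallest perturbation that flips each prediction; note the attacker never needed the training data, only the input.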
65. Lesson #15: Data privacy is the elephant in the machine learning room
66. Challenges
• Machine learning models intrinsically build knowledge about private datasets
• Most machine learning techniques require clear access to data which, in many cases, contains sensitive information
• There are no established techniques for evaluating the privacy robustness of machine learning models
67. Some Ideas to Consider
• Private machine learning is an emerging area of research
• Leverage techniques such as secure multi-party computation or zero-knowledge proofs to obfuscate training datasets
• PySyft is an emerging framework to enable privacy in machine learning models
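A hedged sketch of additive secret sharing, one building block of secure multi-party computation: each party splits its private value into random shares so an aggregate (here, a sum) can be computed without any party revealing its input. Frameworks such as PySyft implement production-grade versions of this idea; the code below is a toy illustration.

```python
import random

PRIME = 2_147_483_647  # all arithmetic is done modulo a large prime

def share(value, n_parties):
    # Split a private value into n random shares that sum to it.
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares  # any n-1 shares alone look like uniform noise

def reconstruct(shares):
    return sum(shares) % PRIME

# Three parties privately hold 10, 20, and 12; compute their sum.
all_shares = [share(v, 3) for v in (10, 20, 12)]
# Each party locally sums the one share it received from everyone...
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]
# ...and only the partial sums are combined, never the raw inputs.
total = reconstruct(partial_sums)
```

Only the final aggregate is revealed; no single party ever sees another party’s value.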
71. Three Foundational Challenges for the Mainstream Adoption of Machine Learning
Lowering the Technological Entry Point
• Can mainstream developers embrace machine learning stacks?
Talent Availability
• Can companies and governments nurture local data science talent?
Data Democratization
• Can rich datasets stop being a privilege of large corporations and governments?
72. Some Initiatives to Consider
Lowering the Technological Entry Point
• AutoML, low-code machine learning frameworks
Talent Availability
• Google AI Academy, Coursera, Udacity…
Data Democratization
• Decentralized AI platforms
73. Summary
• Implementing machine learning solutions in the real world remains
incredibly challenging
• There is a large gap between the advancements in AI research and the
practical viability of those techniques
• Machine learning applications require a new lifecycle different from
traditional software models
• Each aspect of that lifecycle brings a unique set of challenges
• Start small, iterate…