Fabric for Deep Learning
1. Fabric for Deep Learning
FfDL
FfDL Github Page
https://github.com/IBM/FfDL
FfDL dwOpen Page
https://developer.ibm.com/code/open/projects/
fabric-for-deep-learning-ffdl/
FfDL Announcement Blog
http://developer.ibm.com/code/2018/03/20/
fabric-for-deep-learning
FfDL Technical Architecture Blog
http://developer.ibm.com/code/2018/03/20/
democratize-ai-with-fabric-for-deep-learning
Deep Learning as a Service within Watson Studio
https://www.ibm.com/cloud/deep-learning
Research paper: "Scalable Multi-Framework Management of Deep Learning Training Jobs"
http://learningsys.org/nips17/assets/papers/
paper_29.pdf
Animesh Singh, Tommy Li
@AnimeshSingh
@Tomipli
https://github.com/IBM/FfDL
2. The Enterprise AI Process
Use data to build models that automate decisions.
Gather Data → Analyze Data → Machine Learning / Deep Learning → Deploy Model → Maintain Model
5. Deep Learning Has Revolutionized Machine Learning
[Figure: model accuracy vs. amount of data, comparing deep learning with traditional machine learning.]
[Figure: number of searches for "Deep Learning" from 2011 to 2017. Source: Google Trends, search term "Deep Learning".]
29. Neural Network Design Workflow
[Diagram: domain data feeds the design of a neural network; HPO searches over the neural network structure and hyperparameters to find optimal hyperparameters; if performance does not meet needs, start another experiment.]
30. Neural Network Design Workflow (continued)
[Diagram: as above, with the YES branch filled in: once performance meets needs, the trained model with optimal hyperparameters is evaluated (bad vs. still good) and deployed to the cloud.]
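The loop in the workflow above (propose a network structure and hyperparameters, train, check whether performance meets needs, and start another experiment if not) can be sketched as a simple random search. Everything here is illustrative and not part of FfDL: the search space, the synthetic scoring function standing in for a real training run, and the stopping threshold are all invented for the sketch.

```python
import random

def train_and_evaluate(learning_rate, hidden_units):
    """Stand-in for a real training run: returns a validation score.
    This synthetic score peaks near learning_rate=0.01 and 64 units."""
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(hidden_units - 64) / 256

def random_search(n_trials=20, target=0.9, seed=0):
    """Propose hyperparameters, 'train', and stop early once performance
    meets needs; otherwise start another experiment."""
    rng = random.Random(seed)
    best = {"score": float("-inf"), "params": None}
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),        # log-uniform
            "hidden_units": rng.choice([16, 32, 64, 128, 256]),
        }
        score = train_and_evaluate(**params)
        if score > best["score"]:
            best = {"score": score, "params": params}
        if best["score"] >= target:  # performance meets needs -> done
            break
    return best

best = random_search()
print(best["params"], round(best["score"], 3))
```

In a real HPO run, `train_and_evaluate` would instead submit a training job and read back its evaluation metrics.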
31. Introducing Fabric for Deep Learning
FfDL (pronounced "fiddle")
A multi-framework approach to deep learning, on your own cloud
32. Fabric for Deep Learning
https://github.com/IBM/FfDL
FfDL provides a scalable, resilient, and fault-tolerant deep-learning framework.
• Fabric for Deep Learning, or FfDL (pronounced "fiddle"), is an open source project that aims to make deep learning easily accessible to the people it matters to most: data scientists and AI developers.
• FfDL provides a consistent way to deploy, train, and visualize deep learning jobs across multiple frameworks, such as TensorFlow, Caffe, PyTorch, and Keras.
• FfDL is being developed in close collaboration with IBM Research and IBM Watson, and forms the core of Watson's Deep Learning service in open source.
33. Fabric for Deep Learning
https://github.com/IBM/FfDL
FfDL is built using a microservices architecture on Kubernetes.
• The FfDL platform uses a microservices architecture to offer resilience, scalability, multi-tenancy, and security without modifying the deep learning frameworks, and with no or minimal changes to model code.
• FfDL control-plane microservices are deployed as pods on Kubernetes to manage the cluster of GPU- and CPU-enabled machines effectively.
• Tested platforms: Minikube, IBM Cloud Public, IBM Cloud Private, and GPUs using both the Kubernetes Accelerators feature gate and NVIDIA device plugins.
39. And we offer more:
Model Asset Exchange (MAX)
and
Adversarial Robustness Toolbox (ART)
40. IBM Model Asset eXchange (MAX)
MAX is a one-stop exchange to find ML/DL models created using popular machine learning engines, and it provides a standardized approach to consuming these models for training and inferencing.
developer.ibm.com/code/exchanges/models/
41. IBM Adversarial Robustness Toolbox (ART)
ART is a library dedicated to adversarial machine learning. Its purpose is to allow rapid crafting and analysis of attack and defense methods for machine learning models. The Adversarial Robustness Toolbox provides implementations of many state-of-the-art methods for attacking and defending classifiers.
https://developer.ibm.com/code/open/projects/adversarial-robustness-toolbox/
The Adversarial Robustness Toolbox contains implementations of the following attacks:
• DeepFool (Moosavi-Dezfooli et al., 2015)
• Fast Gradient Method (Goodfellow et al., 2014)
• Jacobian Saliency Map (Papernot et al., 2016)
• Universal Perturbation (Moosavi-Dezfooli et al., 2016)
• Virtual Adversarial Method (Miyato et al., 2015)
• C&W Attack (Carlini and Wagner, 2016)
• NewtonFool (Jang et al., 2017)
The following defense methods are also supported:
• Feature squeezing (Xu et al., 2017)
• Spatial smoothing (Xu et al., 2017)
• Label smoothing (Warde-Farley and Goodfellow, 2016)
• Adversarial training (Szegedy et al., 2013)
• Virtual adversarial training (Miyato et al., 2017)
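To make the attack side concrete, here is a minimal from-scratch sketch of the Fast Gradient (Sign) Method against a toy logistic classifier. This is not ART's API, just the core idea of stepping the input in the direction of the sign of the loss gradient; the weights, input point, and epsilon are made up for illustration.

```python
import numpy as np

# Toy logistic "classifier": p(y=1|x) = sigmoid(w.x + b)
w = np.array([2.0, -1.0])
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return sigmoid(x @ w + b)

def fgsm(x, y, eps):
    """Fast Gradient Sign Method: one step in the direction that
    increases the loss. For logistic loss, dL/dx = (p - y) * w."""
    grad = (predict(x) - y) * w
    return x + eps * np.sign(grad)

x = np.array([1.0, 0.5])   # classified as class 1 (p > 0.5)
y = 1.0
x_adv = fgsm(x, y, eps=0.6)
print(predict(x), predict(x_adv))
```

With eps = 0.6 the perturbed point crosses the decision boundary, so the model's prediction on `x_adv` flips to class 0.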
43. Watson Studio: Model Lifecycle Management
Tools for supporting the end-to-end AI workflow:
• Authoring Tools: create, collaborate, deploy, and monitor; best-of-breed open source and IBM tools; code (R, Python, or Scala) and no-code/visual modeling tools
• Machine Learning and Deep Learning Runtimes: the most popular open source frameworks, plus IBM best-in-class frameworks
• Cloud Infrastructure as a Service: fully managed service; container-based resource management; elastic pay-as-you-go CPU/GPU power
44. Deep Learning as a Service within Watson Studio, using FfDL as its core
• Train neural networks in parallel across NVIDIA GPUs. Pay only for what you use: auto-deallocation means no more remembering to shut down your cloud training instances.
• Monitor batch training experiments, then compare cross-model performance without worrying about log transfers and scripts to visualize results. You focus on designing your neural networks; we'll manage and track your assets.
• Python client, command line interface (CLI), or UI? You choose the tooling that best fits your existing workflows.
• Training history and assets are tracked, then automatically transferred to the customer's Object Storage for quick access.
• Deploy models into production, then monitor them to evaluate performance. Capture new data for continuous learning, and retrain models so they continually adapt to changing conditions.
45. Neural Network Modeller within Watson Studio
An intuitive drag-and-drop, no-code interface for designing neural network structure
48. FfDL: Current Release
[Architecture diagram: clients (browser, CLIs, SDKs) call a REST API that fronts the Trainer Service (backed by MongoDB). The Trainer calls the Lifecycle Manager (backed by etcd), which launches a training job as learner pods, each containing a parameter server, learners (e.g. TensorFlow, Caffe, PyTorch, Keras), a controller, a log collector, and a job monitor. The Training Data Service feeds an ELK stack; Prometheus with a push gateway and alert manager provides monitoring; model definitions, training data, and trained models live in Object Storage; a Web UI is used to launch training jobs.]
49. FfDL: Current Release (REST API)
REST API
• The REST API microservice handles REST-level HTTP requests and acts as a proxy to the lower-level gRPC Trainer service.
• The service also load-balances requests and is responsible for authentication. Load balancing is implemented by registering the REST API service instances dynamically in a service registry.
• The interface is specified through a Swagger definition file.
50. FfDL: Current Release (Trainer Service)
Trainer
• The Trainer service admits training job requests, persisting metadata and model input configuration in a database (MongoDB).
• It initiates job deployment, halting, and (user-requested) job termination by calling the appropriate gRPC methods on the Lifecycle Manager microservice.
• The Trainer also assigns a unique identifier to each job, which is used by all other components to track the job.
• The data can also be used for billing/chargeback purposes.
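The "model input configuration" a job request carries is expressed as a YAML manifest in FfDL. The sketch below follows the general shape of the sample manifests in the FfDL repository, but the field names and values here are written from memory and for illustration only; consult the repository's example manifests for the authoritative schema.

```yaml
name: tf-mnist-sample           # job name (illustrative)
description: Sample TensorFlow training job
version: "1.0"
gpus: 0
cpus: 1
memory: 1Gb
learners: 1
data_stores:                    # where training data comes from and results go
  - id: sl-internal-os
    type: mount_cos
    training_data:
      container: mnist_training_data
    training_results:
      container: mnist_training_results
    connection:
      auth_url: https://s3.example.com   # illustrative endpoint
      user_name: <access-key>
      password: <secret-key>
framework:                      # which deep learning framework runs the job
  name: tensorflow
  version: "1.5"
  command: python3 convolutional_network.py
```

A job is then submitted with a manifest like this plus an archive of the model code, for example through the FfDL CLI or the REST API.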
51. FfDL: Current Release (Lifecycle Manager)
Lifecycle Manager
• The Lifecycle Manager (LCM) deploys training jobs arriving from the Trainer, and handles halting (pausing) and terminating training jobs.
• The LCM uses the Kubernetes cluster manager to deploy containerized training jobs.
• A training job is a set of interconnected Kubernetes pods, each containing one or more Docker containers.
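Since a training job is just a set of Kubernetes pods, a single learner might be described by a pod spec along these lines. This is a hypothetical, simplified spec for illustration (the names, images, labels, and resource values are invented), not FfDL's actual deployment templates.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: learner-0
  labels:
    training_id: training-abc123    # job identifier assigned by the Trainer
spec:
  restartPolicy: Never
  containers:
    - name: learner                 # runs the framework, e.g. TensorFlow
      image: tensorflow/tensorflow:1.5.0
      command: ["python", "model.py"]
      resources:
        limits:
          nvidia.com/gpu: 2         # GPUs requested for this learner
    - name: log-collector           # sidecar that extracts logs and metrics
      image: example/log-collector:latest   # hypothetical image
```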
52. FfDL: Current Release (Training Jobs and Learner Pods)
Training Jobs - Learner Pods
• The LCM determines the learner pods, the parameter servers, and the interconnections among them based on the job configuration, and calls on Kubernetes for deployment.
• For example, if a user creates a TensorFlow training job with four learners and two CPUs/GPUs per learner, the LCM creates five pods: one for each learner (the learner pods), and one monitoring pod called the job monitor.
• As the training job progresses, information is needed to evaluate the ongoing success or failure of the learning progress. These metrics normally come in the form of scalar values, and are termed evaluation metrics.
[Inset: a training job's learner pod contains the parameter server, the learners (e.g. TensorFlow, Caffe, PyTorch, Keras), a controller, and a log collector, alongside the job monitor.]
53. FfDL: Current Release (Training Data Service)
Training Data Service
• The Training Data Service (TDS) provides short-lived storage and retrieval for logs and evaluation data from a deep learning training job.
• While the learning job is running, a process runs as a sidecar to extract the training data from the learner, and then pushes that data into the TDS, which in turn pushes the data into Elasticsearch.
• The sidecars used for collecting training data are termed log-collectors. Depending on the framework and the desired extraction method, different types of log-collectors can be used. A log-collector's responsibilities include at least both log-line collection and evaluation-metrics extraction.
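The metrics-extraction half of a log-collector's job can be sketched as a small parser that scans raw log lines for scalar values and turns them into records for the TDS. The log format and regular expression here are hypothetical; real log-collectors are framework-specific.

```python
import re

# Hypothetical log format; real log-collectors are framework-specific.
LINE_RE = re.compile(r"step\s+(\d+).*?loss[:=]\s*([0-9.]+)")

def extract_metrics(log_lines):
    """Scan raw training-log lines and emit (step, loss) scalar records,
    the kind of evaluation metrics a log-collector pushes to the TDS."""
    records = []
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            records.append({"step": int(m.group(1)),
                            "loss": float(m.group(2))})
    return records

logs = [
    "2018-03-20 12:00:01 step 100 loss: 1.8321",
    "checkpoint saved to /tmp/model.ckpt",    # no metric on this line
    "2018-03-20 12:00:09 step 200 loss: 1.2104",
]
records = extract_metrics(logs)
print(records)
```

Lines without a recognizable metric (like the checkpoint message) are simply skipped, so only scalar evaluation metrics reach the TDS.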
58. THANK YOU!
FfDL
https://github.com/IBM/FfDL