Fabric for Deep Learning
1. Fabric for Deep Learning
FfDL
FfDL Github Page
https://github.com/IBM/FfDL
FfDL dwOpen Page
https://developer.ibm.com/code/open/projects/
fabric-for-deep-learning-ffdl/
FfDL Announcement Blog
http://developer.ibm.com/code/2018/03/20/
fabric-for-deep-learning
FfDL Technical Architecture Blog
http://developer.ibm.com/code/2018/03/20/
democratize-ai-with-fabric-for-deep-learning
Deep Learning as a Service within Watson Studio
https://www.ibm.com/cloud/deep-learning
Research paper: "Scalable Multi-Framework Management of Deep Learning Training Jobs"
http://learningsys.org/nips17/assets/papers/
paper_29.pdf
Animesh Singh, Tommy Li
@AnimeshSingh
@Tomipli
https://github.com/IBM/FfDL
2. The Enterprise AI Process
Use data to build models that automate decisions.
Gather Data → Analyze Data → Machine Learning / Deep Learning → Deploy Model → Maintain Model
5. Deep Learning Has Revolutionized Machine Learning
[Figure: model accuracy vs. amount of data, comparing deep learning with traditional machine learning.]
[Figure: number of searches for "Deep Learning" from 2011 to 2017. Source: Google Trends, search term "Deep Learning".]
29. Neural Network Design Workflow
[Diagram: domain data feeds the design of a neural network; HPO searches over the neural network structure and hyperparameters to find optimal hyperparameters; if performance does not meet needs, start another experiment.]
30. Neural Network Design Workflow (continued)
[Diagram: as above, with the YES branch filled in: once performance meets needs, the trained model with optimal hyperparameters is evaluated (bad vs. still good) and deployed to the cloud.]
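The loop in the workflow above (propose a network structure and hyperparameters, train, check whether performance meets needs, and start another experiment if not) can be sketched as a simple random search. Everything here is illustrative and not part of FfDL: the search space, the synthetic scoring function standing in for a real training run, and the stopping threshold are all invented for the sketch.

```python
import random

def train_and_evaluate(learning_rate, hidden_units):
    """Stand-in for a real training run: returns a validation score.
    This synthetic score peaks near learning_rate=0.01 and 64 units."""
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(hidden_units - 64) / 256

def random_search(n_trials=20, target=0.9, seed=0):
    """Propose hyperparameters, 'train', and stop early once performance
    meets needs; otherwise start another experiment."""
    rng = random.Random(seed)
    best = {"score": float("-inf"), "params": None}
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),        # log-uniform
            "hidden_units": rng.choice([16, 32, 64, 128, 256]),
        }
        score = train_and_evaluate(**params)
        if score > best["score"]:
            best = {"score": score, "params": params}
        if best["score"] >= target:  # performance meets needs -> done
            break
    return best

best = random_search()
print(best["params"], round(best["score"], 3))
```

In a real HPO run, `train_and_evaluate` would instead submit a training job and read back its evaluation metrics.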
31. Introducing Fabric for Deep Learning
FfDL (pronounced "fiddle")
A multi-framework approach to deep learning, on your own cloud
32. Fabric for Deep Learning
https://github.com/IBM/FfDL
FfDL provides a scalable, resilient, and fault-tolerant deep-learning framework.
• Fabric for Deep Learning, or FfDL (pronounced "fiddle"), is an open source project that aims to make deep learning easily accessible to the people it matters to most: data scientists and AI developers.
• FfDL provides a consistent way to deploy, train, and visualize deep learning jobs across multiple frameworks, such as TensorFlow, Caffe, PyTorch, and Keras.
• FfDL is being developed in close collaboration with IBM Research and IBM Watson, and forms the core of Watson's Deep Learning service in open source.
33. Fabric for Deep Learning
https://github.com/IBM/FfDL
FfDL is built using a microservices architecture on Kubernetes.
• The FfDL platform uses a microservices architecture to offer resilience, scalability, multi-tenancy, and security without modifying the deep learning frameworks, and with no or minimal changes to model code.
• FfDL control-plane microservices are deployed as pods on Kubernetes to manage the cluster of GPU- and CPU-enabled machines effectively.
• Tested platforms: Minikube, IBM Cloud Public, IBM Cloud Private, and GPUs using both the Kubernetes Accelerators feature gate and NVIDIA device plugins.
39. And we offer more:
Model Asset Exchange (MAX)
and
Adversarial Robustness Toolbox (ART)
40. IBM Model Asset eXchange (MAX)
MAX is a one-stop exchange to find ML/DL models created using popular machine learning engines, and it provides a standardized approach to consuming these models for training and inferencing.
developer.ibm.com/code/exchanges/models/
41. IBM Adversarial Robustness Toolbox (ART)
ART is a library dedicated to adversarial machine learning. Its purpose is to allow rapid crafting and analysis of attack and defense methods for machine learning models. The Adversarial Robustness Toolbox provides implementations of many state-of-the-art methods for attacking and defending classifiers.
https://developer.ibm.com/code/open/projects/adversarial-robustness-toolbox/
The Adversarial Robustness Toolbox contains implementations of the following attacks:
• DeepFool (Moosavi-Dezfooli et al., 2015)
• Fast Gradient Method (Goodfellow et al., 2014)
• Jacobian Saliency Map (Papernot et al., 2016)
• Universal Perturbation (Moosavi-Dezfooli et al., 2016)
• Virtual Adversarial Method (Miyato et al., 2015)
• C&W Attack (Carlini and Wagner, 2016)
• NewtonFool (Jang et al., 2017)
The following defense methods are also supported:
• Feature squeezing (Xu et al., 2017)
• Spatial smoothing (Xu et al., 2017)
• Label smoothing (Warde-Farley and Goodfellow, 2016)
• Adversarial training (Szegedy et al., 2013)
• Virtual adversarial training (Miyato et al., 2017)
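To make the attack side concrete, here is a minimal from-scratch sketch of the Fast Gradient (Sign) Method against a toy logistic classifier. This is not ART's API, just the core idea of stepping the input in the direction of the sign of the loss gradient; the weights, input point, and epsilon are made up for illustration.

```python
import numpy as np

# Toy logistic "classifier": p(y=1|x) = sigmoid(w.x + b)
w = np.array([2.0, -1.0])
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return sigmoid(x @ w + b)

def fgsm(x, y, eps):
    """Fast Gradient Sign Method: one step in the direction that
    increases the loss. For logistic loss, dL/dx = (p - y) * w."""
    grad = (predict(x) - y) * w
    return x + eps * np.sign(grad)

x = np.array([1.0, 0.5])   # classified as class 1 (p > 0.5)
y = 1.0
x_adv = fgsm(x, y, eps=0.6)
print(predict(x), predict(x_adv))
```

With eps = 0.6 the perturbed point crosses the decision boundary, so the model's prediction on `x_adv` flips to class 0.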
43. Watson Studio: Model Lifecycle Management
Tools for supporting the end-to-end AI workflow:
• Authoring Tools: create, collaborate, deploy, and monitor; best-of-breed open source and IBM tools; code (R, Python, or Scala) and no-code/visual modeling tools
• Machine Learning and Deep Learning Runtimes: the most popular open source frameworks, plus IBM best-in-class frameworks
• Cloud Infrastructure as a Service: fully managed service; container-based resource management; elastic pay-as-you-go CPU/GPU power
44. Deep Learning as a Service within Watson Studio, using FfDL as its core
• Train neural networks in parallel across NVIDIA GPUs. Pay only for what you use: auto-deallocation means no more remembering to shut down your cloud training instances.
• Monitor batch training experiments, then compare cross-model performance without worrying about log transfers and scripts to visualize results. You focus on designing your neural networks; we'll manage and track your assets.
• Python client, command line interface (CLI), or UI? You choose the tooling that best fits your existing workflows.
• Training history and assets are tracked, then automatically transferred to the customer's Object Storage for quick access.
• Deploy models into production, then monitor them to evaluate performance. Capture new data for continuous learning, and retrain models so they continually adapt to changing conditions.
45. Neural Network Modeller within Watson Studio
An intuitive drag-and-drop, no-code interface for designing neural network structure
48. FfDL: Current Release
[Architecture diagram: clients (browser, CLIs, SDKs) call a REST API that fronts the Trainer Service (backed by MongoDB). The Trainer calls the Lifecycle Manager (backed by etcd), which launches a training job as learner pods, each containing a parameter server, learners (e.g. TensorFlow, Caffe, PyTorch, Keras), a controller, a log collector, and a job monitor. The Training Data Service feeds an ELK stack; Prometheus with a push gateway and alert manager provides monitoring; model definitions, training data, and trained models live in Object Storage; a Web UI is used to launch training jobs.]
49. FfDL: Current Release (REST API)
REST API
• The REST API microservice handles REST-level HTTP requests and acts as a proxy to the lower-level gRPC Trainer service.
• The service also load-balances requests and is responsible for authentication. Load balancing is implemented by registering the REST API service instances dynamically in a service registry.
• The interface is specified through a Swagger definition file.
50. FfDL: Current Release (Trainer Service)
Trainer
• The Trainer service admits training job requests, persisting metadata and model input configuration in a database (MongoDB).
• It initiates job deployment, halting, and (user-requested) job termination by calling the appropriate gRPC methods on the Lifecycle Manager microservice.
• The Trainer also assigns a unique identifier to each job, which is used by all other components to track the job.
• The data can also be used for billing/chargeback purposes.
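The "model input configuration" a job request carries is expressed as a YAML manifest in FfDL. The sketch below follows the general shape of the sample manifests in the FfDL repository, but the field names and values here are written from memory and for illustration only; consult the repository's example manifests for the authoritative schema.

```yaml
name: tf-mnist-sample           # job name (illustrative)
description: Sample TensorFlow training job
version: "1.0"
gpus: 0
cpus: 1
memory: 1Gb
learners: 1
data_stores:                    # where training data comes from and results go
  - id: sl-internal-os
    type: mount_cos
    training_data:
      container: mnist_training_data
    training_results:
      container: mnist_training_results
    connection:
      auth_url: https://s3.example.com   # illustrative endpoint
      user_name: <access-key>
      password: <secret-key>
framework:                      # which deep learning framework runs the job
  name: tensorflow
  version: "1.5"
  command: python3 convolutional_network.py
```

A job is then submitted with a manifest like this plus an archive of the model code, for example through the FfDL CLI or the REST API.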
51. FfDL: Current Release (Lifecycle Manager)
Lifecycle Manager
• The Lifecycle Manager (LCM) deploys training jobs arriving from the Trainer, and handles halting (pausing) and terminating training jobs.
• The LCM uses the Kubernetes cluster manager to deploy containerized training jobs.
• A training job is a set of interconnected Kubernetes pods, each containing one or more Docker containers.
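Since a training job is just a set of Kubernetes pods, a single learner might be described by a pod spec along these lines. This is a hypothetical, simplified spec for illustration (the names, images, labels, and resource values are invented), not FfDL's actual deployment templates.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: learner-0
  labels:
    training_id: training-abc123    # job identifier assigned by the Trainer
spec:
  restartPolicy: Never
  containers:
    - name: learner                 # runs the framework, e.g. TensorFlow
      image: tensorflow/tensorflow:1.5.0
      command: ["python", "model.py"]
      resources:
        limits:
          nvidia.com/gpu: 2         # GPUs requested for this learner
    - name: log-collector           # sidecar that extracts logs and metrics
      image: example/log-collector:latest   # hypothetical image
```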
52. FfDL: Current Release (Training Jobs and Learner Pods)
Training Jobs - Learner Pods
• The LCM determines the learner pods, the parameter servers, and the interconnections among them based on the job configuration, and calls on Kubernetes for deployment.
• For example, if a user creates a TensorFlow training job with four learners and two CPUs/GPUs per learner, the LCM creates five pods: one for each learner (the learner pods), and one monitoring pod called the job monitor.
• As the training job progresses, information is needed to evaluate the ongoing success or failure of the learning progress. These metrics normally come in the form of scalar values, and are termed evaluation metrics.
[Inset: a training job's learner pod contains the parameter server, the learners (e.g. TensorFlow, Caffe, PyTorch, Keras), a controller, and a log collector, alongside the job monitor.]
53. FfDL: Current Release (Training Data Service)
Training Data Service
• The Training Data Service (TDS) provides short-lived storage and retrieval for logs and evaluation data from a deep learning training job.
• While the learning job is running, a process runs as a sidecar to extract the training data from the learner, and then pushes that data into the TDS, which in turn pushes the data into Elasticsearch.
• The sidecars used for collecting training data are termed log-collectors. Depending on the framework and the desired extraction method, different types of log-collectors can be used. A log-collector's responsibilities include at least both log-line collection and evaluation-metrics extraction.
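The metrics-extraction half of a log-collector's job can be sketched as a small parser that scans raw log lines for scalar values and turns them into records for the TDS. The log format and regular expression here are hypothetical; real log-collectors are framework-specific.

```python
import re

# Hypothetical log format; real log-collectors are framework-specific.
LINE_RE = re.compile(r"step\s+(\d+).*?loss[:=]\s*([0-9.]+)")

def extract_metrics(log_lines):
    """Scan raw training-log lines and emit (step, loss) scalar records,
    the kind of evaluation metrics a log-collector pushes to the TDS."""
    records = []
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            records.append({"step": int(m.group(1)),
                            "loss": float(m.group(2))})
    return records

logs = [
    "2018-03-20 12:00:01 step 100 loss: 1.8321",
    "checkpoint saved to /tmp/model.ckpt",    # no metric on this line
    "2018-03-20 12:00:09 step 200 loss: 1.2104",
]
records = extract_metrics(logs)
print(records)
```

Lines without a recognizable metric (like the checkpoint message) are simply skipped, so only scalar evaluation metrics reach the TDS.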
58. THANK YOU!
FfDL
https://github.com/IBM/FfDL