The document discusses the need for Knative to build cloud-native AI platforms. It describes that an AI lifecycle involves multiple iterative phases like data preparation, model training, deployment, and monitoring. It states that Kubernetes alone is not sufficient and that concepts like building, serving, eventing and pipelines are required to automate the end-to-end AI workflow. It introduces Knative as a set of building blocks on top of Kubernetes that provide these capabilities through custom resource definitions. Specifically, Knative provides capabilities for source-to-container builds, event delivery and subscription, request-driven scalable serving of models, and configuration of CI/CD-style pipelines for Kubernetes applications.
2. Center for Open Source Data and AI Technologies (CODAIT)
• Code – Build and improve practical frameworks to enable more developers to realize immediate value.
• Content – Showcase solutions for complex and real-world AI problems.
• Community – Bring developers and data scientists to engage with IBM
[Diagram: AI lifecycle — Gather Data → Analyze Data → Machine Learning → Deep Learning → Deploy Model → Maintain Model, supported by the Python Data Science Stack (Pandas, Scikit-Learn, Jupyter), Apache Spark, Fabric for Deep Learning (FfDL), Keras + TensorFlow, MLeap + PFA, and the Model Asset eXchange]
Improving Enterprise AI lifecycle in Open Source
• Team contributes to over 10 open source projects
• 17 committers and many contributors in Apache projects
• Over 1100 JIRAs and 66,000 lines of code committed to Apache Spark itself; over 65,000 LoC into SystemML
• Over 25 product lines within IBM leveraging Apache Spark
• Speakers at over 100 conferences, meetups, unconferences and more
CODAIT
codait.org
3. Agenda
• Progress in ML and DL
• AI and Cloud: Complementary
• Why we need Knative to build an AI Platform
• How to build a transparent and trusted AI platform leveraging Knative
• Demo
17. And Cloud native means we use Kubernetes!
[Diagram: Kubernetes architecture — API, UI and CLI clients talk to the Master (etcd, API Server, Controller Manager Server, Scheduler Server), which schedules work across Worker Nodes 1…n; images are pulled from a Registry]
22. AIOps
Also, we need a transparent and trusted AI Platform!
[Diagram: AI Platform — Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
• Is model training showing increasing loss?
• Are model weights biased?
• Is the model vulnerable to adversarial attacks?
• Has the training data changed?
• Are the model predictions less accurate?
• Are hyperparameters suboptimal?
• Are predictions biased?
• Is the dataset biased?
23. AIOps
In fact, we need a transparent, trusted and automated AI Pipeline!
[Diagram: AI Pipeline — Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
• Is model training showing increasing loss?
• Are model weights biased?
• Is the model vulnerable to adversarial attacks?
• Has the training data changed?
• Are the model predictions less accurate?
• Are hyperparameters suboptimal?
• Are predictions biased?
• Is the dataset biased?
24. AIOps
Transparent, trusted, automated and event driven AI Pipeline!
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model, with event triggers (trained model is showing increasing loss; model performance is suboptimal; model is vulnerable to attack; data changed; bias detected) wired to actions (Prepare Data; Optimize Hyperparameters; Implement Defense / Harden; Train; Deploy)]
25. AIOps
Transparent, trusted, automated, event driven and auditable AI Pipeline!
[Diagram: the same event-driven pipeline of triggers and actions]
26. AIOps
Transparent, trusted, automated, event driven, auditable AI Pipeline as a Service!
[Diagram: the same event-driven pipeline, now with a Debias action wired to the trigger "model performance is biased", operated by a Data Scientist and an AIOps Engineer]
27. If we translate it to logical architecture, it looks like this!
[Diagram: AI Pipeline (Python definition – orchestrate and track), composed of Python functions: Data Pipe → Training Pipe → Model Validation Pipe → Model Deployment Pipe → Deployment Analysis Pipe; personas: AI Developer and Data Scientist, AI Operator]
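The logical architecture above — an AI pipeline defined in Python as an ordered sequence of functions, with the orchestrator tracking what ran — can be sketched roughly as follows. All function and key names here are illustrative, not part of any real pipeline framework.

```python
# Toy orchestrator: each "pipe" is a Python function that takes and returns a
# shared state dict; the runner executes them in order and tracks the stages.

def data_pipe(state):
    state["data"] = "prepared-data"
    return state

def training_pipe(state):
    state["model"] = "model-trained-on-" + state["data"]
    return state

def validation_pipe(state):
    state["validated"] = state["model"] is not None
    return state

def deployment_pipe(state):
    state["endpoint"] = "/serve/" + state["model"]
    return state

def run_pipeline(pipes, state=None):
    """Orchestrate the pipes in order; return final state plus a stage log."""
    state = state or {}
    log = []
    for pipe in pipes:
        state = pipe(state)
        log.append(pipe.__name__)  # tracking: record which stage ran
    return state, log

state, log = run_pipeline(
    [data_pipe, training_pipe, validation_pipe, deployment_pipe]
)
```

A real implementation would add the Deployment Analysis pipe, error handling, and persistence of the stage log for auditability.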
28. So we need more than Kubernetes. We need to be able to…
• build the Data Scientist's code
• orchestrate the ML code
• automate the ML workflow
• send event notifications
(Personas: Data Scientist, AIOps Engineer)
29. That means, we need the concepts of…
• Build
• Serving
• Pipeline
• Eventing
31. Well… what is Knative?
Knative provides a set of building blocks that enable modern, source-centric and container-based serverless workloads on Kubernetes. It uses K8S CRDs to define a new set of APIs around:
● Build
● Eventing
● Serving
● Pipeline
32. Knative Build!
Build — Source-to-container build orchestration
Knative Build components:
• Build
• Builder
• BuildTemplate
For example, you can write a Build that uses Kubernetes-native resources to obtain your source code from a repository, build a container image, and then run that image.
• A Build can include multiple steps, where each step specifies a Builder.
• A Builder is a type of container image that you create to accomplish any task, whether that's a single step in a process or the whole process itself.
• The steps in a Build can push to a repository.
• A BuildTemplate can be used to define reusable parameterized templates.
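A minimal sketch of what such a Build resource could look like (v1alpha1-era schema; the repository URL, image names and registry destination are placeholders, and the exact fields vary by Knative release):

```yaml
apiVersion: build.knative.dev/v1alpha1
kind: Build
metadata:
  name: model-image-build            # hypothetical name
spec:
  source:
    git:
      url: https://github.com/example/ml-code.git   # placeholder repo
      revision: master
  steps:
  - name: build-and-push
    image: gcr.io/kaniko-project/executor           # Kaniko used as the Builder image
    args:
    - --dockerfile=/workspace/Dockerfile
    - --destination=docker.io/example/model:latest  # placeholder registry target
```

Here the single step's Builder (Kaniko) both builds the container image from the fetched source and pushes it to a registry.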
33. Knative Eventing!
Eventing — Delivery and management of events, universal subscription, binding services to event ecosystems
Knative Eventing components:
● Sources — sources that fire events
● Channels — a single event forwarding and persistence layer
● Subscriptions — deliver and forward events to channels/services
Knative Eventing is designed around the following goals:
● Knative Eventing services are loosely coupled.
● Event producers and event sources are independent.
● Other services can be connected to the Eventing system.
● Ensure cross-service interoperability: Knative Eventing is consistent with the CloudEvents specification developed by the CNCF Serverless WG.
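A rough sketch of wiring a Channel to a subscriber service (early v1alpha1-era schema; the resource names and the in-memory provisioner choice are illustrative, and this API changed significantly across releases):

```yaml
apiVersion: eventing.knative.dev/v1alpha1
kind: Channel
metadata:
  name: model-events                 # hypothetical channel for model lifecycle events
spec:
  provisioner:
    apiVersion: eventing.knative.dev/v1alpha1
    kind: ClusterChannelProvisioner
    name: in-memory-channel          # simple, non-durable backing for a demo
---
apiVersion: eventing.knative.dev/v1alpha1
kind: Subscription
metadata:
  name: retrain-on-event             # hypothetical name
spec:
  channel:
    apiVersion: eventing.knative.dev/v1alpha1
    kind: Channel
    name: model-events
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1alpha1
      kind: Service
      name: retrain-service          # placeholder subscriber service
```

Any source that posts CloudEvents into the channel would then fan out to subscribed services such as a retraining job.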
34. Knative Pipeline!
Pipeline — Configure and run CI/CD style pipelines for your Kubernetes application
Knative Pipeline components:
● Task — a collection of sequential steps you would want to run as part of your CI flow, including its inputs, outputs and steps
● Pipeline — a graph of Tasks to execute
● Runs — to invoke a Pipeline or a Task
High level details of this design:
• Pipelines do not know what will trigger them; they can be triggered by events or by manually creating PipelineRuns
• Tasks can exist and be invoked completely independently of Pipelines
• Test results are a first class concept; being able to navigate test results easily is powerful
• Tasks can depend on artifacts, output and parameters created by other tasks
• Resources are the artifacts used as inputs and outputs of TaskRuns
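A minimal Task and Pipeline sketch in the early build-pipeline schema (the apiVersion, trainer image and all names are illustrative; this project's API was still changing rapidly at the time):

```yaml
apiVersion: pipeline.knative.dev/v1alpha1
kind: Task
metadata:
  name: train-model                  # hypothetical task
spec:
  steps:
  - name: train
    image: docker.io/example/trainer # placeholder training image
    command: ["python", "train.py"]
---
apiVersion: pipeline.knative.dev/v1alpha1
kind: Pipeline
metadata:
  name: ai-pipeline                  # hypothetical pipeline
spec:
  tasks:
  - name: train
    taskRef:
      name: train-model
```

Creating a PipelineRun (or TaskRun) resource is what actually invokes this definition, which is how events or manual triggers kick off a run.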
35. Knative Serving!
Serving — Request-driven compute model, scale to zero, autoscaling, routing and managing traffic
Knative Serving components:
● Configuration
○ Desired current state of deployment (#HEAD)
○ Records both code and configuration (separated, à la 12-factor)
○ Stamps out builds / revisions as it is updated
● Revision
○ Code and configuration snapshot
○ k8s infra: Deployment, ReplicaSet, Pods, etc.
● Route
○ Traffic assignment to Revisions (fractional scaling or by name)
○ Built using Istio
● Service
○ Provides a simple entry point for UI and CLI tooling to achieve common behavior
○ Acts as a top-level controller to orchestrate Route and Configuration
[Diagram: a Service creates a Route and a Configuration; the Configuration creates Revisions; the Route targets Revisions either as "latest" or explicitly by name]
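A minimal Knative Service sketch (v1alpha1-era `runLatest` mode; the service name and container image are placeholders):

```yaml
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: model-service                # hypothetical model-serving service
spec:
  runLatest:                         # always route traffic to the latest Revision
    configuration:
      revisionTemplate:
        spec:
          container:
            image: docker.io/example/model:latest   # placeholder image
```

From this one resource, Knative stamps out the Configuration, Revisions and Route described above, and the deployed model scales with request load, including to zero.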
52. So Knative uses K8S CRDs to define a new set of APIs:
● Build — Source-to-container build orchestration
● Eventing — Universal subscription, binding services to event ecosystems, delivery and management of events
● Serving — Request-driven compute model, scale to zero, autoscaling, routing and managing traffic
● Pipeline — Configure and run CI/CD style pipelines for your Kubernetes application
56. AIOps
We need a multi-framework ML/DL cloud native platform!
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
e.g. TensorFlow is awesome, but has static graphs, so PyTorch's dynamic graphs are becoming more popular. How do we handle multiple open source deep learning frameworks in a consistent way and leverage the power of cloud for distributed computing?
58. Fabric for Deep Learning (FfDL)
https://github.com/IBM/FfDL
• Fabric for Deep Learning, or FfDL (pronounced 'fiddle'), aims at making deep learning easily accessible to data scientists and AI developers.
• FfDL provides a consistent way to train and visualize deep learning jobs across multiple frameworks like TensorFlow, Caffe, PyTorch, Keras, etc.
FfDL is one of InfoWorld's 2018 Best of Open Source Software Award winners for machine learning and deep learning!
Community Partners
Links:
• FfDL GitHub page: https://github.com/IBM/FfDL
• FfDL dwOpen page: https://developer.ibm.com/code/open/projects/fabric-for-deep-learning-ffdl/
• FfDL announcement blog: http://developer.ibm.com/code/2018/03/20/fabric-for-deep-learning
• FfDL technical architecture blog: http://developer.ibm.com/code/2018/03/20/democratize-ai-with-fabric-for-deep-learning
• Deep Learning as a Service within Watson Studio: https://www.ibm.com/cloud/deep-learning
• Research paper: "Scalable Multi-Framework Management of Deep Learning Training Jobs": http://learningsys.org/nips17/assets/papers/paper_29.pdf
59. Fabric for Deep Learning
https://github.com/IBM/FfDL
FfDL is built using a microservices architecture on Kubernetes:
• The FfDL platform uses a microservices architecture to offer resilience, scalability, multi-tenancy, and security without modifying the deep learning frameworks, and with no or minimal changes to model code.
• FfDL control plane microservices are deployed as pods on Kubernetes to manage this cluster of GPU- and CPU-enabled machines effectively.
• Tested platforms: Kube DIND, IBM Cloud Public, IBM Cloud Private, GPUs using both the Kubernetes feature gate Accelerators and NVIDIA device plugins.
62. FfDL: Architecture – Current Release
[Diagram: Browser and CLIs call a REST API in front of the Trainer Service, which launches jobs via a Lifecycle Manager and records job info/status in MongoDB; a Job Monitor tracks running jobs. Each Training Job runs as a Learner Pod containing a Controller, a Learner container (e.g. TensorFlow, Caffe, PyTorch, Keras, etc.), a Log Collector, and optionally a Parameter Server or Open MPI. The pod mounts an Object Storage bucket holding the model definition, training data and trained models; logs flow to Elasticsearch, and metrics flow to the Prometheus Push Gateway, Alert Manager and Web UI]
64. Advanced Batch Scheduling with Kube-Batch!
[Diagram: the same FfDL architecture, extended with a Scheduling Engine that submits QueueJobs to Kube-Batch for batch scheduling of training jobs]
65. Kube-Batch: Filling gaps for FfDL and AI Workloads!
Gaps for AI workloads:
• Job queuing for enabling job priority and preemption
• Holistic scheduling
• Support for dynamic job prioritization/budgeting
• User-requested time-based scheduling
• Preemption/resumption in a supported checkpoint/restart environment
• Topology-aware placement (data-intensive workloads requiring network-topology-aware placement)
• Partial preemption of elastic jobs
• Performance estimation
Kube-Batch:
– Kubernetes incubator project
– Introduces batch scheduling in Kubernetes
– Support for simple job definition and lifecycle management
– Support for job queuing
– Introduces queue-based quota management
68. What does it take to trust a decision made by a machine (other than that it is 99% accurate)?
• Is it fair?
• Is it easy to understand?
• Did anyone tamper with it?
• Is it accountable?
71. IBM Adversarial Robustness Toolbox (ART)
https://github.com/IBM/adversarial-robustness-toolbox
ART is a library dedicated to adversarial machine learning. Its purpose is to allow rapid crafting and analysis of attack and defense methods for machine learning models. The Adversarial Robustness Toolbox provides implementations of many state-of-the-art methods for attacking and defending classifiers.
The Adversarial Robustness Toolbox contains implementations of the following attacks:
• DeepFool (Moosavi-Dezfooli et al., 2015)
• Fast Gradient Method (Goodfellow et al., 2014)
• Jacobian Saliency Map (Papernot et al., 2016)
• Universal Perturbation (Moosavi-Dezfooli et al., 2016)
• Virtual Adversarial Method (Miyato et al., 2015)
• C&W Attack (Carlini and Wagner, 2016)
• NewtonFool (Jang et al., 2017)
The following defense methods are also supported:
• Feature squeezing (Xu et al., 2017)
• Spatial smoothing (Xu et al., 2017)
• Label smoothing (Warde-Farley and Goodfellow, 2016)
• Adversarial training (Szegedy et al., 2013)
• Virtual adversarial training (Miyato et al., 2017)
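To make the Fast Gradient Method concrete, here is a self-contained sketch of the idea (this is not the ART API): perturb the input in the direction of the sign of the loss gradient, x' = x + ε·sign(∂L/∂x). The toy logistic-regression model, its weights, and the hand-derived gradient below are purely illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(x, w, y):
    """Binary cross-entropy for a linear classifier p = sigmoid(w . x)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(x, w, y, eps=0.1):
    """One Fast Gradient Method step; dL/dx = (p - y) * w for this model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

w = [1.5, -2.0]                  # toy model weights (illustrative)
x = [0.5, -0.5]                  # clean input with true label 1
x_adv = fgsm(x, w, y=1, eps=0.2) # adversarial example
```

The perturbed input `x_adv` has strictly higher loss than the clean input even though it differs by at most ε per feature, which is exactly what makes such attacks troubling for deployed models.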
74. AI Fairness 360 (AIF360)
https://github.com/IBM/AIF360
The AIF360 toolkit is an open-source library to help detect and remove bias in machine learning models. The AI Fairness 360 Python package includes a comprehensive set of metrics for datasets and models to test for biases, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
Toolbox:
• Fairness metrics (30+)
• Fairness metric explanations
• Bias mitigation algorithms (9+)
Supported fairness metrics:
• Comprehensive set of group fairness metrics derived from selection rates and error rates
• Comprehensive set of sample distortion metrics
• Generalized Entropy Index (Speicher et al., 2018)
Supported bias mitigation algorithms:
• Optimized Preprocessing (Calmon et al., 2017)
• Disparate Impact Remover (Feldman et al., 2015)
• Equalized Odds Postprocessing (Hardt et al., 2016)
• Reweighing (Kamiran and Calders, 2012)
• Reject Option Classification (Kamiran et al., 2012)
• Prejudice Remover Regularizer (Kamishima et al., 2012)
• Calibrated Equalized Odds Postprocessing (Pleiss et al., 2017)
• Learning Fair Representations (Zemel et al., 2013)
• Adversarial Debiasing (Zhang et al., 2018)
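As a flavor of what a group-fairness metric computes (this is not the AIF360 API), here is a sketch of disparate impact: the ratio of favorable-outcome rates between the unprivileged and privileged groups, with the common "80% rule" flagging ratios below 0.8. The outcome data below is made up for the example.

```python
def selection_rate(outcomes):
    """Fraction of favorable (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(unprivileged, privileged):
    """P(favorable | unprivileged) / P(favorable | privileged)."""
    return selection_rate(unprivileged) / selection_rate(privileged)

privileged   = [1, 1, 1, 0, 1]   # 80% favorable outcomes (made-up data)
unprivileged = [1, 0, 0, 0, 1]   # 40% favorable outcomes (made-up data)

di = disparate_impact(unprivileged, privileged)
biased = di < 0.8                # flag bias under the 80% rule
```

A mitigation algorithm such as Reweighing would then adjust instance weights until metrics like this move back toward parity.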
75. AIF360 checks for fairness in building and deploying models throughout the AI lifecycle (d'Alessandro et al., 2017)!
78. AIOps
Model is trained, tested and validated. Just deploy as microservices on Kubernetes? Do we need anything else?
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
• As I develop multiple versions of my models, how can I easily dark launch and shift traffic?
• How can I add and enforce policies on my model microservices?
• The network among microservices is not reliable.
• How can I monitor and trace my model microservices?
• How can I ensure the communication among microservices is secure?
83. How does it work?
1. Deploy a proxy (Envoy) beside your application ("sidecar deployment")
[Diagram: services A and B, each with an Envoy sidecar; A calls B through the sidecars]
84. How does it work?
2. Deploy Pilot to configure the sidecars
[Diagram: Pilot pushes config to the Envoy sidecars of A and B]
85. How does it work?
3. Deploy Telemetry to get telemetry
[Diagram: the Envoy sidecars of A and B report telemetry to the Telemetry component]
86. How does it work?
4. Deploy Citadel to assign identities and enable secure communication
[Diagram: Citadel issues identities to the Envoy sidecars so A and B communicate securely]
87. How does it work?
5. Deploy Policy to enforce policies
[Diagram: the Envoy sidecars consult the Policy component for policy decisions]
93. AIOps
Model is deployed. We need to ensure triggers and actions can be defined based on any vulnerability detection, dataset changes, bias detection, etc.!
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
94. AIOps
Enter Apache OpenWhisk!
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
99. AIOps
And last but not least, what do we use for pipeline orchestration?
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
100. AI Pipeline Orchestration!
• Currently at IBM we use the IBM DevOps toolchain pipeline, built on top of Jenkins
• In open source, we are investigating Pachyderm, Argo and Airflow
• Pachyderm and Argo are dataflow and workflow orchestrators on Kubernetes
• Both have strong merits, and provide a DevOps operator view around pipeline execution times, logs, metrics, exit handlers, notification systems, etc.
• As discussed, Knative is now evolving its own pipelining capabilities, though at an early stage
103. Jupyter Notebooks: Support for FfDL
User experience:
• The user selects where to run the experiment
• The job is packaged and submitted on behalf of the user
• The user has access to a job console to monitor the experiment
106. Demo Model: Gender Classification!
Data: UTKFace
A simple convolutional neural network (CNN) with 3 convolutional layers and 2 fully connected layers; a SoftMax layer outputs Male/Female with a confidence score.
107. Demo Flow!
[Diagram: ML code → source-to-container build (Knative Build) → training of the gender classification model on Fabric for Deep Learning → robustness check (Adversarial Robustness Toolbox) → fairness check (AI Fairness 360) → model deployment, with traffic split 90% to Model Revision 1 and 10% to Model Revision 2]
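The 90/10 revision split in the demo flow can be expressed with a Knative Service in "release" mode, roughly as follows (v1alpha1-era schema; the service name, revision names and image are placeholders):

```yaml
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: gender-classifier            # hypothetical service name
spec:
  release:
    # current revision first, candidate second (placeholder revision names)
    revisions: ["gender-classifier-00001", "gender-classifier-00002"]
    rolloutPercent: 10               # 10% to the candidate, 90% stays on current
    configuration:
      revisionTemplate:
        spec:
          container:
            image: docker.io/example/gender-classifier:v2  # placeholder image
```

Raising `rolloutPercent` gradually shifts traffic to the new revision, which is how a canary rollout of the retrained model would proceed.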
108. To fully automate the workflow, Knative Pipeline can be used with Knative Eventing!
[Diagram: the same demo flow — Knative Build, FfDL training, robustness and fairness checks, and the 90/10 revision split — now driven end to end by Knative Eventing]
112. What we learnt!
• Knative brings serverless actions, events, workflows, apps, containers and service mesh into one single stack
• Knative eliminates the need for stitching various individual components together (events, actions, service mesh, pipeline, etc.)
• Knative Build and Serving are simple to use and integrate with our existing containers
• Traffic routing and canary testing on Knative can be done without any Istio knowledge
• Built-in monitoring tools are very helpful to visualize the benefits of building services using Knative
• Knative Eventing shows promise, but needs to evolve more for production
• Knative Build-Pipeline is still at a very early stage, and changing rapidly
• Knative Build-Pipeline could get very complicated with many tasks; it needs a better debugging tool/GUI to monitor errors and visualize the pipeline flow