The document discusses the need for Knative to build cloud-native AI platforms. It describes that an AI lifecycle involves multiple iterative phases like data preparation, model training, deployment, and monitoring. It states that Kubernetes alone is not sufficient and that concepts like building, serving, eventing and pipelines are required to automate the end-to-end AI workflow. It introduces Knative as a set of building blocks on top of Kubernetes that provide these capabilities through custom resource definitions. Specifically, Knative provides capabilities for source-to-container builds, event delivery and subscription, request-driven scalable serving of models, and configuration of CI/CD-style pipelines for Kubernetes applications.
2. Center for Open Source Data and AI Technologies (CODAIT)
• Code – Build and improve practical frameworks to enable more developers to realize immediate value.
• Content – Showcase solutions for complex and real-world AI problems.
• Community – Bring developers and data scientists to engage with IBM
[Diagram: AI lifecycle — Gather Data → Analyze Data → Machine Learning → Deep Learning → Deploy Model → Maintain Model, supported by the Python Data Science Stack (Pandas, Scikit-Learn, Jupyter), Apache Spark, Fabric for Deep Learning (FfDL), Keras + TensorFlow, MLeap + PFA, and the Model Asset eXchange]
Improving Enterprise AI lifecycle in Open Source
• Team contributes to over 10 open source projects
• 17 committers and many contributors in Apache projects
• Over 1100 JIRAs and 66,000 lines of code committed to Apache Spark itself; over 65,000 LoC into SystemML
• Over 25 product lines within IBM leveraging Apache Spark
• Speakers at over 100 conferences, meetups, unconferences and more
CODAIT
codait.org
3. Agenda
• Progress in ML and DL
• AI and Cloud: Complementary
• Why we need Knative to build an AI Platform
• How to build a transparent and trusted AI platform leveraging Knative
• Demo
17. And Cloud native means we use Kubernetes!
[Diagram: Kubernetes architecture — API, UI and CLI clients talk to the Master (etcd, API Server, Controller Manager Server, Scheduler Server), which schedules work across Worker Nodes 1…n; images are pulled from a Registry]
22. AIOps
Also, we need a transparent and trusted AI Platform!
[Diagram: AI Platform — Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
• Is model training showing increasing loss?
• Are model weights biased?
• Is the model vulnerable to adversarial attacks?
• Has the training data changed?
• Are the model predictions less accurate?
• Are hyperparameters suboptimal?
• Are predictions biased?
• Is the dataset biased?
23. AIOps
In fact, we need a transparent, trusted and automated AI Pipeline!
[Diagram: AI Pipeline — Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
• Is model training showing increasing loss?
• Are model weights biased?
• Is the model vulnerable to adversarial attacks?
• Has the training data changed?
• Are the model predictions less accurate?
• Are hyperparameters suboptimal?
• Are predictions biased?
• Is the dataset biased?
24. AIOps
Transparent, trusted, automated and event driven AI Pipeline!
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model, with event triggers (trained model is showing increasing loss; model performance is suboptimal; model is vulnerable to attack; data changed; bias detected) wired to actions (Prepare Data; Optimize Hyperparameters; Implement Defense / Harden; Train; Deploy)]
25. AIOps
Transparent, trusted, automated, event driven and auditable AI Pipeline!
[Diagram: the same event-driven pipeline of triggers and actions]
26. AIOps
Transparent, trusted, automated, event driven, auditable AI Pipeline as a Service!
[Diagram: the same event-driven pipeline, now with a Debias action wired to the trigger "model performance is biased", operated by a Data Scientist and an AIOps Engineer]
27. If we translate it to logical architecture, it looks like this!
[Diagram: AI Pipeline (Python definition – orchestrate and track), composed of Python functions: Data Pipe → Training Pipe → Model Validation Pipe → Model Deployment Pipe → Deployment Analysis Pipe; personas: AI Developer and Data Scientist, AI Operator]
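The logical architecture above — an AI pipeline defined in Python as an ordered sequence of functions, with the orchestrator tracking what ran — can be sketched roughly as follows. All function and key names here are illustrative, not part of any real pipeline framework.

```python
# Toy orchestrator: each "pipe" is a Python function that takes and returns a
# shared state dict; the runner executes them in order and tracks the stages.

def data_pipe(state):
    state["data"] = "prepared-data"
    return state

def training_pipe(state):
    state["model"] = "model-trained-on-" + state["data"]
    return state

def validation_pipe(state):
    state["validated"] = state["model"] is not None
    return state

def deployment_pipe(state):
    state["endpoint"] = "/serve/" + state["model"]
    return state

def run_pipeline(pipes, state=None):
    """Orchestrate the pipes in order; return final state plus a stage log."""
    state = state or {}
    log = []
    for pipe in pipes:
        state = pipe(state)
        log.append(pipe.__name__)  # tracking: record which stage ran
    return state, log

state, log = run_pipeline(
    [data_pipe, training_pipe, validation_pipe, deployment_pipe]
)
```

A real implementation would add the Deployment Analysis pipe, error handling, and persistence of the stage log for auditability.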
28. So we need more than Kubernetes. We need to be able to…
• build the Data Scientist's code
• orchestrate the ML code
• automate the ML workflow
• send event notifications
(Personas: Data Scientist, AIOps Engineer)
29. That means, we need the concepts of…
• Build
• Serving
• Pipeline
• Eventing
31. Well… what is Knative?
Knative provides a set of building blocks that enable modern, source-centric and container-based serverless workloads on Kubernetes. It uses K8S CRDs to define a new set of APIs around:
● Build
● Eventing
● Serving
● Pipeline
32. Knative Build!
Build — Source-to-container build orchestration
Knative Build components:
• Build
• Builder
• BuildTemplate
For example, you can write a Build that uses Kubernetes-native resources to obtain your source code from a repository, build a container image, and then run that image.
• A Build can include multiple steps, where each step specifies a Builder.
• A Builder is a type of container image that you create to accomplish any task, whether that's a single step in a process or the whole process itself.
• The steps in a Build can push to a repository.
• A BuildTemplate can be used to define reusable parameterized templates.
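A minimal sketch of what such a Build resource could look like (v1alpha1-era schema; the repository URL, image names and registry destination are placeholders, and the exact fields vary by Knative release):

```yaml
apiVersion: build.knative.dev/v1alpha1
kind: Build
metadata:
  name: model-image-build            # hypothetical name
spec:
  source:
    git:
      url: https://github.com/example/ml-code.git   # placeholder repo
      revision: master
  steps:
  - name: build-and-push
    image: gcr.io/kaniko-project/executor           # Kaniko used as the Builder image
    args:
    - --dockerfile=/workspace/Dockerfile
    - --destination=docker.io/example/model:latest  # placeholder registry target
```

Here the single step's Builder (Kaniko) both builds the container image from the fetched source and pushes it to a registry.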
33. Knative Eventing!
Eventing — Delivery and management of events, universal subscription, binding services to event ecosystems
Knative Eventing components:
● Sources — sources that fire events
● Channels — a single event forwarding and persistence layer
● Subscriptions — deliver and forward events to channels/services
Knative Eventing is designed around the following goals:
● Knative Eventing services are loosely coupled.
● Event producers and event sources are independent.
● Other services can be connected to the Eventing system.
● Ensure cross-service interoperability: Knative Eventing is consistent with the CloudEvents specification developed by the CNCF Serverless WG.
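A rough sketch of wiring a Channel to a subscriber service (early v1alpha1-era schema; the resource names and the in-memory provisioner choice are illustrative, and this API changed significantly across releases):

```yaml
apiVersion: eventing.knative.dev/v1alpha1
kind: Channel
metadata:
  name: model-events                 # hypothetical channel for model lifecycle events
spec:
  provisioner:
    apiVersion: eventing.knative.dev/v1alpha1
    kind: ClusterChannelProvisioner
    name: in-memory-channel          # simple, non-durable backing for a demo
---
apiVersion: eventing.knative.dev/v1alpha1
kind: Subscription
metadata:
  name: retrain-on-event             # hypothetical name
spec:
  channel:
    apiVersion: eventing.knative.dev/v1alpha1
    kind: Channel
    name: model-events
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1alpha1
      kind: Service
      name: retrain-service          # placeholder subscriber service
```

Any source that posts CloudEvents into the channel would then fan out to subscribed services such as a retraining job.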
34. Knative Pipeline!
Pipeline — Configure and run CI/CD style pipelines for your Kubernetes application
Knative Pipeline components:
● Task — a collection of sequential steps you would want to run as part of your CI flow, including its inputs, outputs and steps
● Pipeline — a graph of Tasks to execute
● Runs — to invoke a Pipeline or a Task
High level details of this design:
• Pipelines do not know what will trigger them; they can be triggered by events or by manually creating PipelineRuns
• Tasks can exist and be invoked completely independently of Pipelines
• Test results are a first class concept; being able to navigate test results easily is powerful
• Tasks can depend on artifacts, output and parameters created by other tasks
• Resources are the artifacts used as inputs and outputs of TaskRuns
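A minimal Task and Pipeline sketch in the early build-pipeline schema (the apiVersion, trainer image and all names are illustrative; this project's API was still changing rapidly at the time):

```yaml
apiVersion: pipeline.knative.dev/v1alpha1
kind: Task
metadata:
  name: train-model                  # hypothetical task
spec:
  steps:
  - name: train
    image: docker.io/example/trainer # placeholder training image
    command: ["python", "train.py"]
---
apiVersion: pipeline.knative.dev/v1alpha1
kind: Pipeline
metadata:
  name: ai-pipeline                  # hypothetical pipeline
spec:
  tasks:
  - name: train
    taskRef:
      name: train-model
```

Creating a PipelineRun (or TaskRun) resource is what actually invokes this definition, which is how events or manual triggers kick off a run.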
35. Knative Serving!
Serving — Request-driven compute model, scale to zero, autoscaling, routing and managing traffic
Knative Serving components:
● Configuration
○ Desired current state of deployment (#HEAD)
○ Records both code and configuration (separated, à la 12-factor)
○ Stamps out builds / revisions as it is updated
● Revision
○ Code and configuration snapshot
○ k8s infra: Deployment, ReplicaSet, Pods, etc.
● Route
○ Traffic assignment to Revisions (fractional scaling or by name)
○ Built using Istio
● Service
○ Provides a simple entry point for UI and CLI tooling to achieve common behavior
○ Acts as a top-level controller to orchestrate Route and Configuration
[Diagram: a Service creates a Route and a Configuration; the Configuration creates Revisions; the Route targets Revisions either as "latest" or explicitly by name]
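A minimal Knative Service sketch (v1alpha1-era `runLatest` mode; the service name and container image are placeholders):

```yaml
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: model-service                # hypothetical model-serving service
spec:
  runLatest:                         # always route traffic to the latest Revision
    configuration:
      revisionTemplate:
        spec:
          container:
            image: docker.io/example/model:latest   # placeholder image
```

From this one resource, Knative stamps out the Configuration, Revisions and Route described above, and the deployed model scales with request load, including to zero.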
52. So Knative uses K8S CRDs to define a new set of APIs:
● Build — Source-to-container build orchestration
● Eventing — Universal subscription, binding services to event ecosystems, delivery and management of events
● Serving — Request-driven compute model, scale to zero, autoscaling, routing and managing traffic
● Pipeline — Configure and run CI/CD style pipelines for your Kubernetes application
56. AIOps
We need a multi-framework ML/DL cloud native platform!
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
e.g. TensorFlow is awesome, but has static graphs, so PyTorch's dynamic graphs are becoming more popular. How do we handle multiple open source deep learning frameworks in a consistent way and leverage the power of cloud for distributed computing?
58. Fabric for Deep Learning (FfDL)
https://github.com/IBM/FfDL
• Fabric for Deep Learning, or FfDL (pronounced 'fiddle'), aims at making deep learning easily accessible to data scientists and AI developers.
• FfDL provides a consistent way to train and visualize deep learning jobs across multiple frameworks like TensorFlow, Caffe, PyTorch, Keras, etc.
FfDL is one of InfoWorld's 2018 Best of Open Source Software Award winners for machine learning and deep learning!
Community Partners
Links:
• FfDL GitHub page: https://github.com/IBM/FfDL
• FfDL dwOpen page: https://developer.ibm.com/code/open/projects/fabric-for-deep-learning-ffdl/
• FfDL announcement blog: http://developer.ibm.com/code/2018/03/20/fabric-for-deep-learning
• FfDL technical architecture blog: http://developer.ibm.com/code/2018/03/20/democratize-ai-with-fabric-for-deep-learning
• Deep Learning as a Service within Watson Studio: https://www.ibm.com/cloud/deep-learning
• Research paper: "Scalable Multi-Framework Management of Deep Learning Training Jobs": http://learningsys.org/nips17/assets/papers/paper_29.pdf
59. Fabric for Deep Learning
https://github.com/IBM/FfDL
FfDL is built using a microservices architecture on Kubernetes:
• The FfDL platform uses a microservices architecture to offer resilience, scalability, multi-tenancy, and security without modifying the deep learning frameworks, and with no or minimal changes to model code.
• FfDL control plane microservices are deployed as pods on Kubernetes to manage this cluster of GPU- and CPU-enabled machines effectively.
• Tested platforms: Kube DIND, IBM Cloud Public, IBM Cloud Private, GPUs using both the Kubernetes feature gate Accelerators and NVIDIA device plugins.
62. FfDL: Architecture – Current Release
[Diagram: Browser and CLIs call a REST API in front of the Trainer Service, which launches jobs via a Lifecycle Manager and records job info/status in MongoDB; a Job Monitor tracks running jobs. Each Training Job runs as a Learner Pod containing a Controller, a Learner container (e.g. TensorFlow, Caffe, PyTorch, Keras, etc.), a Log Collector, and optionally a Parameter Server or Open MPI. The pod mounts an Object Storage bucket holding the model definition, training data and trained models; logs flow to Elasticsearch, and metrics flow to the Prometheus Push Gateway, Alert Manager and Web UI]
64. Advanced Batch Scheduling with Kube-Batch!
[Diagram: the same FfDL architecture, extended with a Scheduling Engine that submits QueueJobs to Kube-Batch for batch scheduling of training jobs]
65. Kube-Batch: Filling gaps for FfDL and AI Workloads!
Gaps for AI workloads:
• Job queuing for enabling job priority and preemption
• Holistic scheduling
• Support for dynamic job prioritization/budgeting
• User-requested time-based scheduling
• Preemption/resumption in a supported checkpoint/restart environment
• Topology-aware placement (data-intensive workloads requiring network-topology-aware placement)
• Partial preemption of elastic jobs
• Performance estimation
Kube-Batch:
– Kubernetes incubator project
– Introduces batch scheduling in Kubernetes
– Support for simple job definition and lifecycle management
– Support for job queuing
– Introduces queue-based quota management
68. What does it take to trust a decision made by a machine (other than that it is 99% accurate)?
• Is it fair?
• Is it easy to understand?
• Did anyone tamper with it?
• Is it accountable?
71. IBM Adversarial Robustness Toolbox (ART)
https://github.com/IBM/adversarial-robustness-toolbox
ART is a library dedicated to adversarial machine learning. Its purpose is to allow rapid crafting and analysis of attack and defense methods for machine learning models. The Adversarial Robustness Toolbox provides implementations of many state-of-the-art methods for attacking and defending classifiers.
The Adversarial Robustness Toolbox contains implementations of the following attacks:
• DeepFool (Moosavi-Dezfooli et al., 2015)
• Fast Gradient Method (Goodfellow et al., 2014)
• Jacobian Saliency Map (Papernot et al., 2016)
• Universal Perturbation (Moosavi-Dezfooli et al., 2016)
• Virtual Adversarial Method (Miyato et al., 2015)
• C&W Attack (Carlini and Wagner, 2016)
• NewtonFool (Jang et al., 2017)
The following defense methods are also supported:
• Feature squeezing (Xu et al., 2017)
• Spatial smoothing (Xu et al., 2017)
• Label smoothing (Warde-Farley and Goodfellow, 2016)
• Adversarial training (Szegedy et al., 2013)
• Virtual adversarial training (Miyato et al., 2017)
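To make the Fast Gradient Method concrete, here is a self-contained sketch of the idea (this is not the ART API): perturb the input in the direction of the sign of the loss gradient, x' = x + ε·sign(∂L/∂x). The toy logistic-regression model, its weights, and the hand-derived gradient below are purely illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(x, w, y):
    """Binary cross-entropy for a linear classifier p = sigmoid(w . x)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(x, w, y, eps=0.1):
    """One Fast Gradient Method step; dL/dx = (p - y) * w for this model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

w = [1.5, -2.0]                  # toy model weights (illustrative)
x = [0.5, -0.5]                  # clean input with true label 1
x_adv = fgsm(x, w, y=1, eps=0.2) # adversarial example
```

The perturbed input `x_adv` has strictly higher loss than the clean input even though it differs by at most ε per feature, which is exactly what makes such attacks troubling for deployed models.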
74. AI Fairness 360 (AIF360)
https://github.com/IBM/AIF360
The AIF360 toolkit is an open-source library to help detect and remove bias in machine learning models. The AI Fairness 360 Python package includes a comprehensive set of metrics for datasets and models to test for biases, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
Toolbox:
• Fairness metrics (30+)
• Fairness metric explanations
• Bias mitigation algorithms (9+)
Supported fairness metrics:
• Comprehensive set of group fairness metrics derived from selection rates and error rates
• Comprehensive set of sample distortion metrics
• Generalized Entropy Index (Speicher et al., 2018)
Supported bias mitigation algorithms:
• Optimized Preprocessing (Calmon et al., 2017)
• Disparate Impact Remover (Feldman et al., 2015)
• Equalized Odds Postprocessing (Hardt et al., 2016)
• Reweighing (Kamiran and Calders, 2012)
• Reject Option Classification (Kamiran et al., 2012)
• Prejudice Remover Regularizer (Kamishima et al., 2012)
• Calibrated Equalized Odds Postprocessing (Pleiss et al., 2017)
• Learning Fair Representations (Zemel et al., 2013)
• Adversarial Debiasing (Zhang et al., 2018)
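As a flavor of what a group-fairness metric computes (this is not the AIF360 API), here is a sketch of disparate impact: the ratio of favorable-outcome rates between the unprivileged and privileged groups, with the common "80% rule" flagging ratios below 0.8. The outcome data below is made up for the example.

```python
def selection_rate(outcomes):
    """Fraction of favorable (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(unprivileged, privileged):
    """P(favorable | unprivileged) / P(favorable | privileged)."""
    return selection_rate(unprivileged) / selection_rate(privileged)

privileged   = [1, 1, 1, 0, 1]   # 80% favorable outcomes (made-up data)
unprivileged = [1, 0, 0, 0, 1]   # 40% favorable outcomes (made-up data)

di = disparate_impact(unprivileged, privileged)
biased = di < 0.8                # flag bias under the 80% rule
```

A mitigation algorithm such as Reweighing would then adjust instance weights until metrics like this move back toward parity.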
75. AIF360 checks for fairness in building and deploying models throughout the AI lifecycle (d'Alessandro et al., 2017)!
78. AIOps
Model is trained, tested and validated. Just deploy as microservices on Kubernetes? Do we need anything else?
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
• As I develop multiple versions of my models, how can I easily dark launch and shift traffic?
• How can I add and enforce policies on my model microservices?
• The network among microservices is not reliable.
• How can I monitor and trace my model microservices?
• How can I ensure the communication among microservices is secure?
83. How does it work?
1. Deploy a proxy (Envoy) beside your application ("sidecar deployment")
[Diagram: services A and B, each with an Envoy sidecar; A calls B through the sidecars]
84. How does it work?
2. Deploy Pilot to configure the sidecars
[Diagram: Pilot pushes config to the Envoy sidecars of A and B]
85. How does it work?
3. Deploy Telemetry to get telemetry
[Diagram: the Envoy sidecars of A and B report telemetry to the Telemetry component]
86. How does it work?
4. Deploy Citadel to assign identities and enable secure communication
[Diagram: Citadel issues identities to the Envoy sidecars so A and B communicate securely]
87. How does it work?
5. Deploy Policy to enforce policies
[Diagram: the Envoy sidecars consult the Policy component for policy decisions]
93. AIOps
Model is deployed. We need to ensure triggers and actions can be defined based on any vulnerability detection, dataset changes, bias detection, etc.!
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
94. AIOps
Enter Apache OpenWhisk!
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
99. AIOps
And last but not least, what do we use for pipeline orchestration?
[Diagram: Prepared and Analyzed Data → Initial Model → Trained Model → Deployed Model]
100. AI Pipeline Orchestration!
• Currently at IBM we use the IBM DevOps toolchain pipeline, built on top of Jenkins
• In open source, we are investigating Pachyderm, Argo and Airflow
• Pachyderm and Argo are dataflow and workflow orchestrators on Kubernetes
• Both have strong merits, and provide a DevOps operator view around pipeline execution times, logs, metrics, exit handlers, notification systems, etc.
• As discussed, Knative is now evolving its own pipelining capabilities, though at an early stage
103. Jupyter Notebooks: Support for FfDL
User experience:
• The user selects where to run the experiment
• The job is packaged and submitted on behalf of the user
• The user has access to a job console to monitor the experiment
106. Demo Model: Gender Classification!
Data: UTKFace
A simple convolutional neural network (CNN) with 3 convolutional layers and 2 fully connected layers; a SoftMax layer outputs Male/Female with a confidence score.
107. Demo Flow!
[Diagram: ML code → source-to-container build (Knative Build) → training of the gender classification model on Fabric for Deep Learning → robustness check (Adversarial Robustness Toolbox) → fairness check (AI Fairness 360) → model deployment, with traffic split 90% to Model Revision 1 and 10% to Model Revision 2]
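The 90/10 revision split in the demo flow can be expressed with a Knative Service in "release" mode, roughly as follows (v1alpha1-era schema; the service name, revision names and image are placeholders):

```yaml
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: gender-classifier            # hypothetical service name
spec:
  release:
    # current revision first, candidate second (placeholder revision names)
    revisions: ["gender-classifier-00001", "gender-classifier-00002"]
    rolloutPercent: 10               # 10% to the candidate, 90% stays on current
    configuration:
      revisionTemplate:
        spec:
          container:
            image: docker.io/example/gender-classifier:v2  # placeholder image
```

Raising `rolloutPercent` gradually shifts traffic to the new revision, which is how a canary rollout of the retrained model would proceed.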
108. To fully automate the workflow, Knative Pipeline can be used with Knative Eventing!
[Diagram: the same demo flow — Knative Build, FfDL training, robustness and fairness checks, and the 90/10 revision split — now driven end to end by Knative Eventing]
112. What we learnt!
• Knative brings serverless actions, events, workflows, apps, containers and service mesh into one single stack
• Knative eliminates the need for stitching various individual components together (events, actions, service mesh, pipeline, etc.)
• Knative Build and Serving are simple to use and integrate with our existing containers
• Traffic routing and canary testing on Knative can be done without any Istio knowledge
• Built-in monitoring tools are very helpful to visualize the benefits of building services using Knative
• Knative Eventing shows promise, but needs to evolve more for production
• Knative Build-Pipeline is still at a very early stage, and changing rapidly
• Knative Build-Pipeline could get very complicated with many tasks; it needs a better debugging tool/GUI to monitor errors and visualize the pipeline flow