2. Tikal Knowledge
TIKAL INTRO
WHO ARE WE?
▸ Tikal helps ISVs in Israel & abroad with their technological
challenges.
▸ Our engineers are full-stack developers with expertise in
Android, DevOps, Java, JS, Python, and ML.
▸ We are passionate about technology and specialise in
open source technologies.
▸ Our tech and group leaders help establish new software teams
& enhance existing ones with innovative & creative
thinking.
https://www.meetup.com/full-stack-developer-il/
3. FullStack Developers Israel
SELF INTRODUCTION
▸ My open-thinking, open-techniques
ideology is driven by open source
technologies and the collaborative approach
that defines my M.O.
▸ My solution-driven approach is firmly
based on hands-on experience and a deep understanding
of operating systems, application stacks,
programming languages, networking, cloud
in general, and today more and more cloud
native solutions.
▸ Technologies:
▸ Linux { just pick a flavour …}
▸ *Scripting
▸ Git
▸ Python/Go
▸ Cloud { public/private/hybrid }
▸ Docker
▸ Kubernetes
HAGGAI PHILIP ZAGURY - DEVOPS ARCHITECT AND GROUP TECH LEAD
5. FullStack Developers Israel
MACHINE LEARNING | CONTINUOUS OPERATIONS
WE NEED “CI/CD” FOR OUR MODEL TRAINING …
▸ What he didn’t say is …
▸ In-browser training
▸ Backend training
▸ TensorFlow training
▸ TensorFlow serving
▸ Storage [ for raw data & model ] …
7. FullStack Developers Israel
A RELATIVELY SIMPLE USE CASE …
[Diagram: boxes for a TensorFlow training server, SERVER, CLIENT, and an object store. Responsibilities listed: serve frontend app, collect images, train, infer. Flows, numbered 1 to 6 in the original figure: upload images, serve model, get trained model, enrich model with new data, upload images, serve protobuf.]
8. FullStack Developers Israel
A CLASSIC APP
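[Diagram: boxes for SERVER, CLIENT, and an object store. Responsibilities listed: serve frontend app, collect images, train, infer. Flows, numbered 1, 2, 5, and 6 in the original figure: upload images, serve model, get trained model, upload images.]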
9. FullStack Developers Israel
MODEL TRAINING …
‣ If you’re using a pre-trained model, it’s no different
from using a backend / an API endpoint!
‣ Training processes are complex and require
on-demand Infrastructure as a Service
‣ Scalability
‣ Faster time to market vs. faster results
‣ Scaling costs …
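The first point above can be made concrete. This is a minimal sketch, assuming the pre-trained model is exposed through TensorFlow Serving's REST API (`POST /v1/models/<name>:predict` with an `instances` payload). The host, port, and model name passed in are placeholders, not values taken from this deck.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    url = "http://{}/v1/models/{}:predict".format(host, model_name)
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(host, model_name, instances, timeout=5.0):
    """Send the request and return the "predictions" list from the reply."""
    request = build_predict_request(host, model_name, instances)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["predictions"]
```

From the caller's side this is an ordinary HTTP call, for example `predict("localhost:8501", "my_model", [[0.1, 0.2]])`, which is why consuming a pre-trained model looks like consuming any other backend.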
12. FullStack Developers Israel
CONTINUOUS INTEGRATION
‣ A Jenkins pipeline
‣ Build: get sample data /
updated data
‣ Deploy the model to CPU/GPU
‣ Train and record results
‣ Promote: upload the new
model for the “space invaders”
microservice backend
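The stages above can be sketched outside Jenkins as plain functions. This is a hypothetical illustration only: `TrainingResult`, `should_promote`, `run_pipeline`, and the `min_gain` threshold are names invented here, and the "deploy to CPU/GPU" stage is left to whatever callable is passed in as `train`.

```python
from dataclasses import dataclass


@dataclass
class TrainingResult:
    model_path: str      # where the trained model artifact was written
    accuracy: float      # validation accuracy recorded by the train stage


def should_promote(candidate, current, min_gain=0.01):
    """Promote only when the candidate beats the serving model by min_gain."""
    if current is None:          # nothing is serving yet, so promote
        return True
    return candidate.accuracy >= current.accuracy + min_gain


def run_pipeline(fetch_data, train, upload, current=None):
    """Mirror the stages: build (get data), train and record, promote."""
    data = fetch_data()
    candidate = train(data)
    if should_promote(candidate, current):
        upload(candidate.model_path)
        return candidate
    return current
```

Keeping the promotion rule in one small function makes the gate easy to test on its own, independently of the CI server that calls it.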
13. FullStack Developers Israel
THE GAME IS JUST A MEANS TO AN END …
[Diagram: TensorFlow training, parameterized by # epochs, lr, and more flags]
import tensorflow as tf  # tf.app.flags is the TensorFlow 1.x flags API

flags = tf.app.flags
flags.DEFINE_float("lr", 0.0001, "Learning Rate")
# The help strings below were cut off at the slide edge; "…" marks each cut.
flags.DEFINE_string("units", "((50, 0.2), (40, 0.1))", "Configuration of hidden un… "
                    "Expected: tuple of tuple pairs. Each pair represents one hidde… "
                    "For instance: '((100, 0.2), (50, 0.3))' will create dense h… "
                    "dropout layer with rate of 0.2. Afterwards, it will create de… "
                    "dropout layer with rate of 0.3. If you wish to have hidden la… "
                    "second value. Example: '((100,), (50, 0.3))'")
flags.DEFINE_integer("epochs", 10, "Number of epochs")
flags.DEFINE_float("batch_frac", 0.3, "The fraction of training examples to consid… "
                   "For instance, 0.1 will divide the training to 10 batches")
flags.DEFINE_boolean("draw_plot", False, "Whether to draw a plot at the end")
flags.DEFINE_boolean("export_js", False, "Whether to export to a tensorflow.js mode…")
FLAGS = flags.FLAGS
‣ We need to train our
model with different parameters
to reach the optimal model
parameters …
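One way to drive training with different parameters is to expand a grid of flag values into one command line per combination. The sketch below does that for the `lr` and `epochs` flags defined earlier; the script name `train.py` and the values in the grid are assumptions made for illustration.

```python
import itertools


def training_commands(script, grid):
    """Yield one argv list per combination of flag values in grid."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        argv = ["python", script]
        for name, value in zip(names, values):
            argv.append("--{}={}".format(name, value))
        yield argv


grid = {
    "lr": [0.0001, 0.001],   # illustrative values only
    "epochs": [10, 20],
}

for argv in training_commands("train.py", grid):
    print(" ".join(argv))
```

Each generated command can then be handed to whatever runs the training, whether a local subprocess, a CI job, or a container.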
14. FullStack Developers Israel
SCALING / MULTIPLEXING … TENSORFLOW SUPPORTS MULTI-PART / DISTRIBUTED FLOWS
‣ Run the same model with
different parameters in order to
choose the most efficient vs. most
accurate vs. most cost-effective pipeline!
‣ Most efficient # of epochs /
params
https://www.tensorflow.org/performance/datasets_performance
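Choosing between "most efficient", "most accurate", and "most cost-effective" presupposes that every run records those three numbers. The following is a minimal sketch of that comparison; the `Run` fields and the accuracy floor are assumptions, and real runs would supply their own measurements.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    accuracy: float       # higher is better
    minutes: float        # wall-clock training time
    cost_usd: float       # what the run cost on the chosen hardware


def pick(runs, min_accuracy):
    """Return the most accurate, fastest, and cheapest acceptable runs."""
    ok = [r for r in runs if r.accuracy >= min_accuracy]
    if not ok:
        raise ValueError("no run reached the accuracy floor")
    return {
        "most_accurate": max(ok, key=lambda r: r.accuracy),
        "most_efficient": min(ok, key=lambda r: r.minutes),
        "most_cost_effective": min(ok, key=lambda r: r.cost_usd),
    }
```

The three answers are often different runs, which is the trade-off the slide points at.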
15. FullStack Developers Israel
A/B TESTING / CANARY RELEASES ?!
[Diagram: MODEL VER 1.0, MODEL VER 1.7, and MODEL VER 2.0 backed by a storage provider; traffic split 60% / 30% / 10%; collect in-browser training.]
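A canary split like the one in the figure boils down to a weighted random choice per request. The sketch below assumes the percentages map to the versions in the order they appear (1.0 at 60%, 1.7 at 30%, 2.0 at 10%); the figure does not state that mapping explicitly, and the function names are invented for illustration.

```python
import random
from collections import Counter

# Assumed mapping of the slide's percentages to model versions.
TRAFFIC_SPLIT = {"1.0": 60, "1.7": 30, "2.0": 10}


def pick_version(split, rng=random):
    """Pick a model version at random, weighted by its traffic share."""
    versions = list(split)
    weights = [split[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]


def simulate(split, requests, seed=0):
    """Count how many of `requests` calls land on each version."""
    rng = random.Random(seed)
    return Counter(pick_version(split, rng) for _ in range(requests))
```

In a real deployment the routing would usually live in the gateway or service mesh rather than in application code; this only shows the arithmetic behind the split.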
16. FullStack Developers Israel
TRANSLATION …
▸ A flexible training model
▸ Parameterized flow
▸ Model Testing
▸ Promotion mechanism
▸ Data Import and preprocessing
▸ Post Processing
21. FullStack Developers Israel
OPTIONS - GCP ML/DL
▸ Assume you develop in the
cloud / on the cloud
▸ Consume CPUs / GPUs / TPUs
constantly
▸ Adjust your workflow to
Google patterns (which isn’t
a bad thing …)
22. FullStack Developers Israel
OPTIONS - GCP ML/DL
▸ TPU lock-in?
▸ Wouldn’t it be nice to
benchmark TPU & GPU on
another provider?!
26. FullStack Developers Israel
THERE’S A PATTERN HERE …
[Diagram with steps numbered 1 to 6, connecting: IDE, Model Serving, Model Storage, Parameter injection, Parameterized training, Training Orchestrator.]
28. FullStack Developers Israel
DO I CARE ABOUT VENDOR LOCK-IN ?! - LET’S TALK MULTI-CLOUD
[Diagram: TensorFlow training shown three times, across "my laptop" and "cloud". Labels: "I need CPU / GPU / TPU" and "Adjust / wrap our code to suit the vendor".]
29. FullStack Developers Israel
IT’S NOT ONLY A MATTER OF VENDOR LOCK-IN! - IT’S MULTI-CLOUD
[Diagram: "my laptop" and "cloud", with the label "I need CPU / GPU / TPU". Next to the CPU / GPU / TPU options a note reads "Only in Google ATM".]
33. FullStack Developers Israel
ML/DL AS A SERVICE - ON YOUR INFRASTRUCTURE
‣ Package model
‣ Package configuration
34. FullStack Developers Israel
PRE-PACKAGED MODELS FOR TRAINING / SERVING
‣ Apply to Kubernetes via
ksonnet
35. FullStack Developers Israel
MODEL TRAINING
[Diagram: DevEnv -> push TensorFlow container to registry -> create tfjob -> store results]
https://www.slideshare.net/barbarafusinska/hassle-free-scalable-machine-learning-learning-with-kubeflow
https://codelabs.developers.google.com/codelabs/kubeflow-introduction/index.html?index=..%2F..%2Fio2018#2
36. FullStack Developers Israel
MODEL SERVING
[Diagram: DevEnv; push application container to registry; deploy app to K8s; use results; consume / use model in local development or in the cloud; use & improve model.]
37. FullStack Developers Israel
MODEL TRAINING & SERVING
[Diagram with steps numbered 1 to 6: DevEnv; push TensorFlow container to registry; train model in Kubeflow; store results; push application container to registry; deploy app to K8s; use results; consume / use model in local development or in the cloud; use & improve model.]
38. FullStack Developers Israel
A/B TESTING
[Diagram with steps numbered 1 to 7: DevEnv; push TensorFlow container to registry; train model in Kubeflow; store results; push application container to registry; deploy app to K8s; use results; consume / use model in local development or in the cloud; use & improve model; step 7: use Ambassador for A/B testing.]
39. FullStack Developers Israel
A ONE STOP SHOP FOR EVERYTHING …
On-prem / cloud
“PaaS” on K8s
▸ Job
▸ CronJob
▸ Pod
▸ ReplicaSets (multi-step /
distributed)
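As an illustration of using these primitives as a "PaaS", the sketch below builds a CronJob manifest that wraps a training Job. It assumes the `batch/v1` CronJob API (Kubernetes 1.21 and later; older clusters used `batch/v1beta1`). The job name, schedule, and container arguments are hypothetical, and it is an assumption that the image accepts these flags.

```python
import json


def training_job_spec(image, args):
    """Return the spec of a batch/v1 Job that runs one training container."""
    return {
        "backoffLimit": 2,
        "template": {
            "spec": {
                "restartPolicy": "OnFailure",
                "containers": [
                    {"name": "train", "image": image, "args": list(args)}
                ],
            }
        },
    }


def nightly_training_cronjob(name, image, args, schedule="0 2 * * *"):
    """Wrap the Job spec in a batch/v1 CronJob that fires on `schedule`."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {"spec": training_job_spec(image, args)},
        },
    }


manifest = nightly_training_cronjob(
    "wcm-retrain",                              # hypothetical name
    "tikal/webcam-controller-model:latest",
    ["--epochs=10", "--lr=0.0001"],             # assumed to be accepted by the image
)
print(json.dumps(manifest, indent=2))  # kubectl accepts JSON as well as YAML
```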
40. FullStack Developers Israel
TFJOB CRD - CUSTOM RESOURCE DEFINITION
hagzag@model-tarining 👉 kubectl get tfjob
NAME AGE
wcm 1d
41. FullStack Developers Israel
OUR IMAGE IN KUBEFLOW …
…
  clusterName: "minikube"
  creationTimestamp: 2018-06-23T07:31:54Z
  generation: 1
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
  name: wcm
  namespace: wcm
  resourceVersion: "94971"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/wcm/tfjobs/wcm
  uid: 80ab9472-76b7-11e8-be6d-0800279cc216
spec:
  RuntimeId: werb
  replicaSpecs:
  - replicas: 3
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - image: tikal/webcam-controller-model:latest
          name: tensorflow
          resources: {}
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: WORKER
  - replicas: 2
    template:
…
‣ The next step is to wrap our model
with some Operator / TF data
so Kubeflow can display it …
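A manifest shaped like the dump above can also be generated from code rather than written by hand. The sketch below reuses only the fields visible in the dump (`kubeflow.org/v1alpha1`, `replicaSpecs`, `tfReplicaType`, `tfPort`). The second replica entry is truncated in the source, so treating it as a parameter server (`PS`) is an assumption, as are the function names.

```python
import json


def replica_spec(replica_type, replicas, image, tf_port=2222):
    """One entry of spec.replicaSpecs, shaped like the dump above."""
    return {
        "replicas": replicas,
        "tfReplicaType": replica_type,
        "tfPort": tf_port,
        "template": {
            "spec": {
                "containers": [{"name": "tensorflow", "image": image}],
                "restartPolicy": "OnFailure",
            }
        },
    }


def tfjob(name, namespace, image, workers=3, parameter_servers=2):
    """Build a kubeflow.org/v1alpha1 TFJob manifest as a plain dict."""
    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicaSpecs": [
                replica_spec("WORKER", workers, image),
                # Assumption: the truncated second entry is a parameter server.
                replica_spec("PS", parameter_servers, image),
            ]
        },
    }


print(json.dumps(tfjob("wcm", "wcm", "tikal/webcam-controller-model:latest"), indent=2))
```

Later Kubeflow releases changed the TFJob schema, so this layout applies to the `v1alpha1` API shown on the slide.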
42. FullStack Developers Israel
USE S3 AND TENSORBOARD …
‣ Reuse training results
and display them in your
common TensorFlow
tooling.
43. FullStack Developers Israel
WANT MORE
‣ Demo model -> https://github.com/tikalk/webcam-controller-model
‣ Kubeflow - the main “engine” kubeflow.io
‣ It also supports other tools …
https://github.com/dwhitena/kubeflow_pachyderm
‣ https://github.com/SeldonIO/seldon-core