There are many options for building and deploying machine learning models to production at scale. We’ll walk through the growing suite of tools that centralize model building and deployment at scale, with technologies like Kubeflow, Seldon, and TensorRT. Finally, we’ll address techniques for monitoring these new services.
You’ll walk away with an understanding of a complete development lifecycle for scalable machine learning services.
3. Why is machine learning suddenly a thing?
– Dedicated hardware for inference
– Availability of large, labeled datasets
– Original gains with CNNs detecting objects in images
– There is (almost) nothing new in universal Turing machines
10. How does it work?
– Build, clean, and label a dataset
– Build a model
– Deploy and evaluate
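The three steps above can be sketched end to end. This is a toy illustration, not the talk's actual pipeline: the nearest-centroid "model", the data, and all function names are made up for the example.

```python
# A minimal sketch of the build / train / evaluate loop using a toy
# one-feature nearest-centroid classifier. Everything here is illustrative.

def clean(rows):
    """Drop records with missing features or labels."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def train(rows):
    """'Model' = per-class mean of a single numeric feature."""
    sums, counts = {}, {}
    for x, y in rows:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, x):
    """Assign the class whose centroid is nearest."""
    return min(model, key=lambda label: abs(model[label] - x))

def evaluate(model, rows):
    """Accuracy on held-out labeled data."""
    hits = sum(1 for x, y in rows if predict(model, x) == y)
    return hits / len(rows)

# 1. Build, clean, and label a dataset
raw = [(1.0, "low"), (1.2, "low"), (9.0, "high"), (None, "high"), (8.8, "high")]
data = clean(raw)

# 2. Build a model
model = train(data)

# 3. Deploy and evaluate (here: just score a held-out set)
print(evaluate(model, [(0.9, "low"), (9.1, "high")]))  # 1.0
```

In a real system each step is its own service or pipeline stage; collapsing them into one script only works on a laptop, which is exactly the gap tools like Kubeflow aim to close.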
Real-world data is messy and filled with garbage.
Models tend to be brittle and opaque, and difficult to debug.
How do you measure ‘performance’?
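One concrete answer to the performance question: accuracy alone can mislead (especially on imbalanced data), so precision and recall are often reported instead. A hand-computed sketch, with made-up labels and predictions:

```python
# Precision and recall for a binary classifier, computed by hand.
# precision = tp / (tp + fp): of everything flagged positive, how much was right?
# recall    = tp / (tp + fn): of all true positives, how many did we catch?

def precision_recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # illustrative ground truth
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # illustrative model output
p, r = precision_recall(y_true, y_pred)
print(p, r)  # 0.75 0.75
```

Which metric matters depends on the product: a spam filter may prize precision, a fraud detector recall.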
14. Training Computational Needs Can Be Staggering
– Training cost to reproduce the GPT-2 text generation model (via OpenGPT-2) from scratch: ~$50,000 on GCP*
– Computational cost of running AlphaGo Zero: estimated at ~$35 million**
* https://blog.usejournal.com/opengpt-2-we-replicated-gpt-2-because-you-can-too-45e34e6d36dc
** https://www.yuzeh.com/data/agz-cost.html
15. Inference Gets Pushed Out to the Edge
– Cell phones are all adding ML acceleration to their native chips
– Bandwidth and latency limitations on cameras force inference computation to the edge
16. Datasets Become an Integral Part of Your Code
– Gather, clean, and label data
– Once a model is deployed, it can then be used to bootstrap better data
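One common way a deployed model bootstraps better data is by pre-labeling: keep only high-confidence predictions as candidate labels for human review. A sketch under stated assumptions: `fake_model`, the threshold, and the sample texts are all made up for illustration.

```python
# Sketch of dataset bootstrapping: a deployed model's high-confidence
# predictions become candidate labels, cutting down manual labeling work.

def fake_model(text):
    """Stand-in for a deployed model endpoint: returns (label, confidence)."""
    return ("spam", 0.97) if "win" in text else ("ham", 0.55)

def bootstrap_labels(samples, threshold=0.9):
    """Keep only predictions confident enough to propose as labels."""
    candidates = []
    for text in samples:
        label, confidence = fake_model(text)
        if confidence >= threshold:
            candidates.append({"text": text, "label": label, "confidence": confidence})
    return candidates

batch = ["win a prize now", "meeting at noon", "win big today"]
for candidate in bootstrap_labels(batch):
    print(candidate["text"], "->", candidate["label"])
```

The low-confidence remainder is where human labeling effort is best spent, which is why the dataset and the model end up versioned together.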
23. Kubernetes comes with extra complexity
– etcd, the default database for managing Kubernetes state, recommends 5 (!) server instances for durability
– Growing list of subpackages to control networking layers and load balancing (Istio, Envoy)
– Development, testing, and deployment must be completely rethought as Kubernetes grows beyond a single dev machine
27. Kubernetes Native ML Tools
– Kubeflow: ML toolkit for Kubernetes, from Google
– TensorRT Inference Server: custom inference server with optimizations for NVIDIA hardware
– Pachyderm: version control for data, and data pipelines
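Seldon Core, mentioned in the abstract, belongs in this family too: it serves a model written as a plain Python class exposing a `predict` method, and builds the REST/gRPC microservice and Kubernetes resources around it. A minimal sketch of the wrapper class convention; the model logic and threshold are invented, and real deployments involve packaging and a SeldonDeployment manifest not shown here.

```python
# Sketch of a Seldon-style model wrapper: a plain class with predict().
# The "model" here is a toy threshold; a real class would load trained
# weights in __init__ and run them in predict().

class ToyClassifier:
    def __init__(self):
        # Pretend this threshold was learned offline.
        self.threshold = 2.5

    def predict(self, X, features_names=None):
        # X is a 2-D array-like of feature rows; return one score per row.
        return [[1.0] if row[0] > self.threshold else [0.0] for row in X]

model = ToyClassifier()
print(model.predict([[3.1, 0.2], [1.4, 0.9]]))  # [[1.0], [0.0]]
```

Because the serving contract is just a method on a class, the same code is unit-testable on a laptop before it is containerized and deployed to the cluster.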
40. Compress Distributed System Complexity with Traces
– See complete units of work as they pass through your entire system; especially useful in pipelines with multiple steps
– Add tags to be able to see specific customers, organizations, and their direct experience with your systems
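The idea can be shown with a hand-rolled miniature: one trace id shared by tagged, timed spans, one span per pipeline step. In practice you would use a real tracing stack (e.g. OpenTelemetry with Jaeger); this toy only illustrates the data model, and all names are invented.

```python
# A toy trace: a shared trace id, user-supplied tags (customer, org, ...),
# and one timed span per unit of work in the pipeline.

import time
import uuid

class Trace:
    def __init__(self, **tags):
        self.trace_id = uuid.uuid4().hex
        self.tags = tags              # e.g. customer="acme"
        self.spans = []               # (name, duration_seconds)

    def span(self, name):
        trace = self

        class _Span:
            def __enter__(self):
                self.start = time.monotonic()
                return self

            def __exit__(self, *exc_info):
                trace.spans.append((name, time.monotonic() - self.start))
                return False          # never swallow exceptions

        return _Span()

# A two-step pipeline, traced end to end and tagged with a customer.
trace = Trace(customer="acme")
with trace.span("preprocess"):
    time.sleep(0.01)
with trace.span("inference"):
    time.sleep(0.01)

print(trace.tags, [name for name, _ in trace.spans])
```

Querying by tag is what makes this "compression": one customer id pulls back every step of every request that customer touched.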
42. See History of Machine State with Metrics
– See bottlenecks in CPU, disk space, and memory usage
– For GPU / TPU, see hardware-level metrics
– Correlate with logs and traces to isolate errors to software- or hardware-level issues
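"History of machine state" reduces to time series of samples per metric name. A minimal sketch of that shape; real systems would scrape and store these with something like Prometheus, and the metric names and values below are made up.

```python
# A toy metrics recorder: each gauge is a time series of (timestamp, value)
# samples, so spikes and trends in machine state stay visible after the fact.

import time

class Metrics:
    def __init__(self):
        self.series = {}  # metric name -> list of (timestamp, value)

    def gauge(self, name, value):
        self.series.setdefault(name, []).append((time.time(), value))

    def latest(self, name):
        return self.series[name][-1][1]

metrics = Metrics()
metrics.gauge("cpu_percent", 41.5)
metrics.gauge("disk_free_gb", 118.0)
metrics.gauge("cpu_percent", 87.2)   # later sample: a spike worth investigating
print(metrics.latest("cpu_percent"))  # 87.2
```

Keeping the full series (not just the latest value) is what lets you line a CPU spike up against the traces and logs from the same window.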
45. See Auditable Trail of Side Effects with Logs
– Ingested logs show the history of work on each individual system component
– Correlate with traces and metrics to isolate errors to the library or software-dependency level
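The correlation with traces usually works by stamping a trace id onto every log record. One way to do that with Python's standard `logging` module is a `Filter`; the logger name, trace id value, and messages below are illustrative.

```python
# Stamp a trace id onto every log record via a logging.Filter, so log
# lines can later be joined against traces and metrics from the same request.

import logging

class TraceIdFilter(logging.Filter):
    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record):
        record.trace_id = self.trace_id   # attach the id to every record
        return True                       # never drop the record

logger = logging.getLogger("inference")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter("abc123"))
logger.setLevel(logging.INFO)

logger.info("model loaded")
logger.warning("fallback to CPU: GPU unavailable")
```

With the id in every line, a single grep over ingested logs reconstructs the side effects of one request across components.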
47. Object detection alone opens up new platforms
– Either of these advances, taken in isolation, adds to a platform
– Taken together, we have a chance to rethink the way software behaves (images as code and APIs)