2. Disclaimer
I’m not a Machine Learning expert.
I work on infrastructure and distributed systems for a
living.
3. Kubernetes a year ago...
● Was used primarily for stateless workloads
● Needed an understanding of several core concepts to operate
● Applications had to be written to fit into core controller abstractions
4. Kubernetes today...
● Has abstractions to support stateful applications, and now data
processing and machine learning.
● Has a wide range of extension points including ones that allow API
extensions and custom controllers.
● Has support for building higher level abstractions and APIs to hide
infrastructure & operational complexity.
5. What’s changed?
● Workload controller abstractions moving to GA/stable.
● Custom Resource Definitions & Aggregated API Servers
● Kubernetes Operators
● Community support for external frameworks
● Work on scheduling and resource management (ongoing)
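To make the Custom Resource Definition item concrete, here is a minimal sketch of a CRD manifest built as a plain Python dict. The group `example.com` and the kind `TrainingJob` are hypothetical names chosen for illustration, and the layout assumes the `apiextensions.k8s.io/v1` form of the API rather than the beta form available when these slides were written.

```python
import json


def build_crd(group, kind, plural, version="v1"):
    """Return a CustomResourceDefinition manifest as a plain dict."""
    return {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "CustomResourceDefinition",
        # Kubernetes requires the object name to be "<plural>.<group>".
        "metadata": {"name": f"{plural}.{group}"},
        "spec": {
            "group": group,
            "scope": "Namespaced",
            "names": {"kind": kind, "plural": plural, "singular": kind.lower()},
            "versions": [
                {
                    "name": version,
                    "served": True,
                    "storage": True,
                    # Permissive schema: accept any fields on the custom object.
                    "schema": {
                        "openAPIV3Schema": {
                            "type": "object",
                            "x-kubernetes-preserve-unknown-fields": True,
                        }
                    },
                }
            ],
        },
    }


# Hypothetical group and kind, for illustration only.
crd = build_crd("example.com", "TrainingJob", "trainingjobs")
print(json.dumps(crd, indent=2))
```

Once a manifest like this is applied, the API server serves the new resource type, and a custom controller (an "operator") can watch it and act on it.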
9. Kubeflow
https://github.com/google/kubeflow/
Our goal is not to recreate other services, but to provide a straightforward
way for spinning up best of breed OSS solutions.
● A JupyterHub to create & manage interactive Jupyter notebooks
● A Tensorflow Training Controller that can be configured to use CPUs
or GPUs, and adjusted to the size of a cluster with a single setting
● A TF Serving container
10. JupyterHub
● A single hub & proxy for managing interactive sessions
● Can run entirely within Kubernetes - notebooks are backed by
Kubernetes pods
● Can request required resources (CPUs, GPUs, etc.)
● Has pluggable authentication (OAuth, KDC, etc.)
Made possible by: https://github.com/jupyterhub/kubespawner
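As a rough illustration of "notebooks are backed by Kubernetes pods", the sketch below builds the kind of Pod manifest a notebook session could map to. It is not KubeSpawner's actual code: the `jupyter-<user>` pod name, the image name, and the `nvidia.com/gpu` resource name are all assumptions made for the example.

```python
def notebook_pod(user, image, cpu, memory, gpus=0):
    """Return a Pod manifest (plain dict) for one user's notebook server."""
    requests = {"cpu": cpu, "memory": memory}
    limits = dict(requests)
    if gpus:
        # GPU resource name is an assumption; it depends on the device plugin in use.
        limits["nvidia.com/gpu"] = str(gpus)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"jupyter-{user}", "labels": {"app": "jupyterhub"}},
        "spec": {
            "containers": [
                {
                    "name": "notebook",
                    "image": image,
                    "resources": {"requests": requests, "limits": limits},
                }
            ]
        },
    }


# Hypothetical user and image, for illustration only.
pod = notebook_pod("alice", "jupyter/scipy-notebook", cpu="2", memory="4Gi", gpus=1)
```

The resource requests are what let the Kubernetes scheduler place each notebook on a node with enough CPU, memory, or GPUs.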
11. Tensorflow Training Controller
● A Kubernetes “operator” to help run distributed/non-distributed TF
training.
● Exposes an API through a CustomResourceDefinition
● Controller manages complexity of distributed training using
Tensorflow.
Made possible by: https://github.com/tensorflow/k8s
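One part of the complexity a controller can take over is telling every process in a distributed job where its peers are. TensorFlow reads this from the `TF_CONFIG` environment variable. The sketch below shows how such a value might be assembled for each replica; the `<job>-<role>-<index>` host naming and port 2222 are assumptions for illustration, not the controller's documented behaviour.

```python
import json


def tf_config(job, workers, ps, task_type, task_index, port=2222):
    """Return the TF_CONFIG JSON string for one replica of a training job."""
    cluster = {
        # Hypothetical host names: one stable DNS name per replica.
        "worker": [f"{job}-worker-{i}:{port}" for i in range(workers)],
        "ps": [f"{job}-ps-{i}:{port}" for i in range(ps)],
    }
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


# Value that would be set in the environment of the second worker.
config = tf_config("mnist", workers=3, ps=1, task_type="worker", task_index=1)
print(config)
```

Every replica gets the same `cluster` section and a different `task` section, which is tedious to maintain by hand and easy for a controller to generate.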
12. Tensorflow Serving
● A Kubernetes Deployment that can serve saved models
● Because it is a Deployment, replicas can be scaled.
Future work:
● Custom metrics & Autoscaling
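A minimal sketch of what such a Deployment could look like, built as a plain Python dict. The image name, model path, and port are placeholders, and the container arguments assume the image's entrypoint is `tensorflow_model_server`; none of this is taken from the Kubeflow manifests.

```python
import copy


def serving_deployment(name, image, model_path, replicas=1, port=8500):
    """Return a Deployment manifest (plain dict) for a model server."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "model-server",
                            "image": image,
                            # Assumes the entrypoint is tensorflow_model_server.
                            "args": [
                                f"--port={port}",
                                f"--model_base_path={model_path}",
                            ],
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


def scaled(deployment, replicas):
    """Return a copy of the manifest with a new replica count."""
    updated = copy.deepcopy(deployment)
    updated["spec"]["replicas"] = replicas
    return updated


# Placeholder image and model location, for illustration only.
base = serving_deployment("tf-serving", "example/tf-serving:latest", "gs://my-bucket/models/example")
bigger = scaled(base, 5)
```

Scaling only changes `spec.replicas`; the pod template stays the same, which is why serving fits the stock Deployment controller so well.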
13. But there were so many stages!
● Clearly there are many other challenges faced by people building
Machine Learning infrastructure.
● How do I preprocess data?
● How do I describe my pipeline?
● How do I orchestrate my pipeline?
● We have some ideas.
14. Apache Spark
● Spark on Kubernetes has been an ongoing effort since Dec 2016.
● It is being upstreamed into Spark and is expected to land in Spark 2.3
(due sometime in January).
● The changes make Spark itself aware of a new Kubernetes Scheduler
that can directly run Spark applications for the user.
16. Apache Spark
Kubernetes Scheduler for Spark
● Spark 2.3 will support
○ Running Java/Scala jobs
○ Static allocation of executors
○ Some dependency management
● Our fork (github.com/apache-spark-on-k8s/spark) has several
additional features which we’re slowly upstreaming.
○ It’s being run by several organizations right now.
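As a sketch of what submitting a job to the Kubernetes scheduler looks like, the helper below assembles a `spark-submit` command line with a fixed number of executors (static allocation). The property names follow the upstream Spark documentation and may differ in the fork; the API server address, image, and jar path are placeholders.

```python
import shlex


def spark_submit_args(api_server, image, main_class, jar, executors=2, name="spark-job"):
    """Return the argv list for a cluster-mode spark-submit against Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        # Static allocation: a fixed executor count chosen up front.
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        jar,
    ]


# Placeholder values, for illustration only.
args = spark_submit_args(
    api_server="https://kubernetes.example.com:6443",
    image="example/spark:2.3.0",
    main_class="org.apache.spark.examples.SparkPi",
    jar="local:///path/to/examples.jar",
    executors=5,
    name="spark-pi",
)
print(shlex.join(args))
```

The `local://` scheme tells Spark that the jar is already present inside the container image, which is one simple way to handle dependencies.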
17. Apache Airflow
● A DAG scheduler.
● Has a rich ecosystem of “operators” to allow interacting with different
applications.
● The community is working on a Kubernetes-native executor for Airflow.
● The executor is currently in the process of being upstreamed.
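To illustrate what "DAG scheduler" means, the sketch below orders the steps of a small pipeline so that every task runs after the tasks it depends on. It uses Python's standard `graphlib` module (Python 3.9+) and is not how Airflow is implemented; the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "preprocess": set(),
    "validate": {"preprocess"},
    "train": {"preprocess"},
    "evaluate": {"train", "validate"},
    "deploy": {"evaluate"},
}


def run_order(dag):
    """Return one valid execution order for the tasks in the DAG."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)
```

Tasks with no dependency between them, such as `validate` and `train` here, could run in parallel. With a Kubernetes-native executor, each of them could run in its own pod.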
18. Apache Airflow
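The operators can specify various Kubernetes executor constraints within each DAG step.
For example:
# Airflow 1.x import path
from airflow.operators.bash_operator import BashOperator

# `dag` is the DAG object defined elsewhere in the pipeline file.
BashOperator(
    task_id='account-test',
    bash_command='run-something.sh',
    dag=dag,
    # Per-task constraints passed through to the Kubernetes executor.
    executor_config={
        'request_memory': '128Mi',
        'limit_memory': '128Mi',
        'image': 'airflow/scipy:1.1.5'
    }
)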
19. Putting it all together
[Diagram: HDFS or GCS/S3, Spark, Airflow Pipeline, JupyterHub, Tensorflow, Other ML Frameworks]
20. Get Involved
Kubeflow
● Slack Channel (See https://github.com/google/kubeflow for joining instructions)
● Twitter (http://twitter.com/kubeflow)
● Mailing List (https://groups.google.com/forum/#!forum/kubeflow-discuss)
SIG Big Data
● Slack Channel (https://kubernetes.slack.com/messages/sig-big-data)
● Mailing list (https://groups.google.com/forum/#!forum/kubernetes-sig-big-data)
● Weekly meeting (https://github.com/kubernetes/community/tree/master/sig-big-data)