We will walk through exploring, training, and serving a machine learning model using Kubeflow's main components. We will use Jupyter notebooks on the cluster to train the model, then introduce Kubeflow Pipelines to chain the steps together and automate the entire process.
3. Make it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere
4. Why Kubeflow?
● Composability
○ Choose from existing popular tools
● Portability
○ Build using cloud native, portable Kubernetes APIs
● Scalability
○ TF already supports CPU/GPU/distributed
○ K8s scales to 5k nodes with the same stack
5. What’s in the Box?
● JupyterHub - for collaborative & interactive training
● A TensorFlow Training Controller
● A TensorFlow Serving Deployment
● Argo for workflows
● Much more
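As an illustration of the serving side, the TensorFlow Serving deployment listed above exposes a REST predict endpoint. Below is a minimal, stdlib-only client sketch; the host, port, and model name are hypothetical, not from the deck:

```python
import json
import urllib.request

# Hypothetical in-cluster address for a TensorFlow Serving deployment;
# "tf-serving.kubeflow.svc" and "my-model" are illustrative names.
url = "http://tf-serving.kubeflow.svc:8501/v1/models/my-model:predict"

# TensorFlow Serving's REST API expects a JSON body with an "instances" list.
payload = json.dumps({"instances": [[1.0, 2.0, 3.0]]})

req = urllib.request.Request(
    url,
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# The request is built but not sent here; inside the cluster you would call
# urllib.request.urlopen(req) and read the JSON "predictions" response.
print(req.get_method())  # POST
```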
8. Kubeflow is composable
Training
• Perform distributed training with TF-Jobs
• Run pipelines with regular containers as steps.
• Run pipelines with TF-Jobs and other CRDs as steps.
Serving
• KFServing, Seldon Core
• Azure ML Service and other frameworks.
10. TF-Job: Distributed Training
A distributed TensorFlow job typically contains zero or more of each of the
following process types:
• Chief: The chief is responsible for orchestrating training and performing
tasks like checkpointing the model.
• PS: Parameter servers provide a distributed data store for the model
parameters.
• Worker: The workers do the actual work of training the model. In some
cases, worker 0 might also act as the chief.
• Evaluator: The evaluators can be used to compute
evaluation metrics as the model is trained.
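On the cluster, the TF-Job controller tells each pod which of these roles it plays via the TF_CONFIG environment variable. A minimal sketch of how a training process might read its role; the cluster spec below is a hypothetical example, not output from a real controller:

```python
import json
import os

# TF_CONFIG is set by the TF-Job controller on each pod; this sample value
# describes a hypothetical cluster with one chief, one PS, and two workers.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {
        "chief": ["trainer-chief-0:2222"],
        "ps": ["trainer-ps-0:2222"],
        "worker": ["trainer-worker-0:2222", "trainer-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 1},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])
task = tf_config["task"]

# Each process decides what to do based on its assigned role.
if task["type"] == "chief":
    print("orchestrate training and write checkpoints")
elif task["type"] == "ps":
    print("serve model parameters")
elif task["type"] == "worker":
    print(f"train the model as worker {task['index']}")
elif task["type"] == "evaluator":
    print("compute evaluation metrics")
```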
12. Kubeflow Pipelines
• A user interface (UI) for managing and tracking experiments, jobs, and
runs.
• An engine for scheduling multi-step ML workflows.
• An SDK for defining and manipulating pipelines and components.
• Notebooks for interacting with the system using the SDK.
13. Anatomy of a pipeline
● Containerized implementations of ML Tasks
○ Pre-built components: just provide params or code snippets
○ Create your own components from code or libraries
○ Use any runtime, framework, or data types
○ Attach k8s objects - volumes, secrets
● Specification of the sequence of steps
○ Specified via a Python DSL
○ Inferred from data dependencies on input/output
● Input Parameters
○ A “Run” = a pipeline invoked with specific parameters
● Schedules
○ Invoke a single run or create a recurring scheduled pipeline
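The point that step order can be inferred from data dependencies can be illustrated with a small, library-free sketch (this is not the Kubeflow Pipelines SDK itself, just the idea): each step declares only its inputs and outputs, and a topological sort recovers a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps declared only by their data inputs/outputs;
# the execution order is never written down explicitly.
steps = {
    "ingest":     {"inputs": [],                    "outputs": ["raw_data"]},
    "preprocess": {"inputs": ["raw_data"],          "outputs": ["features"]},
    "train":      {"inputs": ["features"],          "outputs": ["model"]},
    "evaluate":   {"inputs": ["features", "model"], "outputs": ["metrics"]},
}

# Map each artifact to the step that produces it.
producer = {out: name for name, s in steps.items() for out in s["outputs"]}

# A step depends on whichever steps produce its inputs.
deps = {name: {producer[i] for i in s["inputs"]} for name, s in steps.items()}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['ingest', 'preprocess', 'train', 'evaluate']
```

This mirrors what the Pipelines compiler does when one step consumes another step's output: the dependency edge, and hence the ordering, falls out of the data flow.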