For large-scale data processing and Machine Learning, Apache Spark is one of the most battle-tested frameworks for handling batch or streaming workloads. Its ease of use, built-in Machine Learning modules, and multi-language support make it a very attractive choice for data wonks. However, bootstrapping and getting off the ground can be difficult for most teams unless they use a Spark cluster that is pre-provisioned and offered as a managed service in the Cloud. That is an attractive way to get going, but in the long run it can become a very expensive option if it is not well managed.
As an alternative to this approach, our team has been running Spark and all our Machine Learning workloads and pipelines as containerized Docker packages on Kubernetes. This gives us an infrastructure-agnostic abstraction layer, which improves our operational efficiency and reduces our overall compute cost. Most importantly, we can target our Spark workload deployment at any major Cloud or On-prem infrastructure (with Kubernetes as the common denominator) by modifying just a few configurations.
In this talk, we will walk you through the process our team follows to run production deployments of our Machine Learning workloads and pipelines on Kubernetes. This process lets us port our implementation seamlessly from a local Kubernetes setup on a laptop during development to either an On-prem or Cloud Kubernetes environment.
2. About MavenCode
MavenCode is an Artificial Intelligence Solutions company located in Dallas, Texas. We provide training, product development, and consulting services in the following areas:
● Provisioning Scalable Data Processing Pipelines on Cloud Infrastructure
● Development & Deployment of Machine Learning and Artificial Intelligence Platforms
● Streaming and Big Data Analytics for Edge-IoT and Sensors
3. About The Presenters
Charles Adetiloye is an ML Platforms Engineer at MavenCode. He has well over 15 years of experience building large-scale, distributed applications. He has extensive experience working and consulting with several companies implementing production-grade ML and AI platforms.
twitter.com/cadetiloye
Abiodun Akogun is a Machine Learning and Data Science Consultant at MavenCode. He has extensive experience building and deploying large-scale Machine Learning applications in different industries, including Healthcare, Finance, Telecommunications, and Insurance. He has experience solving business problems using Data Analytics, Sentiment Analysis, Topic Modelling, Named Entity Recognition (NER), Opinion Mining, Data Mining, Time Series, Spatial Statistics, and Marketing Analytics.
twitter.com/akogz
4. Agenda
▪ Overview of Machine Learning Model Deployment Workflow
▪ Various Approaches to model training, management, and serving in the Cloud
▪ Deploying Machine Learning Workloads in the Cloud
▪ Implementing Feature Storage backend for ML model training
▪ Running Spark Workloads for ML training on Kubernetes with Kubeflow
5. Overview of Machine Learning Deployment Workflow
Data Sourcing → Pre-Processing → Feature Engineering → Model Training / Evaluation → Model Scoring / Management → Model Inferencing
6. Machine Learning Workload Deployment
Data Sourcing → Pre-Processing → Feature Engineering → Model Training / Evaluation → Model Scoring / Management → Model Inferencing
Google Cloud, AWS, Azure, On-Prem
8. Overview of Machine Learning Deployment Workflow
Data Sourcing → Pre-Processing → Feature Engineering → Model Training / Evaluation → Model Scoring / Management → Model Inferencing
[Figure values: 32%, 10%, 36%, 2%, 4%, 16%]
9. A Typical Machine Learning Developer Workflow
Data Sourcing → Pre-Processing → Feature Engineering → Model Training / Evaluation → Model Scoring / Management → Model Inferencing
1. Storage: Azure Storage, Google Storage, AWS S3 Storage
Raw Data → Transformation → Processed Data (Raw Data Processing and Transformation Pipeline)
2. Compute: Google Cloud AI, AWS SageMaker, Azure ML (Cloud Training Platforms)
Data Scientists / ML Engineers work on pulling or processing data first, before starting ML training on a Managed Cloud Service.
10. What Enterprise Machine Learning Workflow in the Cloud Looks Like!
Data Sourcing → Pre-Processing → Feature Engineering
1. Storage: Azure Storage, Google Storage, AWS S3 Storage
Raw Data → Transformation → Processed Data
2. Compute
Teams: Team A, Team B, Team C, Team D
Platforms: Google Cloud AI, AWS SageMaker, AWS SageMaker, Azure ML
Running ML workflows across the enterprise, with multiple teams using different Cloud Provider technology stacks.
12. If we plan to be Cloud Neutral, can we abstract our
● Machine Learning Compute Workload → Kubernetes?
● Machine Learning Storage → Feature Store?
13. A Typical Machine Learning Developer Workflow
Data Sourcing → Pre-Processing → Feature Engineering → Model Training / Evaluation → Model Scoring / Management → Model Inferencing
1. Storage: Azure Storage, Google Storage, AWS S3 Storage
Data Source → Transformation → Processed Data
2. Compute: Google Cloud AI, AWS SageMaker, Azure ML
14. Towards A Cloud Neutral ML Deployment Environment
Data Sourcing → Pre-Processing → Feature Engineering → Model Training / Evaluation → Model Scoring / Management → Model Inferencing
1. Storage: Feature Store
2. Compute: Kubernetes
15. Why the need for Cloud Agnostic Deployment Infrastructure?
16.
● It makes it easier to migrate workloads in a Hybrid Cloud Environment
● We are not tied to a particular Cloud Infrastructure technology stack
● It’s easier to implement best-practice patterns and solutions
● Your team will have a common base denominator for all Enterprise ML workloads
● It is easy to control cost, manage utilization, and forecast demand
17. Cloud Agnostic Machine Learning Development
Data Sourcing → Pre-Processing → Feature Engineering → Model Training / Evaluation → Model Scoring / Management → Model Inferencing
1. Storage: Feature Store (Azure Storage, Google Storage, AWS S3 Storage)
2. Compute: Kubernetes
18. What’s Feature Store All about?
A Feature is a measurable, observable attribute that is part of the input to a Machine Learning Model.
[Feature Vector] X1, X2, X3, …, Xn → Model Training → Model
19. What’s Feature Store All about?
[Feature Vector] X1, X2, X3, …, Xn → Model Training → Model (Model 1)
Features are derived from
● Raw Datastore
● Streaming Datasource
● Aggregates of Raw Inputs
● Windows (mins, hourly, daily, weekly)
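As a rough illustration of the "aggregates" and "windows" sources listed above, the sketch below derives hourly features from raw events using only the Python standard library. The event layout `(entity_id, timestamp, amount)`, the function name `hourly_features`, and the sample sensor readings are assumptions made for this example; they are not taken from the talk.

```python
from collections import defaultdict
from datetime import datetime


def hourly_features(events):
    """Aggregate raw (entity_id, timestamp, amount) events into hourly features.

    Hypothetical helper: the tuple layout and feature names are illustrative.
    """
    buckets = defaultdict(list)
    for entity_id, ts, amount in events:
        # Truncate the timestamp to the start of its hourly window.
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(entity_id, window_start)].append(amount)

    # One feature row per (entity, window): count, total and mean of the raw values.
    return {
        key: {"count": len(vals), "total": sum(vals), "mean": sum(vals) / len(vals)}
        for key, vals in buckets.items()
    }


# Made-up raw readings for a single sensor.
raw_events = [
    ("sensor-1", datetime(2021, 5, 1, 9, 5), 10.0),
    ("sensor-1", datetime(2021, 5, 1, 9, 40), 14.0),
    ("sensor-1", datetime(2021, 5, 1, 10, 15), 8.0),
]
features = hourly_features(raw_events)
```

Each `(entity, window)` key maps to one slice of a feature vector; swapping the truncation step for a daily or weekly boundary would give the other window sizes mentioned on the slide.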
20. Features Change Over time!
[Figure: the feature vector X1, X2, X3, …, Xn feeding Model Training, shown at three successive points along a Time axis]
21. Machine Learning Feature Store
● Makes it easy to operationalize our ML workload, most importantly Data Management and Storage for Model training
● Features can be shared easily among teams running different Model training pipelines
● We can version datasets and track changes easily
● Consistency in Feature input attributes between Model Training and Serving
22. Types of Feature Store
● Offline Feature Store → Batch Training
● Online Feature Store → Inferencing / Serving
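To make the offline/online split concrete, here is a toy in-memory sketch. It is not how a production feature store is built; the class `ToyFeatureStore`, its method names, and the integer timestamps are all hypothetical. The offline read returns the features as they were at a given time (useful for reproducible batch training), while the online read returns the latest values for serving.

```python
from bisect import bisect_right


class ToyFeatureStore:
    """Hypothetical in-memory store that keeps every version of an entity's features."""

    def __init__(self):
        # entity_id -> list of (timestamp, feature_dict), kept sorted by timestamp
        self._history = {}

    def write(self, entity_id, timestamp, features):
        rows = self._history.setdefault(entity_id, [])
        rows.append((timestamp, dict(features)))
        rows.sort(key=lambda row: row[0])

    def get_online(self, entity_id):
        """Latest feature values, as an online store would serve them."""
        rows = self._history.get(entity_id)
        return rows[-1][1] if rows else None

    def get_offline(self, entity_id, as_of):
        """Feature values as they were at time `as_of`, for batch training."""
        rows = self._history.get(entity_id, [])
        timestamps = [ts for ts, _ in rows]
        idx = bisect_right(timestamps, as_of)
        return rows[idx - 1][1] if idx else None


store = ToyFeatureStore()
store.write("user-42", 100, {"x1": 0.2, "x2": 3})
store.write("user-42", 200, {"x1": 0.5, "x2": 4})

latest = store.get_online("user-42")                  # values used for serving
training_view = store.get_offline("user-42", as_of=150)  # values as of t=150
```

Because both reads come from the same versioned history, the feature definitions used at training time and at serving time stay consistent, which is the property the previous slide calls out.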
23. Implementing Offline Feature Storage with Apache Hudi
Sources:
● Streaming Source: data sources with streaming inputs such as MQTT, Kafka, Pub/Sub, etc.
● Batch Job Operations: batch operations on databases, file storage, distributed storage, etc.
Workflow Scheduling: orchestration with Kubeflow Pipelines or Airflow DAGs on Kubernetes
Feature Store: implemented on any of the major Cloud Storage services (Azure Storage, Google Storage, AWS S3 Storage)
24. Why did we use Apache Hudi?
● We need a Unified Platform where new data can be made available, alongside historical data, within minutes.
● We need quick computation (or derivation) of Feature vectors in order to make them available as model input.
● Incremental Versioning of our Feature collections lets us time-travel and use a particular set of features for Model training.
● Our Hudi dataset can be stored on the Azure, Google Cloud, or AWS cloud storage layer.
● It is easy to implement all our code, and everything else we need to do, with Spark and PySpark.
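The incremental-versioning point can be sketched with Hudi's Spark datasource read options. The option keys below are Hudi's documented incremental query settings; the helper names, the `spark` and `base_path` arguments, and the example instant times are assumptions for illustration, and the talk does not show the team's own read path.

```python
def hudi_incremental_read_options(begin_instant, end_instant=None):
    """Build Hudi read options that return only records committed in a time range.

    Instant times are Hudi commit timestamps, e.g. "20210501000000".
    """
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Bounding the range pins the read to a fixed slice of the table history.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def read_feature_changes(spark, base_path, begin_instant, end_instant=None):
    """Load the feature rows committed in the given range (needs a Hudi-enabled SparkSession)."""
    options = hudi_incremental_read_options(begin_instant, end_instant)
    return spark.read.format("hudi").options(**options).load(base_path)


# Example: options for features committed during May 2021 (made-up instants).
may_options = hudi_incremental_read_options("20210501000000", "20210601000000")
```

Pinning `end_instant` is what makes a training run repeatable: the same range should return the same set of feature rows even after newer commits land.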
25. Getting Data into Hudi Feature Store with Kubeflow Pipeline
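from kfp import components

# KafkaDatastreamer, ValidatorOnSchema, PreProcessor and HudiTableWriter are
# plain Python functions defined elsewhere (not shown on the slide).
# Each call wraps one function as a reusable Kubeflow Pipelines component.
KafkaDatastreamer_op = components.create_component_from_func(
    KafkaDatastreamer, base_image="python:3.7.1")
ValidatorOnSchema_op = components.create_component_from_func(
    ValidatorOnSchema, base_image="python:3.7.1")
PreProcessor_op = components.create_component_from_func(
    PreProcessor, base_image="python:3.7.1")
# The Hudi writer runs on a Spark image rather than the plain Python image.
HudiTableWriter_op = components.create_component_from_func(
    HudiTableWriter, base_image="mavencode.io/spark:v3.1.1")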
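The slide does not show the bodies of the wrapped functions. As a hedged sketch only, a function such as `ValidatorOnSchema` might look like the following; the parameter names, the JSON-string input, and the comma-separated field list are assumptions, not the presenters' implementation. Functions passed to `create_component_from_func` need to be self-contained, which is why the import sits inside the function body.

```python
def ValidatorOnSchema(record_json: str, required_fields_csv: str) -> str:
    """Return the record unchanged if it carries every required field.

    Hypothetical signature: inputs are plain strings so they can be passed
    between pipeline steps as simple parameters.
    """
    import json  # imported inside the function so the component is self-contained

    record = json.loads(record_json)
    required = [f.strip() for f in required_fields_csv.split(",") if f.strip()]
    missing = [f for f in required if f not in record]
    if missing:
        # Failing here stops the pipeline step before bad data reaches the store.
        raise ValueError("record is missing fields: " + ", ".join(missing))
    return record_json


# Local check with a made-up record, before wrapping the function as a component.
validated = ValidatorOnSchema('{"id": 1, "temp": 2.5}', "id,temp")
```

Testing the function locally like this, before it is wrapped, keeps the feedback loop short; the same function then runs unchanged inside the component's container.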
26. The Hudi Data Store writer
● Configure the Spark Session with the packages needed to run Hudi and Avro
● Hudi configuration options
● Write the data into our Hudi data store in the right format
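The writer code itself was shown as an image and is not part of this text. The sketch below follows the three steps named on the slide, under stated assumptions: the package versions (Hudi 0.8.0 with Spark 3.1.1 and Scala 2.12), the helper names, and the example table and field names are all illustrative choices, while the `hoodie.*` option keys are Hudi's documented Spark datasource write options.

```python
def hudi_spark_conf(hudi_version="0.8.0", spark_version="3.1.1", scala_version="2.12"):
    """Spark settings for step 1; the versions are assumptions and must match your cluster."""
    packages = ",".join([
        f"org.apache.hudi:hudi-spark3-bundle_{scala_version}:{hudi_version}",
        f"org.apache.spark:spark-avro_{scala_version}:{spark_version}",
    ])
    return {
        "spark.jars.packages": packages,
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    }


def build_spark_session(app_name="hudi-feature-writer"):
    """Step 1: create the session (requires pyspark to be installed when called)."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in hudi_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def hudi_write_options(table_name, record_key, partition_field, precombine_field):
    """Step 2: Hudi configuration options for an upsert."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
    }


def write_features_to_hudi(df, base_path, options):
    """Step 3: write a Spark DataFrame into the Hudi table at base_path."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical table and column names, for illustration only.
write_options = hudi_write_options(
    table_name="sensor_features",
    record_key="entity_id",
    partition_field="event_date",
    precombine_field="event_ts",
)
```

`base_path` can point at any storage scheme the Spark cluster is configured for (for example an `s3a://`, `gs://`, or `abfss://` URI), which is where the cloud-neutral storage layer comes from.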
27. Cloud Agnostic Machine Learning Development
Data Sourcing → Pre-Processing → Feature Engineering → Model Training / Evaluation → Model Scoring / Management → Model Inferencing
1. Storage: Feature Store
2. Compute: Kubernetes
Cloud Native ML Workload Deployment with Operators on Kubeflow
Cloud Native ML Training Deployment
● Containerized Workload
● Scalable + Can Run in Distributed Mode
● Efficient Compute Utilization
● Language Agnostic!
28. Machine Learning Operators with Kubeflow on Kubernetes
● A Machine Learning Operator helps with the deployment, monitoring, and management of a model training life cycle
● Some ML Operators found in Kubeflow are:
○ TF-operator → TensorFlow Job
○ PyTorch-operator → PyTorch Job
○ XGBoost-operator → XGBoost Job
○ Spark-operator → Spark and Spark ML Jobs
29. Cloud Agnostic Machine Learning Development
MLOps Model Training and Deployment Platform
[Diagram layers, top to bottom]
● Kubeflow Jupyter Notebooks (four shown), with Namespaces (four shown)
● Machine Learning Operators running with Kubeflow: Spark Operator, TensorFlow Operator
● Auto Scaling Node Pools running Kubernetes: Auto-Scalable CPU Node Pool, Auto-Scalable GPU Node Pool
● Cloud Infrastructure Layer
● Feature Store
30. Using Spark Operator for Training ML Steps
PySpark ML Code → Containerize the Python Code → Create SparkApplication Kubernetes YAML Deployment → Apply Deployment to Kubernetes
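The last three steps map onto a short sequence of CLI calls. The sketch below only assembles and optionally runs them; the image tag, build context, manifest file name, and namespace are placeholders chosen for this example, and it assumes `docker` and `kubectl` are installed and already authenticated against your registry and cluster.

```python
import subprocess


def deployment_commands(image, build_context, manifest_path, namespace):
    """Return the build, push and apply commands for a containerized PySpark job."""
    return [
        # Containerize the Python code.
        ["docker", "build", "-t", image, build_context],
        # Publish the image so cluster nodes can pull it.
        ["docker", "push", image],
        # Apply the SparkApplication manifest to the target namespace.
        ["kubectl", "apply", "-f", manifest_path, "--namespace", namespace],
    ]


def run_deployment(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Placeholder values: substitute your own registry, paths and namespace.
commands = deployment_commands(
    image="registry.example.com/ml/pyspark-train:0.1",
    build_context=".",
    manifest_path="spark-train.yaml",
    namespace="ml-team-a",
)
```

Calling `run_deployment(commands)` executes the steps; keeping command construction separate makes it easy to review or log what will run before touching the cluster.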
31. Spark Operator on Kubernetes
[Diagram: API, Scheduler, Spark Driver, Executors]
33. Deployment Configuration YAML
● Spark Application config that describes the job and the namespace where the job will run
● Container that will run our Spark ML code
● Spark Driver and Executor configuration
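The YAML from the slide is not included in this text. As a sketch of the three parts it describes, the function below builds a `SparkApplication` manifest as a Python dictionary and serializes it as JSON, which `kubectl apply -f` also accepts. The field names follow the Spark Operator's `sparkoperator.k8s.io/v1beta2` API; the job name, namespace, image, file path, service account, and resource sizes are placeholder assumptions.

```python
import json


def spark_application_manifest(name, namespace, image, main_file,
                               spark_version="3.1.1", executors=2):
    """Build a minimal SparkApplication manifest for a PySpark job."""
    return {
        # Part 1: describes the job and the namespace where it will run.
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "pythonVersion": "3",
            "mode": "cluster",
            # Part 2: the container that runs our Spark ML code.
            "image": image,
            "imagePullPolicy": "IfNotPresent",
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            "restartPolicy": {"type": "Never"},
            # Part 3: driver and executor configuration (sizes are placeholders).
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"cores": 1, "instances": executors, "memory": "2g"},
        },
    }


manifest = spark_application_manifest(
    name="pyspark-train",
    namespace="ml-team-a",
    image="registry.example.com/ml/pyspark-train:0.1",
    main_file="local:///opt/spark/work-dir/train.py",
)
manifest_json = json.dumps(manifest, indent=2)
```

Generating the manifest from code is one way to retarget the same job at a different cluster or namespace by changing a few arguments; a hand-written YAML file with the same fields works equally well.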
35. Cost comparison with Managed Cloud service on AWS
[Chart values: 30%, 100%, 15s, 66s, 6x Productivity]
Metrics: Compute Utilization Cost, Compute Startup Uptime, Team Agility & Productivity
Legend: Managed Services Running on AWS; Kubeflow + S3 Feast Storage ML workload
36. Summary
● Implementing a Cloud-neutral ML deployment approach simplifies most of the complexities in a Multi-Cloud environment
● After the initial learning-curve hump, overall team efficiency improves significantly
● Teams are not locked in to a particular Cloud Infrastructure stack
● It is easy to control cost and forecast future capacity demands
38. Thank You!
If you are interested in learning more about how to run your Machine Learning Workloads on any Cloud Infrastructure or On-prem, reach out to us.
Drop us a mail: hello@mavencode.com
Visit Us Online
https://www.mavencode.com
Follow Us
https://www.twitter.com/mavencode