This document discusses using ScyllaDB as the data store for machine learning workflow pipelines processing IoT device data on Kubernetes. It describes SmartDeployAI's goal of creating reusable AI/ML pipelines and the challenges of previous approaches using Cassandra. ScyllaDB allows building cloud native ML pipelines that can efficiently run multiple workflows on Kubernetes and store model metadata, hyperparameters, and inference results for real-time analysis of IoT sensor data. Examples of computer vision pipelines for object detection and scene parsing are provided.
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Applications on Kubernetes with Scylla
1. Simplifying the Creation of ML Workflow Pipelines for IoT Applications on Kubernetes with ScyllaDB
Timo Mechler, Product Manager
Charles Adetiloye, ML Platform Engineer
2. Presenters
Timo Mechler, Product Manager & Architect
Timo Mechler is a Product Manager and Architect at SmartDeployAI. He has close to a
decade of financial data modeling experience working as both an analyst and
strategist in the energy commodities sector. At SmartDeployAI he now works closely
with product development and engineering teams to solve interesting data modeling
challenges.
Charles Adetiloye, ML Platform Engineer
Charles is a lead ML platforms engineer at SmartDeployAI. He has well over 15 years
of experience building large-scale distributed applications. He has always been
interested in building distributed, event-driven systems that are composable from
independent asynchronous subsystems. He has extensive experience working with
Kubernetes and NoSQL databases such as ScyllaDB and Cassandra.
3. About SmartDeployAI
At SmartDeployAI, we develop software platforms and frameworks for
running and deploying AI and ML workflows.
Our primary focus is on
- Increasing productivity and agile team release cycles
- Increasing collaboration and visibility between team members
- Shareable and reusable AI and ML workflow pipeline components
9. IoT Devices - Group or Cluster of Devices
Customer 1
Customer 2
Customer 3
Customer 4
10. Generalized IoT Pipeline for AI and ML
- Data ingestion at scale: streaming data, ingested through a secure input endpoint
- Data processing pipeline: cleans and formats the raw ingested data
- Data lake and data warehouse: where organized and cleansed data are stored
- AI and analytics: model training, deployment, analytics, and insights
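The four stages above can be sketched as composable functions. This is a minimal, illustrative stand-in (all function names, the CSV payload format, and the in-memory "datastore" are assumptions for the sketch, not the production pipeline):

```python
from typing import Any, Dict, List

def ingest(raw_events: List[str]) -> List[Dict[str, Any]]:
    """Stage 1: parse streaming payloads arriving at the ingestion endpoint."""
    parsed = []
    for line in raw_events:
        device_id, value = line.split(",")
        parsed.append({"device_id": device_id, "value": float(value)})
    return parsed

def process(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Stage 2: clean/format the raw data -- here, drop malformed readings."""
    return [e for e in events if e["value"] >= 0]

def store(events: List[Dict[str, Any]], datastore: list) -> list:
    """Stage 3: persist organized rows (stand-in for a ScyllaDB write)."""
    datastore.extend(events)
    return datastore

def analyze(datastore: list) -> Dict[str, float]:
    """Stage 4: analytics/insights -- a simple per-device average."""
    totals: Dict[str, List[float]] = {}
    for row in datastore:
        totals.setdefault(row["device_id"], []).append(row["value"])
    return {d: sum(v) / len(v) for d, v in totals.items()}

datastore: list = []
events = ingest(["dev-1,20.5", "dev-1,21.5", "dev-2,-1.0"])
store(process(events), datastore)
print(analyze(datastore))  # {'dev-1': 21.0}
```

Each stage consumes the previous stage's output, which is what lets the stages be swapped out or scaled independently once containerized.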
12. Our Goal!
- Create a workflow pipeline that abstracts the whole process of provisioning IoT pipelines
- Efficient utilization of compute resources
- Support for multi-tenant deployment of workflow pipelines on a Kubernetes cluster
- Quick instantiation of new workflow pipelines from a deployment config
- Quick access to ingested datasets for near real-time inference and model retraining
- Store model metadata and hyperparameters from training for model retraining
- Super-fast aggregation, rollup, or grouping of results over a given time window
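To make the last goal concrete, here is a minimal time-window rollup in plain Python (the windowing scheme, tuple layout, and averaging are illustrative assumptions; in practice this is the kind of query that a clustering key on timestamp makes cheap in ScyllaDB):

```python
from collections import defaultdict

def rollup(events, window_secs=60):
    """Group (timestamp, device_id, value) readings into fixed time windows
    and aggregate per (window, device)."""
    buckets = defaultdict(list)
    for ts, device_id, value in events:
        window = ts - (ts % window_secs)   # floor timestamp to window start
        buckets[(window, device_id)].append(value)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

events = [
    (100, "dev-1", 2.0),
    (110, "dev-1", 4.0),   # same 60 s window as ts=100
    (200, "dev-1", 6.0),   # next window
]
print(rollup(events))  # {(60, 'dev-1'): 3.0, (180, 'dev-1'): 6.0}
```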
13. IoT Stream Ingestion Pipeline - 2014
Diagram: Ingestion → Process → Store → Analyze Data → ML Learning
14. IoT Stream Ingestion Pipeline - 2014
Pros
- Scales to support many devices
- Easy path toward ML deployment
- High write throughput
Cons
- Not easily scalable
- Very expensive setup
- We still had downtime
- Cassandra needed occasional tuning
- Bootstrapping a new environment took a while
15. IoT Stream Ingestion Pipeline- 2017
Diagram: Ingestion → Process → Store → Analyze Data → ML Learning, built on Akka Streams, with each stage running as Kubernetes pods
16. IoT Stream Ingestion Pipeline - 2017
Pros
- Scales to support many devices
- Easy path toward ML deployment
- High write throughput
- Efficient compute resource utilization
- Easily scalable
Cons
- Datastore not easily scalable
- We still had downtime with Cassandra
- Bootstrapping our Cassandra datastore was still a pain point
- Entire workflow not easily cloneable or reproducible
17. IoT Stream Ingestion Pipeline - 2017
Diagram: Ingestion → Process → Store → Analyze Data → ML Learning, built on Akka Streams, with each stage running as Kubernetes pods
18. IoT Stream Ingestion Pipeline - 2017
Pros
- Scales to support many devices
- Easy path toward ML deployment
- High write throughput
- Efficient compute resource utilization
- Easily scalable deployment pipeline
Cons
- The Cassandra JVM is extremely greedy: >= 60% of resources
- Bootstrapping Cassandra pods took over 6000 ms
- Entire workflow not easily cloneable or reproducible
20. Why did we go with ScyllaDB?
- Drop-in replacement for Cassandra
- Low memory footprint - VERY important on Kubernetes
- More than 8x faster than Cassandra
- Easy to containerize and deploy as a Kubernetes pod
- We could easily run it as part of our ML workflow pipeline
25. Scene Parsing, Object Detection and Counting Pipeline
Pipeline Workflow
- Time-lapse camera capturing an event stream onsite
- Time-stamped keyframes from the video streams are tagged and uploaded as images to the cloud
- AI models perform real-time analytics of key objects/entities in the image scene: workers onsite, trucks, cranes, etc.
Workflow Output
- Trigger a notification whenever an event of interest occurs, e.g. daily activity start time or equipment delivery
- Daily report notification generated from the AI model, delivered by email or SMS
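The trigger step above can be sketched as a simple rule over the per-frame entity counts the models emit. This is a simplified stand-in (the function name, frame tuple layout, and "first appearance" rule are assumptions, not the deployed trigger logic):

```python
def events_of_interest(frames, prev_seen=None):
    """Scan time-ordered (timestamp, entity_counts) inference results and
    emit a notification the first time each tracked entity appears --
    e.g. the first truck of the day suggests equipment was delivered."""
    notifications = []
    seen = set(prev_seen or [])
    for ts, counts in frames:
        for entity, n in counts.items():
            if n > 0 and entity not in seen:
                notifications.append((ts, f"first {entity} detected"))
                seen.add(entity)
    return notifications

frames = [
    (800, {"person": 2, "truck": 0}),
    (805, {"person": 3, "truck": 1}),
]
print(events_of_interest(frames))
# [(800, 'first person detected'), (805, 'first truck detected')]
```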
26. Scene Parsing, Object Detection and Counting Pipeline
Diagram (steps 1-6 across two pipelines):
Pipeline 1: event payload → payload processor → tagged object counting → daily analytics → trigger event notification
Pipeline 2 (model serving pipeline): model training → ML training & deployment
ScyllaDB datastore holds:
- Model metadata
- Inference metrics
- Inference results
Entities detected are stored in database tables & materialized views with columns:
- uuid
- entity_person_count
- entity_crane_count
- entity_truck_count
- location
- timestamp
27. Scene Parsing, Object Identification and Contextual Modification
Pipeline Workflow
- Ingest images of a room view: living room, bedroom, kitchen, etc.
- Use an AI model to identify the room type
- Identify the walls in the room and allow users to specify the color scheme
Workflow Output
- Modified image output with painted walls
28. Scene Parsing, Object Identification and Contextual Modification
Diagram (steps 1-5): image scene detection model → routed to a room-specific model (bedroom model, living room model, or kitchen model) → post processing, with ML training & deployment backed by the ScyllaDB datastore, which holds:
- Model metadata
- Inference metrics
- Inference results
32. Hyper-Param and ML Metadata Store
METADATA STORE (shared by component-1, component-2, component-3):
param1 => [a1, b1, ... n1]
param2 => [a2, b2, ... n2]
param3 => [a3, b3, ... n3]
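Expanding those per-parameter candidate lists into one configuration per run, so that parallel runs and A/B comparisons each get recorded run parameters, can be sketched like this (the function name and dict layout are illustrative assumptions):

```python
from itertools import product

def expand_runs(param_space):
    """Expand {param: [candidate values]} into one config dict per
    combination -- the run parameters an experiment would record per run."""
    names = sorted(param_space)
    return [dict(zip(names, combo))
            for combo in product(*(param_space[n] for n in names))]

space = {"param1": [0.01, 0.1], "param2": ["a", "b"]}
runs = expand_runs(space)
print(len(runs))   # 4 configurations
print(runs[0])     # {'param1': 0.01, 'param2': 'a'}
```

Each resulting config, together with its metrics and artifacts, is what gets written to the metadata store per run.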
33. Materialized Views of Tables to Display Relevant Info
Event Info fields:
- device_id
- reg_id
- group_id
- cust_id
- model_id
- event_id
- lat
- lng
- pay_load
- checksum
- timestamp
TABLE: device_event_tbl
CREATE TABLE indoor_sensor (
    device_id uuid,
    reg_id uuid,
    group_id uuid,
    cust_id uuid,
    model_id uuid,
    event_id uuid,
    lat float,
    lng float,
    pay_load_size bigint,
    checksum bigint,
    timestamp timestamp,
    PRIMARY KEY (device_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
VIEW: indoor_sensor_group
CREATE MATERIALIZED VIEW indoor_sensor_group AS
    SELECT device_id, group_id, timestamp, lat, lng FROM indoor_sensor
    WHERE group_id IS NOT NULL AND device_id IS NOT NULL AND timestamp IS NOT NULL
    PRIMARY KEY (group_id, device_id, timestamp);
Diagram steps 1-4: ingested events are written to the denormalized table, materialized views derive the relevant subviews, and the serialized views feed the DASHBOARD
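The write path into that table starts by flattening each raw device event, plus registry lookups, into one wide row. A minimal sketch (the registry shape, checksum scheme, and field names beyond the indoor_sensor columns are assumptions for illustration):

```python
import hashlib
import json

def denormalize(raw, registry):
    """Flatten a raw device event into one wide, denormalized row matching
    the indoor_sensor table, resolving ids via a device registry lookup."""
    meta = registry[raw["device_id"]]  # reg_id / group_id / cust_id lookup
    payload = json.dumps(raw["payload"], sort_keys=True).encode()
    return {
        "device_id": raw["device_id"],
        "reg_id": meta["reg_id"],
        "group_id": meta["group_id"],
        "cust_id": meta["cust_id"],
        "lat": raw["lat"],
        "lng": raw["lng"],
        "pay_load_size": len(payload),
        "checksum": int(hashlib.md5(payload).hexdigest()[:8], 16),
        "timestamp": raw["ts"],
    }

registry = {"d1": {"reg_id": "r1", "group_id": "g1", "cust_id": "c1"}}
row = denormalize({"device_id": "d1", "lat": 1.0, "lng": 2.0,
                   "ts": 1000, "payload": {"temp": 21}}, registry)
print(row["group_id"], row["pay_load_size"])
```

Writing rows in this shape is what lets the materialized views serve per-group dashboard queries without joins.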
34. Thank you! Stay in touch.
Any questions?
Charles Adetiloye
charles@smartdeploy.ai
@cadetiloye
Timo Mechler
timo@smartdeploy.ai
Connect with us on Slack
http://bit.ly/ai-pipelines
Editor's Notes
Simplifying the creation of ML workflow pipelines
IoT devices generate a continuous stream of time-bounded events
Ingestion process: building a time-series pipeline with Kafka, Spark, and Cassandra
Because of the pipeline abstraction we have created, each workflow's artifacts can run in a shared Kubernetes environment in different namespaces
The ScyllaDB Operator is used to instantiate a dedicated DB for each pipeline that can be scaled independently
Result: more efficient utilization of resources, and capacity planning is easy for us
Use ScyllaDB to store hyperparameters for model training
We instantiate the pipeline and create experiments with each run and their run parameters
The metadata (and artifacts) for each run are stored in the METADATA store
With this we can do quick A/B testing and multiple parallel runs
We receive events from all these devices
Transform the payload into a highly denormalized form
Write the denormalized data into ScyllaDB
Using materialized views we can create different subviews of the dataset that we serialize and use to populate the dashboards