Yaroslav Ravlinko, "Data Science at scale. Next generation data processing platforms", Lviv DevOps Conference 2019
1. Privileged and confidential
Data Science at Scale
October 2019
Next generation data processing platforms
Solution Architect
yravlinko@griddynamics.com
4.
Hidden Tech Debt of ML/DS System
Diagram labels: Configuration, Data collection, Feature extraction, Data verification, Machine resource management, Process management tools, Analysis tools, Serving infrastructure, Monitoring, ML core
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
5.
Data Science
Diagram labels (same diagram as on slide 4): Configuration, Data collection, Feature extraction, Data verification, Machine resource management, Process management tools, Analysis tools, Serving infrastructure, Monitoring, ML core
6.
+Data Engineering
Diagram labels (same diagram as on slide 4): Configuration, Data collection, Feature extraction, Data verification, Machine resource management, Process management tools, Analysis tools, Serving infrastructure, Monitoring, ML core
9.
Machine Learning and Data Processing Workflow
Developer's point of view
Data Science/ML platform
Diagram labels: Data ingestion; Feature engineering; Model selection, validation; Serving, production; Prototyping, training
10.
Revisited Machine Learning and Data Processing Workflow
Ops engineer's point of view
Data Science/ML platform
Diagram labels: Data ingestion; Data processing; Insight serving, production; Something important; Scheduler; Workflow management; ML magic
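From the ops point of view, the scheduler and workflow manager mainly have to run stages in an order that respects their dependencies. The following is a minimal sketch of that idea using Python's standard `graphlib` module. The stage names are taken from the slide, but the dependency edges between them are an assumption, and this is not how Argo or any particular scheduler is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stage names follow the slide; the dependency edges are an assumption.
# Each key maps to the set of stages that must finish before it starts.
STAGES = {
    "data_ingestion": set(),
    "data_processing": {"data_ingestion"},
    "ml_magic": {"data_processing"},
    "insight_serving": {"ml_magic"},
}


def run_order(graph):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def run(graph, actions):
    """Call each stage's action in dependency order and return the results."""
    log = []
    for stage in run_order(graph):
        log.append(actions[stage]())
    return log


# Placeholder actions: a real platform would launch containers or jobs here.
actions = {name: (lambda n=name: f"ran {n}") for name in STAGES}
for line in run(STAGES, actions):
    print(line)
```

Keeping the graph as plain data separates "what depends on what" from "how a stage is executed", which is roughly the split between workflow definition and scheduler that the slide points at.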
12.
Decision tree
Questions: Are your services relying on HDFS as persistent storage? Are your tasks mostly ETL-like? Do you mostly need to run and deploy apps?
Branch labels: ETL > Apps; ETL < Apps; YES; NO
14.
DS/ML Platform blueprint components
・ UI and exposed API/Contracts
・ Integrations with third-party service providers
・ Platform/Engine to set up, manage and execute business logic
・ Data Science and ML code
15.
DS/ML Platform blueprint components
・ Application runtimes and serving
・ MLP UI/API
・ Sandbox
・ ‘Big’ data processing toolset
・ Data Science and Machine Learning toolset
・ Release management
・ Data ingestion system
・ Resource management system
・ Encryption, secret management
・ Infrastructure (VM, Network, Disk, GPU)
・ Scheduler and workflow management
・ User management
・ Monitoring/log management
17.
MVP on GCP
・ MongoDB + REST facade
・ kubectl, k8s UI
・ GCP DataLab
・ BigQuery, Cloud ML Engine
・ Python Code
・ Argo
・ GCP Kubernetes Cluster
・ GCP VM, Cloud Storage, Persistent Disk
・ Argo CLI, Argo UI
・ G Suite + K8s RBAC
・ GCP Stackdriver, K8s logs
・ Apache Beam, Google Dataflow
・ Google Pub/Sub, Custom connectors
・ GCP BigQuery, Google Cloud Storage
18.
Allocation
Columns: Ingest (Data Platform), ML Processing (Training), Serving
ML Platform
Diagram labels: BigQuery tables, Data Bucket, Cloud Datalab, Custom framework, Cloud Machine Learning, Container registry, Custom application, Argo, Kubernetes, Persistent Disk, Cloud Pub/Sub
19.
Integration with Data Platform
Columns: ML Processing (Training), Serving
ML Platform
Diagram labels (ML side): Cloud Datalab, Custom framework, Cloud Machine Learning, Container registry, Custom application, Argo, Kubernetes, Persistent Disk
Diagram labels (data side): Data Platform API, Data Processing, Cloud Dataflow, GCS Data Bucket, GCS preprocessing bucket, Cloud Pub/Sub, Ingest (internal)
Data Sources (external): Adobe, Experian, Facebook, Interflora, SAS, Calyx
Data flow labels: BG Tables, Objects, BigQuery tables
21.
Use case
Data sources: SQL, #NoSQL, Other
Storage and service labels: On-premise services, HDFS, HDFS API (Google storage), Google Persistent Disk, Google storage, HBase API, BigTable, ALS-API, GCP services
Workflow/Scheduler: Argo, k8s
Workflow stages: ETL, Training, Serving, Validation
Arrow labels: GET, Produce, GET/Produce, Deploy, Post, Copy (steps numbered 1 to 9)
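To make the role of Argo in this use case more concrete, here is a hedged sketch that builds an Argo Workflow manifest for the four stages named on the slide. The `apiVersion`, `kind`, `entrypoint`, `templates`, and `dag.tasks` fields follow the Argo Workflows resource format; the strictly linear dependency chain, the container image name, and the `--stage` argument are hypothetical placeholders, not details from the talk.

```python
import json


def build_workflow(stages, image):
    """Build an Argo Workflow manifest that runs `stages` one after another."""
    tasks = []
    previous = None
    for stage in stages:
        task = {"name": stage, "template": stage}
        if previous is not None:
            # Assumption: each stage depends only on the one before it.
            task["dependencies"] = [previous]
        tasks.append(task)
        previous = stage

    # One DAG template as the entrypoint, plus one container template per stage.
    templates = [{"name": "pipeline", "dag": {"tasks": tasks}}]
    for stage in stages:
        templates.append({
            "name": stage,
            # Hypothetical image and arguments, for illustration only.
            "container": {"image": image, "args": ["--stage", stage]},
        })

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "ml-pipeline-"},
        "spec": {"entrypoint": "pipeline", "templates": templates},
    }


manifest = build_workflow(
    ["etl", "training", "serving", "validation"],
    image="example.invalid/ml-stage:latest",
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from a list of stages keeps the pipeline definition in code, so adding a stage or changing a dependency is a small, reviewable change.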
22.
MVP on GCP and on-premise Datacenter
・ Scala REST facade
・ kubectl, k8s UI
・ JupyterHub
・ MLflow
・ Python Code
・ Argo
・ GCP Kubernetes Cluster
・ GCP VM, Cloud Storage, Persistent Disk
・ Web UI (Custom App)
・ G Suite + K8s RBAC, ADFS 2.0
・ GCP Stackdriver, K8s logs, ELK, Prometheus
・ Apache Spark
・ Google Pub/Sub, Custom connectors
・ BigTable, Redis
・ On-premise Hadoop Cluster
23.
Allocation
Columns: Ingest (Data Platform), ML Processing (Training), Serving
ML Platform
Diagram labels: BigQuery Tables (Feature Store), Data Bucket, Container registry, Custom application, Argo, Kubernetes, Persistent Disk, Cloud Pub/Sub, On-premise HDFS cluster, DWH, Kafka cluster, MLflow, Custom ML code (Python), Spark on k8s, Custom ML workflow UI, JupyterHub
25.
Demo: Recommendation System
Diagram labels: same diagram as the use case on slide 21
26.
Some numbers
・ Development time reduced by 90%
・ More efficient use of resources (VMs, disk, network)
ー Resource usage reduced by up to 70% using k8s autoscaling and ephemeral objects
・ Faster release of new models (from a month to hours)
・ “ETL-Model Training-Serving” workflow time reduced from 24 hours to 3 hours
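As a quick check on the last figure, the sketch below converts the "24 hours to 3 hours" workflow time into a percentage reduction and a speedup factor. Only the two durations come from the slide; the helper names are illustrative.

```python
def reduction_pct(before, after):
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


def speedup(before, after):
    """How many times faster the new duration is."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


# Workflow time from the slide: 24 hours before, 3 hours after.
print(f"{reduction_pct(24, 3):.1f}% less time")  # 87.5% less time
print(f"{speedup(24, 3):.0f}x faster")           # 8x faster
```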
27.
Some conclusions
・ We see some pivoting from Hadoop-only solutions to more general-purpose solutions such as Kubernetes (Kubeflow), GCP ML, Amazon ML
・ Back to SQL as the main interface for working with DS/ML platforms
・ ML/DS solutions are still between the “genesis” and “product” stages of evolution
・ It is fun, but sometimes too much ;)
29.
Founded in 2006, Grid Dynamics is an engineering services company
built on the premise that cloud computing is disruptive within the
enterprise technology landscape