Data Science at Scale
Privileged and confidential
October 2019
Next generation data processing platforms
Solution Architect
yravlinko@griddynamics.com
About me
Yaroslav Ravlinko
Solution architect, Grid Dynamics, Lviv, Ukraine
Problem Definition
Hidden Tech Debt of ML/DS System
Components in the figure: Configuration, Data collection, Feature extraction, Data verification, Machine resource management, Process management tools, Analysis tools, Serving infrastructure, Monitoring, ML core
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box
in the middle. The required surrounding infrastructure is vast and complex.
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Data Science
Components in the figure: Configuration, Data collection, Feature extraction, Data verification, Machine resource management, Process management tools, Analysis tools, Serving infrastructure, Monitoring, ML core
+ Data Engineering
Components in the figure: Configuration, Data collection, Feature extraction, Data verification, Machine resource management, Process management tools, Analysis tools, Serving infrastructure, Monitoring, ML core
Ops
Components in the figure: Configuration, Feature extraction, Machine resource management, Process management tools, Analysis tools, Serving infrastructure, Monitoring, ML core, Data collection, Data verification
Development and Release Process
Machine Learning and Data Processing Workflow
Developer's point of view
Data Science/ML platform stages: Data ingestion; Feature engineering; Model selection, validation; Serving, production; Prototyping, training
Revisited Machine Learning and Data Processing Workflow
Ops engineer's point of view
Data Science/ML platform labels: Data ingestion; Data processing; Insight serving, production; Something important; Scheduler; Workflow management; ML magic
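From the ops engineer's point of view, the platform is mostly a scheduler that runs workflow steps in dependency order, with the ML part treated as an opaque step. A minimal sketch of that idea in Python follows. The stage names are borrowed from the slide, while the dependency edges, the `run_workflow` helper, and the placeholder tasks are assumptions made for illustration, not something shown in the talk.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the set of stages it depends on.
# The edges below are assumed; the slide only names the stages.
WORKFLOW = {
    "data_ingestion": set(),
    "data_processing": {"data_ingestion"},
    "ml_magic": {"data_processing"},      # opaque step from the ops view
    "insight_serving": {"ml_magic"},
}


def run_workflow(dag, tasks):
    """Run every task once, only after all of its dependencies have run."""
    executed = []
    for stage in TopologicalSorter(dag).static_order():
        tasks[stage]()
        executed.append(stage)
    return executed


# Placeholder tasks: a real platform would launch containers or jobs here.
TASKS = {name: (lambda name=name: print(f"running {name}")) for name in WORKFLOW}

order = run_workflow(WORKFLOW, TASKS)
```

A workflow engine such as Argo, which the later slides use, expresses the same dependency graph declaratively and runs each step as a container instead of a local function call.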
Some solutions
Decision tree
Are your services relying on HDFS as persistent storage?
Are your tasks mostly ETL-like? (ETL > Apps)
Do you mostly need to run and deploy apps? (ETL < Apps)
Branch labels in the diagram: YES, NO
Blueprint
DS/ML Platform blueprint components
UI and exposed API/Contracts
Integrations with third-party service providers
Platform/Engine to set up, manage, and execute business logic
Data Science and ML code
DS/ML Platform blueprint components
Application runtimes and serving
MLP UI/API
Sandbox
‘Big’ data processing toolset
Data Science and Machine Learning toolset
Release management
Data ingestion system
Resource management system
Encryption, secret management
Infrastructure (VM, Network, Disk, GPU)
Scheduler and workflow management
User management
Monitoring/log management
Blueprint: ML Platform on GCP
MVP on GCP
MongoDB + REST facade
kubectl, k8s UI
GCP DataLab
BigQuery, Cloud ML Engine
Python Code
Argo
GCP Kubernetes Cluster
GCP VM, Cloud Storage, Persistent Disk
Argo CLI, Argo UI
G Suite + K8s RBAC
GCP Stacktrace, K8s logs
Apache Beam, Google DataFlow
Google Pub/Sub, Custom connectors
GCP BigQuery, Google Cloud Storage
Allocation
ML Platform phases: Ingest (Data Platform), ML Processing (Training), Serving
Components: BigQuery Tables, Data Bucket, Cloud Datalab, Custom framework, Cloud Machine Learning, Container registry, Custom application, Argo, Kubernetes, Persistent Disk, Cloud Pub/Sub
Integration with Data Platform
ML Platform phases: ML Processing (Training), Serving
ML Platform components: Cloud Datalab, Custom framework, Cloud Machine Learning, Container registry, Custom application, Argo, Kubernetes, Persistent Disk
Other diagram labels: ML Platform, Data Platform API, Data Processing, Cloud Dataflow, GCS Data Bucket, GCS preprocessing bucket, Cloud Pub/Sub, Ingest (internal)
Data Sources (external): Adobe, Experian, Facebook, Interflora, SAS, Calyx
Storage labels: BG Tables, Objects, BigQuery tables
Blueprint: ML Platform on Hybrid Cloud
Use case
Data sources: SQL, NoSQL, Other
On-premise services: HDFS
GCP services: HDFS API (Google storage), Google Persistent Disk, Google storage, HBase API, BigTable, ALS-API
Workflow/Scheduler: Argo on k8s
Workflow steps: ETL, Training, Serving, Validation
Arrow labels in the diagram: GET, Produce, GET/Produce, Deploy, Post, Copy (numbered 1 to 9)
MVP on GCP and on-premise Datacenter
Scala REST facade
kubectl, k8s UI
JupyterHub
MLflow
Python Code
Argo
GCP Kubernetes Cluster
GCP VM, Cloud Storage, Persistent Disk
Web UI (Custom App)
G Suite + K8s RBAC, ADFS 2.0
GCP Stacktrace, K8s logs, ELK, Prometheus
Apache Spark
Google Pub/Sub, Custom connectors
BigTable, Redis
On-premise Hadoop Cluster
Allocation
ML Platform phases: Ingest (Data Platform), ML Processing (Training), Serving
Components: BigQuery Tables (Feature Store), Data Bucket, Container registry, Custom application, Argo, Kubernetes, Persistent Disk, Cloud Pub/Sub, On-premise HDFS cluster, DWH, Kafka cluster, MLflow, Custom ML code (Python), Spark on k8s, Custom ML workflow UI, JupyterHub
Demo
Demo: Recommendation System
Diagram: the same data flow as on the "Use case" slide (data sources, on-premise HDFS, GCP services, and the Argo workflow on k8s with ETL, Training, Serving, and Validation steps).
Some numbers
・ Development time reduced by 90%
・ More efficient use of resources (VMs, disk, network)
ー Resource usage reduced by up to 70% through k8s autoscaling and ephemeral objects
・ Release time for a new model cut from a month to hours
・ “ETL-Model Training-Serving” workflow time reduced from 24 hours to 3 hours
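To put the last figure in the same terms as the percentage claims, going from 24 hours to 3 hours is an 87.5% reduction, or an 8x speedup. A small Python sketch of that arithmetic is shown here; the helper names are illustrative and do not come from the talk.

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which a duration shrank, relative to the original value."""
    return (before - after) / before * 100.0


def speedup(before: float, after: float) -> float:
    """How many times faster the new duration is than the old one."""
    return before / after


# Workflow time quoted on the slide: 24 hours before, 3 hours after.
workflow_reduction = reduction_percent(24, 3)  # 87.5
workflow_speedup = speedup(24, 3)              # 8.0
```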
Some conclusions
・ We see a shift from Hadoop-only solutions to more general-purpose solutions such as Kubernetes (Kubeflow), GCP ML, and Amazon ML
・ SQL is returning as the main interface for working with DS/ML platforms
・ ML/DS solutions are still between the “genesis” and “product” stages of evolution
・ It is fun, but sometimes too much ;)
Q & A
Founded in 2006, Grid Dynamics is an engineering services company
built on the premise that cloud computing is disruptive within the
enterprise technology landscape
Thank you!
www.griddynamics.com

Yaroslav Ravlinko, «Data Science at scale. Next generation data processing platforms», Lviv DevOps Conference 2019
