SlideShare ist ein Scribd-Unternehmen logo
1 von 2
Downloaden Sie, um offline zu lesen
HORIZONTALLY SCALABLE ML PIPELINES WITH A FEATURE STORE
Alexandru A. Ormenis
, an 1
Mahmoud Ismail 1
Kim Hammar 2
Robin Andersson 2
Ermias Gebremeskel 2
Theofilos Kakantousis 2
Antonios Kouzoupis 2
Fabio Buso 2
Gautier Berthou 2 3
Jim Dowling 1 2
Seif Haridi 1 3
ABSTRACT
Machine Learning (ML) pipelines are the fundamental building block for productionizing ML models. However,
much introductory material for machine learning and deep learning emphasizes ad-hoc feature engineering and
training pipelines to experiment with ML models. Such pipelines have a tendency to become complex over time
and do not allow features to be easily re-used across different pipelines. Duplicating features can even lead to
correctness problems when features have different implementations for training and serving. In this demo, we
introduce the Feature Store as a new data layer in horizontally scalable machine learning pipelines.
1 INTRODUCTION
In this demonstration, we introduce an open-source data
platform with support for machine learning (ML) pipelines,
Hopsworks, where every stage in the pipeline can be scaled
out to potentially hundreds of servers, enabling faster train-
ing of larger models with more data. An end-to-end machine
learning pipeline covers all stages in the production and op-
eration of ML models, from data ingestion and preparation,
to experimentation, to training, to deployment and operation.
Our demonstration will have the added twist of introducing
a new data management layer, the Feature Store. We will
show how the Feature Store simplifies such pipelines, pro-
viding an API for data engineers to produce features and
an API with which data scientists can easily select features
when designing new models. Our Feature Store is based on
open-source frameworks, see Figure 3, using Apache Spark
to compute features, Apache Hive to store feature data, and
MySQL Cluster (NDB) for feature metadata. Our Feature
Store also supports generating training data, from selected
features, in native file formats for TensorFlow, Keras, and
PyTorch. Our Feature Store is integrated in a multi-tenant
security model that enables the governance and reuse of
features as first-class entities within an organization. We
also provide Python API support for both the Feature Store
and for building pipelines in our framework. APIs ease the
programmer’s burden when discovering and using services
such as Kafka, the Feature Store, our REST API, and model
serving, as well as when performing distributed model train-
ing and hyperparameter optimization.
1
KTH Royal Institute of Technology, Sweden 2
Logical Clocks
AB, Sweden 3
RISE AB, Sweden. Correspondence to: Alexandru
A. Ormenis
, an <aaor@kth.se>.
Proceedings of the 2nd
SysML Conference, Palo Alto, CA, USA,
2019. Copyright 2019 by the author(s).
Figure 1. End-to-End ML Pipeline, orchestrated by Airflow
2 END-TO-END ML PIPELINES
In ML pipelines on Hopsworks, see Figure 1, horizontal
scalability is provided using Apache Spark for feature engi-
neering, Apache Hive for storing feature data, HopsFS (Ni-
azi et al., 2017) (a next generation HDFS) as the distributed
filesystem for training data, Ring-AllReduce using Tensor-
Flow for distributed training, and Kubernetes to serve mod-
els. We also support Kafka for storing logs for ML models
served from Kubernetes, and Spark streaming applications
for processing those logs in near-realtime. Distributed train-
ing and hyperparameter optimization with TensorFlow and
PyTorch use PySpark to distribute computations, replicated
conda environments to ensure Python libraries are available
at all hosts, and YARN to allocate GPUs, CPUs, and mem-
ory to applications. Our pipelines are driven and managed
by a scalable, consistent, distributed shared memory service
built on MySQL Cluster.
2.1 Demonstration
In our live demonstration, we will use a managed version of
Hopsworks running from a web browser at www.hops.site.
We will start with an Airflow job orchestrating an entire ML
pipeline that consists of a series of different jobs, all im-
plemented in Jupyter notebooks and invoked from Airflow
by an operator that calls the REST API to Hopsworks, see
Horizontally Scalable ML Pipelines with a Feature Store
Figure 2. Hopsworks Data and ML Platform Architecture
Figure 3. Feature Store Architecture
Figure 2. We will then go through the Jupyter notebooks in-
dividually. We will start with discovering and downloading a
ML dataset to work with - Hopsworks supports peer-to-peer
sharing of large datasets. We will then write/run a Spark
job to engineer features from the dataset. We also support
Beam/Flink as jobs, so we can show an additional data vali-
dation job using TensorFlow Extended (TFX) (Baylor et al.,
2017). After the feature engineering stage, the features
will be published to our Feature Store. The next stage is
where a Data Scientist will write a notebook to select the
features that will be used to train a model. This job will out-
put training data in a chosen file format (such as .tfrecords,
.csv, .parquet, .npy, .hdf5) to HopsFS. The next job in the
pipeline is the experimentation or training step. Here, the
data scientist designs (or evolves) a model architecture and
discovers good hyperparameters by running lots of parallel
experiments on GPUs. Similar to MLflow (Zaharia et al.,
2018), Hopsworks manages and archives ML experiments
for easy reproduction and archiving. The next stage is to
train a model with the chosen hyperparameters using lots of
GPUs and RingAll-Reduce for distributed training. The out-
put of the training step is a model, saved to HopsFS. Now,
the model can be analyzed and validated (such as using
TFX model analysis), and then deployed for serving using
TensorFlow Serving (with many options available, such as
A/B testing and number of replicated instances behind a
load-balancer). Finally, we will show client applications
using the deployed model, and a streaming job will be run
to monitor model predictions, notify of any anomalies, and
collect more training data (by storing predictions that are
later linked with outcomes).
2.2 Teaching Platform and Interactive Demo
Our research organization runs a managed version of
Hopsworks with over 1000 users. This managed platform,
see Figure 2, is also used for lab and project work in Deep
Learning and Big Data courses at a large technical university.
Hopsworks has a User Interface and strong security support,
built around TLS certificates for applications, users, and ser-
vices. Similar to Jupyter Hub, students can write interactive
programs in Jupyter notebooks. In contrast to JupyterHub,
users can run jobs that access up to 100s of TBs of data on
1000s of cores, and use 10s of GPUs for model experimen-
tation or training. Hopsworks also includes a project-based
multi-tenancy environment, so students can work in groups
on sandboxed data. Our managed platform provides many
popular datasets for machine learning, including Imagenet,
Wikipedia, Youtube-8M, and Kinetics.
In the interactive demo, participants will be invited to use
their laptops and an Internet connection to create accounts
on our managed platform. With an account, a user can
create a project, link in a large machine learning dataset, and
run pre-canned notebooks or end-to-end machine learning
pipelines. Participants will be given enough quota to use
100s of CPUs, and several GPUs, if they so choose.
2.3 Summary
We will demonstrate a data and ML platform, Hopsworks,
that supports the design, debugging, and running of deep
learning pipelines at scale. All stages of the pipeline can
be scaled out horizontally, as the platform manages the
whole stack from resource management (YARN with GPU
support) to API support for running distributed training and
hyperparameter optimization experiments.
ACKNOWLEDGMENT
This work was funded by the Swedish Foundation for Strate-
gic Research projects “Smart Intra-body network, RIT15-
0119” and “Continuous Deep Analytics, BD15-0006”, and
by the EU Horizon 2020 project AEGIS under Grant Agree-
ment no. 732189.
REFERENCES
Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y.,
Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., et al.
Tfx: A tensorflow-based production-scale machine learn-
ing platform. In Proceedings of the 23rd ACM SIGKDD
International Conference on Knowledge Discovery and
Data Mining, pp. 1387–1395. ACM, 2017.
Niazi, S., Ismail, M., Haridi, S., Dowling, J., Grohsschmiedt,
S., and Ronström, M. Hopsfs: Scaling hierarchical file
system metadata using newsql databases. In FAST, pp.
89–104, 2017.
Zaharia, M., Chen, A., Davidson, A., Ghodsi, A., Hong,
S. A., Konwinski, A., Murching, S., Nykodym, T.,
Ogilvie, P., Parkhe, M., et al. Accelerating the machine
learning lifecycle with mlflow. Data Engineering, pp. 39,
2018.

Weitere ähnliche Inhalte

Was ist angesagt?

Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastDatabricks
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...Yahoo Developer Network
 
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Databricks
 
Using Apache Spark for Predicting Degrading and Failing Parts in Aviation
Using Apache Spark for Predicting Degrading and Failing Parts in AviationUsing Apache Spark for Predicting Degrading and Failing Parts in Aviation
Using Apache Spark for Predicting Degrading and Failing Parts in AviationDatabricks
 
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Databricks
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentDatabricks
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceDatabricks
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development KitJen Aman
 
Erin LeDell, H2O.ai - Scalable Automatic Machine Learning - H2O World San Fra...
Erin LeDell, H2O.ai - Scalable Automatic Machine Learning - H2O World San Fra...Erin LeDell, H2O.ai - Scalable Automatic Machine Learning - H2O World San Fra...
Erin LeDell, H2O.ai - Scalable Automatic Machine Learning - H2O World San Fra...Sri Ambati
 
Hopsworks - The Platform for Data-Intensive AI
Hopsworks - The Platform for Data-Intensive AIHopsworks - The Platform for Data-Intensive AI
Hopsworks - The Platform for Data-Intensive AIQAware GmbH
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Databricks
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkDatabricks
 
Operationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriOperationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriDatabricks
 
Use MLflow to manage and deploy Machine Learning model on Spark
Use MLflow to manage and deploy Machine Learning model on Spark Use MLflow to manage and deploy Machine Learning model on Spark
Use MLflow to manage and deploy Machine Learning model on Spark Herman Wu
 
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Databricks
 

Was ist angesagt? (20)

Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and Feast
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
 
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
 
Using Apache Spark for Predicting Degrading and Failing Parts in Aviation
Using Apache Spark for Predicting Degrading and Failing Parts in AviationUsing Apache Spark for Predicting Degrading and Failing Parts in Aviation
Using Apache Spark for Predicting Degrading and Failing Parts in Aviation
 
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Erin LeDell, H2O.ai - Scalable Automatic Machine Learning - H2O World San Fra...
Erin LeDell, H2O.ai - Scalable Automatic Machine Learning - H2O World San Fra...Erin LeDell, H2O.ai - Scalable Automatic Machine Learning - H2O World San Fra...
Erin LeDell, H2O.ai - Scalable Automatic Machine Learning - H2O World San Fra...
 
Hopsworks - The Platform for Data-Intensive AI
Hopsworks - The Platform for Data-Intensive AIHopsworks - The Platform for Data-Intensive AI
Hopsworks - The Platform for Data-Intensive AI
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
 
Operationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriOperationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer Nori
 
Use MLflow to manage and deploy Machine Learning model on Spark
Use MLflow to manage and deploy Machine Learning model on Spark Use MLflow to manage and deploy Machine Learning model on Spark
Use MLflow to manage and deploy Machine Learning model on Spark
 
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
 

Ähnlich wie Horizontally Scalable ML Pipelines

Deploying and Managing Artificial Intelligence Services using the Open Data H...
Deploying and Managing Artificial Intelligence Services using the Open Data H...Deploying and Managing Artificial Intelligence Services using the Open Data H...
Deploying and Managing Artificial Intelligence Services using the Open Data H...Orgad Kimchi
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...Big Data Value Association
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes Tushar Katarki
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine LearningVasu S
 
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...Abhinav Joshi
 
MLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to productionMLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to productionFabian Hadiji
 
On premise ai platform - from dc to edge
On premise ai platform - from dc to edgeOn premise ai platform - from dc to edge
On premise ai platform - from dc to edgeConference Papers
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...OpenStack
 
Building ML Pipelines with DCOS
Building ML Pipelines with DCOSBuilding ML Pipelines with DCOS
Building ML Pipelines with DCOSQAware GmbH
 
MLflow with Databricks
MLflow with DatabricksMLflow with Databricks
MLflow with DatabricksLiangjun Jiang
 
Mlflow with databricks
Mlflow with databricksMlflow with databricks
Mlflow with databricksLiangjun Jiang
 
Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsGianmario Spacagna
 
1645 goldenberg using our laptop
1645 goldenberg using our laptop1645 goldenberg using our laptop
1645 goldenberg using our laptopRising Media, Inc.
 
End-to-End ML pipelines with Beam, Flink, TensorFlow and Hopsworks.
End-to-End ML pipelines with Beam, Flink, TensorFlow and Hopsworks.End-to-End ML pipelines with Beam, Flink, TensorFlow and Hopsworks.
End-to-End ML pipelines with Beam, Flink, TensorFlow and Hopsworks.Theofilos Kakantousis
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLNordic APIs
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopExtremeEarth
 
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Luciano Resende
 
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...EUDAT
 

Ähnlich wie Horizontally Scalable ML Pipelines (20)

Deploying and Managing Artificial Intelligence Services using the Open Data H...
Deploying and Managing Artificial Intelligence Services using the Open Data H...Deploying and Managing Artificial Intelligence Services using the Open Data H...
Deploying and Managing Artificial Intelligence Services using the Open Data H...
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine Learning
 
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
 
MLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to productionMLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to production
 
On premise ai platform - from dc to edge
On premise ai platform - from dc to edgeOn premise ai platform - from dc to edge
On premise ai platform - from dc to edge
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
 
Building ML Pipelines with DCOS
Building ML Pipelines with DCOSBuilding ML Pipelines with DCOS
Building ML Pipelines with DCOS
 
MLflow with Databricks
MLflow with DatabricksMLflow with Databricks
MLflow with Databricks
 
Mlflow with databricks
Mlflow with databricksMlflow with databricks
Mlflow with databricks
 
Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning products
 
1645 goldenberg using our laptop
1645 goldenberg using our laptop1645 goldenberg using our laptop
1645 goldenberg using our laptop
 
End-to-End ML pipelines with Beam, Flink, TensorFlow and Hopsworks.
End-to-End ML pipelines with Beam, Flink, TensorFlow and Hopsworks.End-to-End ML pipelines with Beam, Flink, TensorFlow and Hopsworks.
End-to-End ML pipelines with Beam, Flink, TensorFlow and Hopsworks.
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open Workshop
 
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
 
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
 

Kürzlich hochgeladen

VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 

Kürzlich hochgeladen (20)

VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 

Horizontally Scalable ML Pipelines

  • 1. HORIZONTALLY SCALABLE ML PIPELINES WITH A FEATURE STORE Alexandru A. Ormenis , an 1 Mahmoud Ismail 1 Kim Hammar 2 Robin Andersson 2 Ermias Gebremeskel 2 Theofilos Kakantousis 2 Antonios Kouzoupis 2 Fabio Buso 2 Gautier Berthou 2 3 Jim Dowling 1 2 Seif Haridi 1 3 ABSTRACT Machine Learning (ML) pipelines are the fundamental building block for productionizing ML models. However, much introductory material for machine learning and deep learning emphasizes ad-hoc feature engineering and training pipelines to experiment with ML models. Such pipelines have a tendency to become complex over time and do not allow features to be easily re-used across different pipelines. Duplicating features can even lead to correctness problems when features have different implementations for training and serving. In this demo, we introduce the Feature Store as a new data layer in horizontally scalable machine learning pipelines. 1 INTRODUCTION In this demonstration, we introduce an open-source data platform with support for machine learning (ML) pipelines, Hopsworks, where every stage in the pipeline can be scaled out to potentially hundreds of servers, enabling faster train- ing of larger models with more data. An end-to-end machine learning pipeline covers all stages in the production and op- eration of ML models, from data ingestion and preparation, to experimentation, to training, to deployment and operation. Our demonstration will have the added twist of introducing a new data management layer, the Feature Store. We will show how the Feature Store simplifies such pipelines, pro- viding an API for data engineers to produce features and an API with which data scientists can easily select features when designing new models. Our Feature Store is based on open-source frameworks, see Figure 3, using Apache Spark to compute features, Apache Hive to store feature data, and MySQL Cluster (NDB) for feature metadata. Our Feature Store also supports generating training data, from selected features, in native file formats for TensorFlow, Keras, and PyTorch. Our Feature Store is integrated in a multi-tenant security model that enables the governance and reuse of features as first-class entities within an organization. We also provide Python API support for both the Feature Store and for building pipelines in our framework. APIs ease the programmer’s burden when discovering and using services such as Kafka, the Feature Store, our REST API, and model serving, as well as when performing distributed model train- ing and hyperparameter optimization. 1 KTH Royal Institute of Technology, Sweden 2 Logical Clocks AB, Sweden 3 RISE AB, Sweden. Correspondence to: Alexandru A. Ormenis , an <aaor@kth.se>. Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s). Figure 1. End-to-End ML Pipeline, orchestrated by Airflow 2 END-TO-END ML PIPELINES In ML pipelines on Hopsworks, see Figure 1, horizontal scalability is provided using Apache Spark for feature engi- neering, Apache Hive for storing feature data, HopsFS (Ni- azi et al., 2017) (a next generation HDFS) as the distributed filesystem for training data, Ring-AllReduce using Tensor- Flow for distributed training, and Kubernetes to serve mod- els. We also support Kafka for storing logs for ML models served from Kubernetes, and Spark streaming applications for processing those logs in near-realtime. Distributed train- ing and hyperparameter optimization with TensorFlow and PyTorch use PySpark to distribute computations, replicated conda environments to ensure Python libraries are available at all hosts, and YARN to allocate GPUs, CPUs, and mem- ory to applications. Our pipelines are driven and managed by a scalable, consistent, distributed shared memory service built on MySQL Cluster. 2.1 Demonstration In our live demonstration, we will use a managed version of Hopsworks running from a web browser at www.hops.site. We will start with an Airflow job orchestrating an entire ML pipeline that consists of a series of different jobs, all im- plemented in Jupyter notebooks and invoked from Airflow by an operator that calls the REST API to Hopsworks, see
  • 2. Horizontally Scalable ML Pipelines with a Feature Store Figure 2. Hopsworks Data and ML Platform Architecture Figure 3. Feature Store Architecture Figure 2. We will then go through the Jupyter notebooks in- dividually. We will start with discovering and downloading a ML dataset to work with - Hopsworks supports peer-to-peer sharing of large datasets. We will then write/run a Spark job to engineer features from the dataset. We also support Beam/Flink as jobs, so we can show an additional data vali- dation job using TensorFlow Extended (TFX) (Baylor et al., 2017). After the feature engineering stage, the features will be published to our Feature Store. The next stage is where a Data Scientist will write a notebook to select the features that will be used to train a model. This job will out- put training data in a chosen file format (such as .tfrecords, .csv, .parquet, .npy, .hdf5) to HopsFS. The next job in the pipeline is the experimentation or training step. Here, the data scientist designs (or evolves) a model architecture and discovers good hyperparameters by running lots of parallel experiments on GPUs. Similar to MLflow (Zaharia et al., 2018), Hopsworks manages and archives ML experiments for easy reproduction and archiving. The next stage is to train a model with the chosen hyperparameters using lots of GPUs and RingAll-Reduce for distributed training. The out- put of the training step is a model, saved to HopsFS. Now, the model can be analyzed and validated (such as using TFX model analysis), and then deployed for serving using TensorFlow Serving (with many options available, such as A/B testing and number of replicated instances behind a load-balancer). Finally, we will show client applications using the deployed model, and a streaming job will be run to monitor model predictions, notify of any anomalies, and collect more training data (by storing predictions that are later linked with outcomes). 2.2 Teaching Platform and Interactive Demo Our research organization runs a managed version of Hopsworks with over 1000 users. This managed platform, see Figure 2, is also used for lab and project work in Deep Learning and Big Data courses at a large technical university. Hopsworks has a User Interface and strong security support, built around TLS certificates for applications, users, and ser- vices. Similar to Jupyter Hub, students can write interactive programs in Jupyter notebooks. In contrast to JupyterHub, users can run jobs that access up to 100s of TBs of data on 1000s of cores, and use 10s of GPUs for model experimen- tation or training. Hopsworks also includes a project-based multi-tenancy environment, so students can work in groups on sandboxed data. Our managed platform provides many popular datasets for machine learning, including Imagenet, Wikipedia, Youtube-8M, and Kinetics. In the interactive demo, participants will be invited to use their laptops and an Internet connection to create accounts on our managed platform. With an account, a user can create a project, link in a large machine learning dataset, and run pre-canned notebooks or end-to-end machine learning pipelines. Participants will be given enough quota to use 100s of CPUs, and several GPUs, if they so choose. 2.3 Summary We will demonstrate a data and ML platform, Hopsworks, that supports the design, debugging, and running of deep learning pipelines at scale. All stages of the pipeline can be scaled out horizontally, as the platform manages the whole stack from resource management (YARN with GPU support) to API support for running distributed training and hyperparameter optimization experiments. ACKNOWLEDGMENT This work was funded by the Swedish Foundation for Strate- gic Research projects “Smart Intra-body network, RIT15- 0119” and “Continuous Deep Analytics, BD15-0006”, and by the EU Horizon 2020 project AEGIS under Grant Agree- ment no. 732189. REFERENCES Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., et al. Tfx: A tensorflow-based production-scale machine learn- ing platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1387–1395. ACM, 2017. Niazi, S., Ismail, M., Haridi, S., Dowling, J., Grohsschmiedt, S., and Ronström, M. Hopsfs: Scaling hierarchical file system metadata using newsql databases. In FAST, pp. 89–104, 2017. Zaharia, M., Chen, A., Davidson, A., Ghodsi, A., Hong, S. A., Konwinski, A., Murching, S., Nykodym, T., Ogilvie, P., Parkhe, M., et al. Accelerating the machine learning lifecycle with mlflow. Data Engineering, pp. 39, 2018.