CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks


Data and ML projects bring complexities beyond the traditional software development lifecycle. Unlike conventional software projects, they cannot be left alone once they are delivered and deployed: they must be monitored continuously to confirm that model performance still satisfies all requirements. New data with different statistical characteristics can arrive at any time and break pipelines or degrade model performance. These properties make continuous testing and monitoring of models and pipelines a necessity.
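
As a rough illustration of what such monitoring can look like, here is a minimal sketch of a check that flags a new batch whose mean has moved far away from a reference sample. The function names, the threshold of 3 standard deviations, and the sample values are assumptions made for this example; they do not come from the talk, and real drift monitoring usually relies on more robust statistics.

```python
from statistics import mean, stdev


def mean_shift_in_std_units(reference, new_batch):
    """Distance between the two sample means, in reference standard deviations."""
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference sample has zero variance")
    return abs(mean(new_batch) - mean(reference)) / ref_std


def has_drifted(reference, new_batch, threshold=3.0):
    """Hypothetical rule: flag drift when the mean moves more than `threshold` std units."""
    return mean_shift_in_std_units(reference, new_batch) > threshold


# Illustrative values only
reference = [10.0, 11.0, 9.0, 10.5, 9.5]
similar_batch = [10.2, 9.8, 10.1]
shifted_batch = [25.0, 26.0, 24.0]

print(has_drifted(reference, similar_batch))
print(has_drifted(reference, shifted_batch))
```

A check like this could run as one of the automated tests, so that a shifted input distribution fails the build instead of silently degrading the model.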


CI/CD Templates: Continuous
Delivery of ML-Enabled Data
Pipelines on Databricks
Michael Shtelma, Sr. Solutions Architect
Ivan Trusov, Solutions Architect

Agenda
● The challenges of implementing CI/CD for ML pipelines
● The CI/CD challenges forcing ML teams to choose between Databricks notebooks and local IDEs
● Introducing DatabricksLabs CI/CD Templates
● How CI/CD Templates solves ML team production challenges
● Demo and next steps

Problem:
Organizations struggle to get the business to start using their models to drive additional revenue.
Cause:
Because of the complexity of the ML lifecycle, only a few models end up in production and drive additional revenue for the business. Most are either delayed or discontinued at various stages of the ML project.
It is challenging for organizations to gain value from ML due to the complexity of the ML lifecycle.

What challenges do ML teams face when they try to implement CD4ML?

ML teams struggle to combine traditional CI/CD tools with Databricks notebooks
1. Benefits of Databricks notebooks
● Easy to use
● Scalable
● Provide access to ML tools such as MLflow for model logging and serving
2. Challenges
● Non-trivial to hook into traditional software development tools such as CI tools or local IDEs
3. Result
Teams find themselves choosing between
● using traditional IDE-based workflows but struggling to test and deploy at scale, or
● using Databricks notebooks or other cloud notebooks but struggling to ensure testing and deployment reliability via CI/CD pipelines.

CI/CD Templates gives you the benefits of traditional CI/CD workflows and the scale of Databricks clusters
CI/CD Templates allows you to
● create a production pipeline from a template in a few steps
● that automatically hooks into GitHub Actions,
● runs tests and deployments on Databricks upon git commit or any other trigger you define, and
● gives you a test success status directly in GitHub, so you know whether your commit broke the build
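
The pass/fail status next to a commit ultimately comes from the exit code of the test step: CI systems such as GitHub Actions mark a step as failed when its command exits with a non-zero code. The following is a minimal sketch of that mechanism using Python's standard `unittest` module. The `normalize` function and the test case are hypothetical placeholders, and this is not the runner that CI/CD Templates itself uses.

```python
import unittest


def normalize(values):
    """Hypothetical pipeline step: scale values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


class NormalizeTest(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1, 2, 3])), 1.0)


def run_suite():
    """Run the tests and return a process-style exit code (0 = success, 1 = failure)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return 0 if result.wasSuccessful() else 1


exit_code = run_suite()
print("exit code:", exit_code)
```

In a CI job the script would end with `sys.exit(exit_code)`, which is what turns a failing test into a failed commit status.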

A scalable CI/CD pipeline in 5 easy steps
1. Install and customize with a single command.
2. Create a new GitHub repo containing your Databricks host and token secrets.
3. Initialize git in your repo and commit the code.
4. Push your new CI/CD Templates project to the repo. Your tests will start running automatically on Databricks. Depending on whether they succeed or fail, you will get a green checkmark or a red X next to your commit.
5. You’re done! You now have a fully scalable CI/CD pipeline.
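
Step 2 relies on the Databricks host and token being available to the CI job as secrets. A minimal sketch of a pre-flight check is shown below, assuming the secrets are exposed to the job as the environment variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`. The slides only mention "host and token secrets", so these variable names and the helper functions are assumptions for illustration.

```python
import os

# Assumed names: the slides do not say how the secrets are exposed to the job.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_credentials(env):
    """Return the required variable names that are absent or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def describe(env=None):
    """Build a short, human-readable status message without printing secret values."""
    env = os.environ if env is None else env
    missing = missing_credentials(env)
    if missing:
        return "missing: " + ", ".join(missing)
    return "all credentials present"


# Illustrative environment with a placeholder host and no token
print(describe({"DATABRICKS_HOST": "https://example.invalid"}))
```

Failing fast on missing secrets gives a clearer error than a test run that breaks halfway through with an authentication failure.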

Project structure
1. Python package where the logic of the project is developed. Your models and pipelines live here.
2. Configuration where you define the Databricks jobs that run the pipelines developed in the Python package.
3. Tests directory where local unit tests and integration tests are developed.
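
To make the split between package code and tests concrete, here is a minimal sketch that places a piece of pipeline logic and its local unit test side by side. In a real project they would live in separate files; the file names in the comments, the `drop_incomplete_rows` function, and the sample rows are hypothetical and are not taken from the template. The job configuration (item 2) is left out because the slides do not show its format.

```python
import unittest


# Would live in the Python package, e.g. a hypothetical my_project/pipeline.py
def drop_incomplete_rows(rows, required_keys):
    """Keep only the rows in which every required key has a non-None value."""
    return [
        row for row in rows
        if all(row.get(key) is not None for key in required_keys)
    ]


# Would live in the tests directory, e.g. a hypothetical tests/unit/test_pipeline.py
class DropIncompleteRowsTest(unittest.TestCase):
    def test_removes_rows_with_missing_label(self):
        rows = [{"id": 1, "label": 0}, {"id": 2, "label": None}]
        cleaned = drop_incomplete_rows(rows, ["id", "label"])
        self.assertEqual(cleaned, [{"id": 1, "label": 0}])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DropIncompleteRowsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("tests passed:", result.wasSuccessful())
```

Keeping the logic in plain functions like this is what allows the same code to be unit-tested locally and then run at scale as a job.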

CI/CD Templates executes tests and deployments directly on Databricks while storing packages, logged models, and other artifacts in MLflow

CI/CD Templates - now powered by dbx
With dbx you can:
● customize the project structure and specify it during deployments
● use new CI tools easily (PRs are welcome!)
● run custom data pipelines directly from the IDE on interactive clusters

Summary
● The challenges of implementing CD4ML
● The CI/CD challenges forcing ML teams to choose between Databricks notebooks and local IDEs
● Introducing DatabricksLabs CI/CD Templates
● How CI/CD Templates solves ML team production challenges
Next Steps
Search for DatabricksLabs cicd-templates or go directly to https://github.com/databrickslabs/cicd-templates to get started.
michael.shtelma@databricks.com
ivan.trusov@databricks.com

Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

6. What’s the solution?

12. Push Flow

13. Release Flow

14. Demo: CI/CD Templates
